Why does an emoji have two different UTF-8 encodings? How do I convert emoji from UTF-8 using NSString on iOS?

We have found that some emoji appear to have two UTF-8 encodings, for example:

emoji   unicode    utf-8               another utf-8

😁      U+1F601    \xf0\x9f\x98\x81    \xed\xa0\xbd\xed\xb8\x81

But iOS can't decode the second byte sequence, so I get an error when I decode the string as UTF-8.

ios code

In all the documents I found, only one UTF-8 encoding is listed for each emoji; the other one appears nowhere.

The documents I referenced include:

emoji code link

whole utf-8 code link

However, the web tool bianma converts both byte sequences into the emoji correctly.

input code

output

So, my questions are:

Why are there two UTF-8 encodings for one emoji?

Is there a document that lists both encodings?

How do I correctly convert a string from UTF-8 using NSString on iOS?

asked Dec 22 '15 at 5:34

pinchwang


This had me intrigued, as my first thought was that the long UTF-8 representation was two UTF-8 blocks. It turns out that there are two variations of UTF-8, CESU-8 and Modified UTF-8, which encode supplementary characters UTF-16 style: as a surrogate pair, with each surrogate encoded separately. You may be able to use this article iphonedevsdk.com/forum/iphone-sdk-development/… to write a decoder if there's no suitable iOS/Objective-C native decoder.

– Alastair McCormack
Dec 22 '15 at 11:33

@AlastairMcCormack That's the answer I think. You should post that as an answer.

– roeland
Dec 22 '15 at 22:23

@user692793 Please never post text as images, especially not code or output.

– roeland
Dec 22 '15 at 22:24

Thanks @roeland. I think a proper answer should contain some working code, but as I'm not an Objective-C coder I'll leave it to someone else to pick up the glory :)

– Alastair McCormack
Dec 22 '15 at 22:26

ios unicode utf-8 nsstring emoji

2 Answers

0xF0, 0x9F, 0x98, 0x81

Is the correct UTF-8 encoding for U+1F601 😁.

0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x81

Is not a valid UTF-8 sequence(*). It should really be rejected; iOS is correct to do so.

This is a bug in the bianma tool: the convertUtf8BytesToUnicodeCodePoints function is more lenient about what input it accepts than the specified algorithm in eg RFC 3629.

This happens to return a working string only because the tool is written in JavaScript. Having decoded the above byte sequence to the bogus surrogate code point sequence U+D83D,U+DE01, it then converts that into a JavaScript string using a direct code-point-to-code-unit mapping, giving \uD83D\uDE01. As this is the correct way to encode 😁 in a UTF-16 string, it appears to have worked.

(*: It is a valid CESU-8 sequence, but that encoding is just “bogus broken encoding for compatibility with badly-written historical tools” and should generally be avoided.)

You should not usually encounter a sequence like this; it is typically not worth catering for unless you have a specific source of this kind of malformed data which you don't have the power to get fixed.

edited Dec 22 '15 at 23:08

answered Dec 22 '15 at 23:03

bobince

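To see where the two byte sequences in the question come from, here is a small sketch in Python (chosen only for illustration; the helper names `utf16_surrogates`, `cesu8_bytes` and `is_strict_utf8` are made up for this example and are not part of any library). It encodes U+1F601 directly as UTF-8, then splits it into a UTF-16 surrogate pair and encodes each surrogate on its own, which is what a CESU-8 style encoder does. It also shows that a strict UTF-8 decoder rejects the second form, much as NSString does.

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def cesu8_bytes(code_point: int) -> bytes:
    """Encode each surrogate as its own 3-byte sequence, as a CESU-8 encoder would."""
    high, low = utf16_surrogates(code_point)
    # "surrogatepass" makes Python's codec emit surrogates, which strict UTF-8 forbids.
    return (chr(high) + chr(low)).encode("utf-8", "surrogatepass")


def is_strict_utf8(data: bytes) -> bool:
    """Return True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


GRIN = 0x1F601
print(chr(GRIN).encode("utf-8").hex(" "))         # f0 9f 98 81
print([hex(s) for s in utf16_surrogates(GRIN)])   # ['0xd83d', '0xde01']
print(cesu8_bytes(GRIN).hex(" "))                 # ed a0 bd ed b8 81
print(is_strict_utf8(cesu8_bytes(GRIN)))          # False
```

The six-byte form is therefore just the UTF-16 code units U+D83D and U+DE01 pushed through the three-byte UTF-8 bit layout one at a time.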

Thank you very much for the answer. We read string data from our server, which is written in C++; the issue occurs after the server converts a Unicode string to UTF-8. One more thing to mention: when our client receives the data as a string value cstr, printf("%s", cstr) prints it correctly. But when we convert the string to an NSString with NSString *ocstr = [[NSString alloc] initWithBytes:cstr.c_str() length:cstr.length() encoding:NSUTF8StringEncoding]; ocstr is nil. Why does Apple not support CESU-8 sequences? Is there a function that resolves the issue?

– pinchwang
Dec 23 '15 at 2:12

I would first look at the C++ server UTF-8 encoder, to see if it can be fixed properly at source. CESU-8 is considered an undesirable anomaly that you'd never deliberately want to use; most systems don't support it. If you have to accept it you'll need to write your own CESU-8 decoder walking through the input byte array (or use an existing library, eg ICU though that would be a really heavy dependency just for this).

– bobince
Dec 23 '15 at 11:36

Just as a side note, there is one particularly bothersome source of encoding like this: JNI (Java Native Interface). If you attempt to retrieve "UTF-8" bytes from a Java string you will receive the "modified UTF-8" variant. That is a rather large source of malformed data that cannot be fixed, unfortunately.

– borrrden
Jul 12 '18 at 17:02
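If the source really cannot be fixed, the "walk through the input byte array" approach suggested in the comments above could look roughly like the sketch below. It is written in Python purely to show the logic; the function name `cesu8_to_utf8` is hypothetical, and a C or Objective-C port run before `initWithBytes:length:encoding:` would need to follow the same steps. It assumes the input is otherwise well-formed UTF-8 and only rewrites six-byte surrogate pairs into the four-byte form.

```python
def cesu8_to_utf8(data: bytes) -> bytes:
    """Rewrite CESU-8 surrogate pairs (6 bytes) as 4-byte UTF-8; copy all other bytes unchanged."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # A high surrogate U+D800..U+DBFF is encoded as ED A0..AF 80..BF,
        # a low surrogate U+DC00..U+DFFF as ED B0..BF 80..BF.
        if (i + 5 < n
                and data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xAF
                and 0x80 <= data[i + 2] <= 0xBF
                and data[i + 3] == 0xED and 0xB0 <= data[i + 4] <= 0xBF
                and 0x80 <= data[i + 5] <= 0xBF):
            # Undo the 3-byte layout to recover each 16-bit surrogate.
            high = 0xD000 | ((data[i + 1] & 0x3F) << 6) | (data[i + 2] & 0x3F)
            low = 0xD000 | ((data[i + 4] & 0x3F) << 6) | (data[i + 5] & 0x3F)
            # Combine the pair into one code point and re-encode it as real UTF-8.
            code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            out += chr(code_point).encode("utf-8")
            i += 6
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


fixed = cesu8_to_utf8(b"\xed\xa0\xbd\xed\xb8\x81")
print(fixed.hex(" "))  # f0 9f 98 81
```

Unpaired surrogates are deliberately left untouched, so a strict UTF-8 decoder will still reject them afterwards. Java's Modified UTF-8 additionally encodes U+0000 as the two bytes C0 80, which this sketch does not handle.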


This worked for me in php to send a message with emoji to telegram bot:

$message_text = " \xf0\x9f\x98\x81 ";

answered Jun 12 '18 at 9:41

Polina
