In Qt, how do I convert the Unicode code point U+1F64B to a QString holding its equivalent character "🙋"?


Background:

I am making a hash that will allow you to lookup the description you see below by feeding it a QString containing its character.

[Image: character map example]

I got a full list of the relevant data, looking something like this:

QHash<QString, QString> lookupCharacterDescription;
...
lookupCharacterDescription.insert("003F","QUESTION MARK");
lookupCharacterDescription.insert("0040","COMMERCIAL AT");
lookupCharacterDescription.insert("0041","LATIN CAPITAL LETTER A");
lookupCharacterDescription.insert("0042","LATIN CAPITAL LETTER B");
...
lookupCharacterDescription.insert("1F648","SEE-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F649","HEAR-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64A","SPEAK-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64B","HAPPY PERSON RAISING ONE HAND");
...
lookupCharacterDescription.insert("FFFD","REPLACEMENT CHARACTER");
lookupCharacterDescription.insert("FFFE","<not a character>");
lookupCharacterDescription.insert("FFFF","<not a character>");
lookupCharacterDescription.insert("FFFFE","<not a character>");
lookupCharacterDescription.insert("FFFFF","<not a character>");

Now obviously a key like "1F64B" needs to be wrapped in something here. I have tried things like passing 0x1F64B as a QChar, but I am honestly groping in the dark. I could make it work for the lower values such as the Latin letters, but it fails for the five-digit code points.

Questions:

  • How do I classify 1F64B?
  • Is this considered UTF-32?
  • What can I wrap this value "1F64B" in to produce the QString("🙋")?
  • Will the wrappings also work for the lower values?

Answer:

When you write QString(0x1F64B), the call resolves to QString::QString(QChar ch). Since QChar is a 16-bit type, the value is truncated to 0xF64B, and you get a meaningless character because U+F64B is a private-use code point with no standard meaning. You will most likely get an out-of-range warning on that line. You can see the value F64B in the rendered character if you zoom in or use a hex editor. Since 0x1F64B cannot fit into a single 16-bit QChar and must be represented by a surrogate pair, you can't initialize the string that way.
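
The truncation can be reproduced without Qt. The following is a minimal sketch in standard C++; the helper names are hypothetical and only mimic what happens when a code point is forced into one 16-bit code unit. It also checks that the truncated value lands in the Private Use Area range U+E000..U+F8FF.

```cpp
#include <cstdint>

// Keep only the low 16 bits, as happens when a code point is
// squeezed into a single 16-bit code unit.
constexpr std::uint16_t truncateTo16Bits(std::uint32_t codePoint) {
    return static_cast<std::uint16_t>(codePoint & 0xFFFFu);
}

// True for code points in the BMP Private Use Area (U+E000..U+F8FF).
constexpr bool isBmpPrivateUse(std::uint32_t codePoint) {
    return codePoint >= 0xE000u && codePoint <= 0xF8FFu;
}

// True when a code point lies above U+FFFF and therefore needs
// two UTF-16 code units (a surrogate pair).
constexpr bool needsSurrogatePair(std::uint32_t codePoint) {
    return codePoint > 0xFFFFu && codePoint <= 0x10FFFFu;
}
```

With these helpers, truncateTo16Bits(0x1F64B) yields 0xF64B, which isBmpPrivateUse reports as private use, while needsSurrogatePair(0x1F64B) is true and needsSurrogatePair(0x0041) is false.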

On the other hand, QString("🙋") works since it constructs the string from another string. You must construct the string from a string like that, or manually from the UTF-8 or UTF-16 code units.

Is this considered UTF-32?

No. 1F64B is a Unicode code point, written U+1F64B; it is a number, not an encoding. UTF-32 is a Unicode encoding that uses 32-bit code units. You only have a QString and not a bare byte array, so you don't need to care about its underlying encoding (which is actually UTF-16).

What can I wrap this value "1F64B" in to produce the QString("🙋")?

You shouldn't deal with the numeric values as strings. Store them as a numeric type instead:

QHash<qint32, QString> lookupCharacterDescription;
lookupCharacterDescription.insert(0x1F64B, "HAPPY PERSON RAISING ONE HAND");

and then to make a string that contains the character at code point 0x1F64B use

uint cp = 0x1F64B;
QString mystr = QString::fromUcs4(&cp, 1);
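
For the lookup direction described in the question (character in, description out), the idea of a table keyed by code point can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: all function names are hypothetical, std::u16string stands in for QString's UTF-16 storage, and in Qt the corresponding steps would presumably be QString::toUInt(&ok, 16) for parsing the hex keys and QString::toUcs4() for decoding.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Parse a hex key such as "1F64B" into a numeric code point (hypothetical helper).
std::uint32_t parseHexCodePoint(const std::string& hex) {
    return static_cast<std::uint32_t>(std::stoul(hex, nullptr, 16));
}

// Decode the first code point of a UTF-16 string, joining a surrogate pair
// if one is present. Returns 0 for an empty string; an unpaired surrogate
// is returned unchanged.
std::uint32_t firstCodePoint(const std::u16string& text) {
    if (text.empty()) return 0;
    const std::uint32_t hi = text[0];
    if (hi >= 0xD800 && hi <= 0xDBFF && text.size() > 1) {
        const std::uint32_t lo = text[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF)
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
    return hi;
}

// Build a tiny description table keyed by code point (two sample rows only).
std::unordered_map<std::uint32_t, std::string> buildTable() {
    std::unordered_map<std::uint32_t, std::string> table;
    table[parseHexCodePoint("0041")] = "LATIN CAPITAL LETTER A";
    table[parseHexCodePoint("1F64B")] = "HAPPY PERSON RAISING ONE HAND";
    return table;
}

// Describe the first character of `text`, or return "" if it is not in the table.
std::string describe(const std::u16string& text) {
    static const std::unordered_map<std::uint32_t, std::string> table = buildTable();
    const auto it = table.find(firstCodePoint(text));
    return it == table.end() ? std::string() : it->second;
}
```

Keying the table by number means the same lookup works for both one-unit characters such as "A" and two-unit characters such as U+1F64B.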

Will the wrappings also work for the lower values?

Yes, since UCS-4, a.k.a. UTF-32, can represent every Unicode code point.

Alternatively you can construct the character from UTF-16 or UTF-8. U+1F64B is encoded in UTF-16 as D83D DE4B, or as F0 9F 99 8B in UTF-8, therefore you can use any of the below

// UTF-16: two code units (a surrogate pair)
const QChar utf16[2] = { QChar(0xD83D), QChar(0xDE4B) };
QString str1(utf16, 2);
// UTF-8: four bytes
const char utf8[] = "\xF0\x9F\x99\x8B";
QString str2 = QString::fromUtf8(utf8, 4);
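
The code units quoted above (D83D DE4B and F0 9F 99 8B) can be derived from the code point with a little bit arithmetic. The sketch below uses plain standard C++ with hypothetical function names, and assumes the input is a valid Unicode scalar value; it is meant to show where the numbers come from, not to replace Qt's own converters.

```cpp
#include <cstdint>
#include <string>

// Encode one code point as UTF-16 code units (assumes a valid scalar value).
std::u16string encodeUtf16(std::uint32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Encode one code point as UTF-8 bytes (assumes a valid scalar value).
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For U+1F64B, subtracting 0x10000 leaves 0xF64B; its top ten bits (0x3D) give the high surrogate 0xD83D and its low ten bits (0x24B) give the low surrogate 0xDE4B.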

If you want to include the string in its literal form in source code then either of the following will work

str1 = QString::fromWCharArray(L"\xD83D\xDE4B");
str2 = QString::fromUtf8("\xF0\x9F\x99\x8B");

If you have C++11 support then simply use the prefix u8, u and U for UTF-8, UTF-16 and UTF-32 respectively like

u8"🙋"
u"🙋"
U"🙋"
u8"\U0001F64B"
u"\U0001F64B"
u"\xD83D\xDE4B"
U"\U0001F64B"
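
As a quick check of what the compiler produces for these prefixes, the following standard C++ sketch counts the code units in each literal. The constant and function names are hypothetical. To get a QString from such literals, QString::fromUtf8, QString::fromUtf16 and QString::fromUcs4 would be the natural candidates, but which overloads accept char16_t and char32_t pointers depends on the Qt version, so treat that part as an assumption to verify against your Qt documentation.

```cpp
#include <cstddef>
#include <string>

// Each literal names U+1F64B by code point; the prefix decides the encoding.
// Code unit counts, excluding the terminating null:
constexpr std::size_t utf8Units  = sizeof(u8"\U0001F64B") - 1;
constexpr std::size_t utf16Units = sizeof(u"\U0001F64B") / sizeof(char16_t) - 1;
constexpr std::size_t utf32Units = sizeof(U"\U0001F64B") / sizeof(char32_t) - 1;

static_assert(utf8Units == 4, "UTF-8 needs four bytes for U+1F64B");
static_assert(utf16Units == 2, "UTF-16 needs a surrogate pair for U+1F64B");
static_assert(utf32Units == 1, "UTF-32 needs one code unit for U+1F64B");

// The same character as standard library strings.
std::u16string raisedHandUtf16() { return u"\U0001F64B"; }
std::u32string raisedHandUtf32() { return U"\U0001F64B"; }
```

Writing the character as \U0001F64B rather than pasting the glyph keeps the source file independent of the editor's and compiler's source encoding.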

Mandatory article to understand text and encodings: There Ain't No Such Thing as Plain Text