Displaying Unicode characters above U+FFFF on Windows


The application I'm developing with EVC++ 4 runs on Windows CE 5 and should support Unicode (AFAIK wchar_t uses UTF-16 on Windows, so I'm using that), so I want to be able to test it with "more exotic" characters, especially characters that take 4 bytes in UTF-16 rather than just 2. I'm therefore trying to display such characters in a text editor (at the moment on my desktop PC with Windows XP, not on the embedded device).

But I haven't managed to do so yet. As an example I've chosen this character. As mentioned here, "MPH 2B Damase" should support this character. So I downloaded the font and put it into Windows\Fonts. I created a text file using a hex editor (just to be sure) with the following content:

FFFE D802 DC00

When I open it with Notepad (which should be Unicode-capable, right?) and use the downloaded font, it doesn't display one character, as intended, but these two:

˘Ü

What am I doing wrong? :)

Thanks!

hrniels

Edit: Flipping the BOM, as suggested, doesn't work. Notepad (and all the other editors I tried) displays two squares in this case. Interestingly, if I copy the two squares here (with Firefox) I see the right character:


I've also tried it with Komodo Edit, with the same result.

Using UTF-8 doesn't help Notepad either.


4
Skurmedel On

Your text editor might not like UTF-16. It probably assumes ANSI or UTF-8.

Try typing in the UTF-8 equivalent instead:

0xF0 0x90 0xA0 0x80

This won't help your UTF-16 testing, but it will at least show whether your font is at fault. A text editor that does support UTF-16 is Komodo Edit.
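To double-check the four bytes above, here is a minimal sketch that combines the surrogate pair from the question (D802 DC00) into a code point and encodes it as UTF-8. The helper names `combine_surrogates` and `encode_utf8` are made up for illustration and are not part of any Windows API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo) {
    assert(hi >= 0xD800 && hi <= 0xDBFF);  // high (lead) surrogate
    assert(lo >= 0xDC00 && lo <= 0xDFFF);  // low (trail) surrogate
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::vector<std::uint8_t> encode_utf8(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid code points to encode.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

With these, `combine_surrogates(0xD802, 0xDC00)` gives 0x10800, and `encode_utf8(0x10800)` gives the bytes F0 90 A0 80.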

3
AudioBubble On

What happens if you put the byte order mark the other way around?

FEFF D802 DC00

(At the moment the BOM FF FE marks the file as little-endian, so the bytes D8 02 DC 00 are being interpreted as the two characters U+02D8 U+00DC. Hopefully flipping the BOM will cause the bytes to be read in the intended order.)
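The effect of the BOM on those bytes can be reproduced with a small decoder. This is only a sketch: `decode_utf16_with_bom` is a made-up helper, it requires a BOM, and it passes unpaired surrogates through unchanged instead of reporting an error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode a UTF-16 byte stream that starts with a BOM
// into code points. The BOM itself is not part of the result.
std::vector<std::uint32_t> decode_utf16_with_bom(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> out;
    if (bytes.size() < 2) return out;

    bool big_endian;
    if (bytes[0] == 0xFE && bytes[1] == 0xFF)      big_endian = true;   // FE FF
    else if (bytes[0] == 0xFF && bytes[1] == 0xFE) big_endian = false;  // FF FE
    else return out;  // no BOM: this sketch simply gives up

    // Read one 16-bit code unit starting at byte offset i.
    auto unit_at = [&](std::size_t i) -> std::uint16_t {
        assert(i + 1 < bytes.size());
        return static_cast<std::uint16_t>(
            big_endian ? (bytes[i] << 8) | bytes[i + 1]
                       : (bytes[i + 1] << 8) | bytes[i]);
    };

    for (std::size_t i = 2; i + 1 < bytes.size(); i += 2) {
        std::uint32_t u = unit_at(i);
        // A high surrogate followed by a low surrogate forms one code point.
        if (u >= 0xD800 && u <= 0xDBFF && i + 3 < bytes.size()) {
            std::uint32_t lo = unit_at(i + 2);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                out.push_back(0x10000u + ((u - 0xD800u) << 10) + (lo - 0xDC00u));
                i += 2;  // skip the low surrogate as well
                continue;
            }
        }
        out.push_back(u);
    }
    return out;
}
```

Fed the bytes FF FE D8 02 DC 00 from the question, this yields U+02D8 and U+00DC, the two characters Notepad showed. Both FE FF D8 02 DC 00 and FF FE 02 D8 00 DC yield the single code point U+10800.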

0
sorin On

You probably haven't read the _wfopen() documentation, which describes the encoding parameter. BTW, I assume you are already using Unicode (wchars).

I would recommend using UTF-8 in files, with or without a BOM, and forcing your fopen call to use the UTF-8 flag. It looks like _wfopen(L"newfile.txt", L"r, ccs=UTF-8"); will work with UTF-8 with or without a BOM and also with UTF-16. Do not make the mistake of using ccs=UNICODE: UTF-8 files without a BOM are common.
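The "BOM if present, otherwise UTF-8" policy can be sketched portably as follows. This is not how the CRT implements the ccs flag; `Encoding`, `Detected`, `detect_encoding` and `strip_bom` are hypothetical names used only to illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Encoding { Utf8, Utf16LE, Utf16BE };

struct Detected {
    Encoding    encoding;
    std::size_t bom_length;  // number of leading BOM bytes to skip (0 if none)
};

// Hypothetical helper: pick an encoding from the BOM, defaulting to UTF-8
// when no BOM is present (an assumption of this sketch, not a detection).
Detected detect_encoding(const std::vector<std::uint8_t>& b) {
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return {Encoding::Utf8, 3};
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return {Encoding::Utf16LE, 2};
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return {Encoding::Utf16BE, 2};
    return {Encoding::Utf8, 0};
}

// Hypothetical helper: return the payload bytes that follow the BOM, if any.
std::vector<std::uint8_t> strip_bom(const std::vector<std::uint8_t>& b) {
    const Detected d = detect_encoding(b);
    assert(d.bom_length <= b.size());
    return std::vector<std::uint8_t>(
        b.begin() + static_cast<std::ptrdiff_t>(d.bom_length), b.end());
}
```

A file that starts with EF BB BF or with no BOM at all is treated as UTF-8 here, while FF FE and FE FF select the two UTF-16 byte orders.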

You should really read a little about Unicode before trying to work with it. Think of this as a very good investment: it will save you time if you understand how Unicode works.

Here is a start: http://blog.i18n.ro/newbie-guide-to-unicode/ (and do not forget to read the links at the end of the article).

If you really need a simple text editor that allows you to play with Unicode encodings, use Notepad++ and forget about Notepad.