Chinese text encoding missing characters when viewed in web browser

I have an HTML file which contains Chinese text. When I open the file in any web browser, some characters appear to be missing.

Here's an example copied from the browser window:

本函旨在邀請您參�� 定於

I know for a fact that all other characters seen here are correct aside from the missing ones (confirmed by a native Chinese speaker).

In the HTML head, I have a tag which declares that the file contains UTF-8 encoded characters:

<META http-equiv="Content-Type" content="text/html; charset=utf-8">

I've already tried some other charsets in this META tag, but so far it seems any encoding method I try aside from UTF-8 ends up looking worse.

I also considered the possibility that it is a font issue, so I installed 3 different traditional Chinese fonts on my system and forced Chrome to use them. None of them made any difference - missing characters were still present.

If I open the HTML file with Notepad++, here's what I can see:

https://i.stack.imgur.com/Ex3C1.png

If I select and copy-paste this text into regular MS Notepad, I get this:

本函旨在邀請您參劦nbsp;定於

So you can see here that the "xE5 x8A" visible in Notepad++ seems to have been replaced by 劦.

Is there any reason why the browser would be showing �� instead of 劦 in this scenario?

Best answer

Look again at the HTML file.

I see the first 2 bytes of a 3-byte character encoded in UTF-8, followed by the literal text &nbsp;. Let's imagine that the third byte was originally \xA0, and that it was mutated to &nbsp; when the file was created, by applying global substitutions to the UTF-8-encoded data.
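To make that hypothesis concrete, here is a minimal Python sketch of how such a substitution could damage the file. The sample string and the exact form of the substitution are assumptions for illustration; the tool that actually produced the file is unknown.

```python
# Assumed original text: 參 followed by the character U+52A0,
# whose UTF-8 encoding is E5 8A A0.
original = "\u53c3\u52a0"
utf8_bytes = original.encode("utf-8")

# Hypothetical global substitution: every 0xA0 byte (a no-break space in
# Latin-1) is replaced with the entity "&nbsp;", without regard to the
# fact that the data is really multi-byte UTF-8.
corrupted = utf8_bytes.replace(b"\xa0", b"&nbsp;")

print(utf8_bytes)  # the intact 6-byte sequence
print(corrupted)   # the last character has lost its third byte
```

After the substitution, the bytes E5 8A are left without their trailing byte, which is the pattern visible in the Notepad++ screenshot.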

However, \xE5\x8A\xA0 decodes as UTF-8 to U+52A0 (加), which is not the same as the alien character 劦, which is U+52A6. That is close, but not close enough to be a complete answer.
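One possible way to account for the remaining difference is sketched below in Python. It assumes, without confirmation, that Notepad's decoder is lenient and takes the low 6 bits of whatever byte follows \xE5\x8A, even when that byte is the "&" (0x26) of &nbsp; rather than a valid continuation byte. The helper decode_three_byte is hypothetical and only reproduces the bit arithmetic.

```python
def decode_three_byte(b0, b1, b2):
    """Combine the payload bits of a 3-byte UTF-8 sequence.

    Deliberately does NOT check that b2 is a valid continuation byte,
    to imitate a lenient decoder (an assumption, not confirmed behaviour).
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# Strict decoding of the hypothesised original bytes.
strict = b"\xe5\x8a\xa0".decode("utf-8")
print(f"U+{ord(strict):04X}")  # U+52A0

# Lenient decoding of the corrupted bytes: E5 8A followed by "&".
lenient = decode_three_byte(0xE5, 0x8A, ord("&"))
print(f"U+{lenient:04X}")  # U+52A6
```

Under that assumption the "&" would be consumed as part of the character, which would also be consistent with the pasted text reading 劦nbsp; with no ampersand. A strict decoder, such as the one in a browser, rejects the truncated sequence and shows replacement characters instead.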