Converting from Windows MBCS to UTF-8


I have a very large (millions of lines) application that was developed using MBCS (code page 1252) and assumes all strings are char* with each character exactly one byte. We are now expanding our language set and need to move to Unicode. Since UTF-8 works in 1-byte increments, it seems like a good fit. As usual, we would like to make this change with a minimum of code changes. We would rather not change everything to wchar or _TCHAR, nor modify the way every source file is encoded, if we can help it.

The only way these foreign characters would be used is if the user entered them in a field, such as a name. Strings containing these characters are then saved to files as needed and are not manipulated. The files are read later and their contents displayed. Assuming that no characters outside of cp1252 (i.e. Chinese characters, etc.) are used in the source code itself, do we need to make changes to the majority of the source code, or can we leave it as char* and just let the possibly multi-byte characters pass through the system until they reach the UI, where they are displayed?

The application is developed on Visual Studio 2015 using MFC.


There are 2 best solutions below

Oracle provides a very detailed page on this topic (search for CP1252 on the page; all 'Character Sets' are listed at the bottom).

MBCS stands for Multi-Byte Character Set.

cp-1252 is not MBCS: it comprises the ASCII character set (128 symbols) extended with 128 more symbols, for a total of 256 symbols, each encoded in a single byte.

Since an MBCS encoding can use 1 or 2 bytes per symbol, MBCS includes cp-1252 (256 one-byte symbols) but can hold many more symbols than cp-1252.
See Microsoft's documentation about Unicode and MBCS.

If you have Python installed, you can see this clearly in the file your_path_to\Python27\Lib\encodings\cp1252.py: from 0x00 to 0xFF, one single byte per symbol (two hex digits), 256 symbols.

For internationalization in general, Microsoft's documentation may help.

UTF-8 is a good choice for encoding your data going forward. Support for it on Windows is getting better, but you would still want to convert your UTF-8 strings to and from strings of wchar_t (that is, UTF-16 on Windows) in order to use them with the Windows API. (There's limited support in Windows for reading and writing UTF-8 on the console using CP 65001, but your app is probably not console-mode.) You can do this with <codecvt> (std::codecvt_utf8 or std::codecvt_utf8_utf16), widen() and narrow() in Boost.Nowide, mbstowcs() in C, or various other libraries such as ICU or Qt.
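As one option among those listed, a minimal sketch of the UTF-8/UTF-16 round trip using <codecvt> might look like the following. The helper names are ours, not from any library; std::codecvt_utf8_utf16 is deprecated as of C++17 but still available, and on Windows you would often call MultiByteToWideChar()/WideCharToMultiByte() instead:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert between UTF-8 in std::string and UTF-16 code units in std::wstring.
// On Windows, wchar_t strings are UTF-16, which is what the Windows API expects.
std::wstring Utf8ToWide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on invalid UTF-8
}

std::string WideToUtf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(wide);
}
```

The idea is to keep char*/std::string (UTF-8) throughout the application and convert only at the Windows API boundary.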

UTF-8 support on Windows seems to be improving. There is even a ".utf8" or ".utf-8" locale in the latest Windows 10 release (RS4, the April 2018 Update). You still probably won't be able to rely on a UTF-8 locale in your apps for a long time if they have to run on older versions.

You’ll also need to be able to convert your legacy data to UTF-8, but the same libraries can handle that as well. For example, you can get a codecvt facet from a std::locale object initialized to the code page the data was saved in. Or just use a lookup table.
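The lookup-table approach is simple enough to sketch directly. Only bytes 0x80–0x9F of cp1252 differ from the corresponding Unicode code points (five of them are undefined; the table below passes those through as control codes of the same value), so a 32-entry table plus a small UTF-8 encoder covers the whole code page. The function name is ours:

```cpp
#include <cstdint>
#include <string>

// Unicode code points for cp1252 bytes 0x80-0x9F (per the standard
// windows-1252 mapping); all other bytes map to the identical code point.
static const uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

std::string Cp1252ToUtf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        uint32_t cp = (c >= 0x80 && c <= 0x9F) ? kCp1252High[c - 0x80] : c;
        if (cp < 0x80) {            // ASCII: 1 UTF-8 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 UTF-8 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 3 UTF-8 bytes (largest entry is U+2122)
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Since ASCII bytes pass through unchanged, legacy files that happen to contain only ASCII are already valid UTF-8 after this conversion.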

There’s not much reason to save your data in anything but UTF-8. UTF-16 takes up more space, isn’t even a fixed-width encoding, has endianness issues, and isn’t as widely used elsewhere.