I have a very large (millions of lines) application which was developed using MBCS (code page 1252) and assumes that all strings are `char*` and that each character is exactly one byte. We are now expanding our language set and need to move to Unicode. Since UTF-8 works in 1-byte units, it seems like a good fit. As usual, we would like to make this change with a minimal amount of code change. We would rather not change everything to `wchar_t` or `_TCHAR`, or modify the way every source file is encoded, if we can help it.
The only way these foreign characters would enter the system is if the user typed them into a field, such as a name. Strings containing these characters are then saved to files as needed and are not manipulated. The files are read later and the contents displayed. Assuming that no characters outside of cp1252 (i.e. Chinese characters, etc.) are used in the source code, do we need to make any changes to the majority of the source code, or can we leave it as `char*` and just let the possibly multi-byte characters pass through the system until they reach the UI, where they are displayed?
The application is developed on Visual Studio 2015 using MFC.
Oracle provides a very detailed page on the topic. (Search for `CP1252` on the page; all 'Character Sets' are listed at the bottom.)
`MBCS` stands for Multi-Byte Character Set. `cp-1252` is not `MBCS`: `cp-1252` encompasses the ASCII character set (128 symbols), extended with 128 more symbols, for 256 symbols in total, encoded on 1 byte per symbol.
As `MBCS` can hold 1 or 2 bytes per symbol, it includes `cp-1252` (256 1-byte symbols), but it holds many more symbols than `cp-1252`.
See Microsoft, about Unicode and MBCS.
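To make the one-byte-versus-multi-byte distinction concrete, here is a minimal sketch in portable C++. The helper name `utf8_code_points` is hypothetical and the code assumes well-formed UTF-8 input. It shows that once text is UTF-8, the byte length of a `char*` string (what `strlen` or `std::string::size` reports) is no longer the number of characters, which is the assumption a cp1252 code base tends to rely on.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (bytes of the form 10xxxxxx).
// Assumption: the input is well-formed UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {  // lead byte or plain ASCII byte
            ++count;
        }
    }
    return count;
}
```

For example, the UTF-8 string `"caf\xC3\xA9"` ("café") occupies 5 bytes but holds 4 code points, whereas the same word in cp1252 (`"caf\xE9"`) is 4 bytes for 4 characters. Code that only stores and forwards the bytes is unaffected; code that truncates, indexes, or measures by byte count is where the difference would show up.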
If you have Python installed, you can see it clearly inside the file `your_path_to\Python27\Lib\encodings\cp1252.py`: from `0x00` to `0xFF`, one single byte per symbol (2 * 4 bits), 256 symbols.
About internationalization, Microsoft's documentation may help.
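Because cp1252 is a fixed one-byte table, existing cp1252 data can be converted to UTF-8 one byte at a time. The following is only a sketch of that idea, not a drop-in for any library routine: the function name `cp1252_to_utf8` and the table name `kCp1252High` are hypothetical, and the five byte values that cp1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the same-numbered control code points, which is an assumption of this example.

```cpp
#include <cstdint>
#include <string>

// Code points for cp1252 bytes 0x80..0x9F. All other bytes map to the
// code point with the same numeric value.
// Assumption: the unassigned bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are
// mapped to the same-numbered C1 control code points.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical helper: converts a cp1252 byte string to UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        if (cp < 0x80) {
            // ASCII: one byte, unchanged.
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // Two-byte UTF-8 sequence.
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // Three-byte UTF-8 sequence (all remaining cp1252 symbols fit).
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch, the cp1252 byte `0xE9` ("é") becomes the two bytes `0xC3 0xA9`, and `0x80` (the euro sign) becomes the three bytes `0xE2 0x82 0xAC`; plain ASCII passes through unchanged.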