Within Windows MBCS, what languages have 2 byte characters and what characters are they?


I have a legacy application that uses Windows' old MBCS. The software is international and relies on code pages to work in other languages. I've read that Chinese contains multibyte characters. My question is: which characters are they, and how do I generate them on a computer in the USA? I need this for testing.


There are 2 best solutions below

0
On

I think the MBCS encoding differs between Japan, China, and Korea, since it depends on each country's language. Each one is available in the Windows version for that country (for example, Windows 7 or XP). You should change the language option in the Control Panel.

2
On

What you should be writing nowadays are Unicode applications, which don't have to worry about MBCS encodings. I mean, sure, there are Unicode characters that use variable length encodings, like surrogates in UTF-16, but you shouldn't have to do anything special to make these work. If you want to test them with your app, just look up a table of Unicode characters on the web.
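As a small illustration of the variable-length point, here is a sketch that counts how many UTF-16 code units a character needs. The helper name `utf16_code_units` is made up for this example; it relies only on Python's built-in `utf-16-le` codec.

```python
def utf16_code_units(ch: str) -> int:
    """Return the number of 16-bit code units needed to encode ch in UTF-16."""
    # Each UTF-16 code unit is 2 bytes; characters outside the Basic
    # Multilingual Plane are encoded as a surrogate pair (2 code units).
    return len(ch.encode("utf-16-le")) // 2


if __name__ == "__main__":
    for ch in ("A", "\u4e2d", "\U0001F600"):
        print(f"U+{ord(ch):04X} needs {utf16_code_units(ch)} code unit(s)")
```

A character such as U+1F600 is a convenient test input if you want to check how an application handles surrogates.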

In your case, you're actually working with a legacy non-Unicode application. These use the default system codepage. The only multi-byte character sets (MBCS) supported by legacy Windows applications are double-byte character sets (DBCS), specifically those for Chinese, Japanese, and Korean:

  • Japanese Shift-JIS (932)
  • Simplified Chinese GBK (936)
  • Korean (949)
  • Traditional Chinese Big5 (950)
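One way to produce double-byte test data on a US machine without changing any system settings is to encode text in a scripting language. The sketch below uses Python's built-in `cp932`, `cp936`, `cp949`, and `cp950` codecs. The sample strings and the `encode_samples` helper are illustrative choices, not part of the original answer, and Python's codecs may differ from the Windows code pages in a few edge-case mappings.

```python
# Sample strings are arbitrary illustrations; substitute any text you like.
SAMPLES = {
    "cp932": "日本語",   # Japanese (Shift-JIS)
    "cp936": "中文",     # Simplified Chinese (GBK)
    "cp949": "한국어",   # Korean
    "cp950": "繁體",     # Traditional Chinese (Big5)
}


def encode_samples(samples=SAMPLES):
    """Return {codec_name: encoded_bytes} for each sample string."""
    return {codec: text.encode(codec) for codec, text in samples.items()}


if __name__ == "__main__":
    for codec, raw in encode_samples().items():
        hex_bytes = " ".join(f"{b:02x}" for b in raw)
        # Every sample character here occupies two bytes in its codepage.
        print(f"{codec}: {len(raw)} bytes: {hex_bytes}")
```

Writing the resulting bytes to a file opened in binary mode gives you an input file to feed the legacy application. The application will still only interpret it correctly when it runs under the matching system codepage.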

Since you are asking this question, I'm assuming that you don't speak any of these languages and don't have your system configured to use any of them. That means you will need to change your system's default codepage to one of these. You might want to do that in a VM. To do so, open the "Region" control panel (how to find it depends on your version of Windows), select the "Administrative" tab, and click "Change system locale." You'll need to reboot after making this change.

I've heard that you can use Microsoft's AppLocale utility to change the codepage for an individual application, but it does have some limitations and compatibility problems. I've never tried it myself. I also don't think it works on newer versions of Windows; the last supported versions are Windows XP/Server 2003. I would recommend sticking with an appropriately-localized VM.

Again, you can find tables of characters supported by these codepages online (see links below), or by using the Character Map utility on a localized installation. As Hans suggested in a comment, an even easier way to do it might be copying and pasting Simplified Chinese text (e.g., for CP 936) from a webpage on the Internet.
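If you would rather build such a table yourself, a brute-force sketch is to try every two-byte combination and keep the ones that decode to exactly one character. The function name `double_byte_chars` and the byte ranges scanned are assumptions for this example, and the result reflects Python's codec tables, which may not match Windows exactly in every slot.

```python
def double_byte_chars(codec, lead_range=range(0x81, 0xFF),
                      trail_range=range(0x40, 0xFF)):
    """Yield (raw_bytes, character) for each 2-byte sequence that the
    codec decodes to a single character."""
    for lead in lead_range:
        for trail in trail_range:
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid double-byte sequence in this codec
            # Two single-byte characters decode to a 2-character string,
            # so keep only sequences that form exactly one character.
            if len(text) == 1:
                yield raw, text


if __name__ == "__main__":
    table = dict(double_byte_chars("cp936"))
    print(len(table), "double-byte characters found in cp936")
    print(table[b"\xd6\xd0"])
```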

As far as the technical implementation goes, a DBCS encodes each character in either one or two bytes: ASCII characters stay single-byte, while CJK ideographs take two. The first (lead) byte of a two-byte character signals that it and the byte to follow (the trail byte) are to be interpreted as a single character. MBCS-aware functions (with the _mbs prefix in Microsoft's string-manipulation headers) recognize this and process the characters accordingly. The lead bytes are specifically reserved and defined for each codepage. For example, CP 936 (Simplified Chinese) uses 0x81 through 0xFE as lead bytes, while CP 932 (Japanese) uses 0x81 through 0x9F and 0xE0 through 0xFC. If you use the string functions designed to deal with MBCS, you shouldn't have a problem. You will only have difficulty if you were careless enough to have fallen back to naïve ASCII-style string manipulation, iterating through bytes and treating them as individual characters.

If at all feasible, though, you should strongly consider upgrading the app to support Unicode. There is no guarantee that it will be easy, but it won't be any harder than fixing a lack of support for MBCS codepages in a legacy non-Unicode application, and as a bonus, the time you spend doing so will pay many more dividends.