Is there any platform-independent way to convert between UTF-8 and a plain string?

Here, "plain string" means a string in the encoding that:

  • a plain string literal such as "plainstring" is encoded in;

  • all standard library functions return or accept. For example:


std::cout << "I'm ok." ; // plain string, ok on my system,
                            // VS2015 x64 default encoding setting.
std::cout << u8"I'm wrong."; // got error display on my system

std::experimental::filesystem::path path("Some Right specified Path contains non-ASCII chars"); // ok

std::experimental::filesystem::path path2(u8"Some Path specified Path contains non-ASCII chars"); // error

std::experimental::filesystem::directory_iterator r(path); // ok

std::experimental::filesystem::directory_iterator r2(path2); // will throw exception

As far as I know, my system (Windows 10 x64) uses GB2312 encoding for such plain strings.

But how can I convert them to another encoding such as UTF-8 (and back again) in a platform-independent way?

There is 1 best solution below

This is a simple-sounding question, but it is actually an extremely complex issue.

The short answer: a round trip from GB2312 to UTF-8 and back to GB2312 is possible, but a round trip from UTF-8 to GB2312 and back to UTF-8 is not possible in general. It only works when every character in the text happens to exist in GB2312.

The longer answer: Any string that can be represented in a standards-compliant way can be expressed in Unicode, and any string that can be expressed in Unicode can be encoded in UTF-8.

The converse is not true. It is not possible to convert an arbitrary Unicode string into a legacy (non-Unicode) encoding such as GB2312, because such encodings cover only a subset of Unicode.

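The Unicode code space contains 1,114,112 code points (U+0000 through U+10FFFF). Distinguishing that many values takes 21 bits, so a fixed-width encoding needs at least three bytes per code point. UTF-8 is variable-width: it uses one to four bytes per code point and can represent every Unicode character.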
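To make the byte counts concrete, here is a minimal sketch of how a single code point maps to UTF-8. The name encode_utf8 is a hypothetical helper chosen for illustration, not a standard library function, and a real program would normally rely on a tested library instead.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates (U+D800..U+DFFF) and values above U+10FFFF cannot be encoded.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a Unicode scalar value");
    }
    std::string out;
    if (cp < 0x80) {                 // up to 7 bits: 1 byte (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // up to 11 bits: 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // up to 16 bits: 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // up to 21 bits: 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x4E2D) (the character 中) yields the three bytes E4 B8 AD, while any ASCII character stays a single byte.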

GB2312 (a Simplified Chinese character set) contains only about 7,400 characters, 6,763 of them Chinese characters, so the vast majority of Unicode code points have no corresponding entry in GB2312. That is why a UTF-8 to GB2312 conversion is lossy in general, and why a round-trip conversion that starts from arbitrary UTF-8 text cannot be guaranteed.
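The asymmetry can be illustrated with a toy converter. This is only a sketch: toy_table, to_legacy and from_legacy are hypothetical names, and the table holds just two real GB2312 entries (中 and 文), whereas a real converter carries the full mapping. Characters missing from the table are replaced with '?', which is what many best-effort converters do.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Toy mapping with only two GB2312 entries, for illustration.
const std::map<char32_t, std::string>& toy_table() {
    static const std::map<char32_t, std::string> table = {
        {U'\u4E2D', "\xD6\xD0"},  // 中
        {U'\u6587', "\xCE\xC4"},  // 文
    };
    return table;
}

// Unicode -> legacy bytes; unmapped characters become '?' (information is lost).
std::string to_legacy(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) { out += static_cast<char>(cp); continue; }  // ASCII passes through
        auto it = toy_table().find(cp);
        out += (it != toy_table().end()) ? it->second : std::string("?");
    }
    return out;
}

// Legacy bytes -> Unicode; every two-byte code in the table maps back exactly.
std::u32string from_legacy(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80) { out += static_cast<char32_t>(b); continue; }
        std::string pair = bytes.substr(i, 2);  // lead byte plus trail byte
        ++i;
        char32_t found = U'\uFFFD';             // replacement character if unknown
        for (const auto& entry : toy_table()) {
            if (entry.second == pair) found = entry.first;
        }
        out += found;
    }
    return out;
}
```

With this sketch, from_legacy(to_legacy(U"\u4E2D\u6587")) gives back the original text, but a string containing U+1F600 comes back with a '?' in its place, so the original cannot be recovered.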

That being said, there are "best-effort" converters from UTF-8 to GB2312, and there is no reason why they shouldn't be platform-independent. A Google search for "UTF-8 to GB2312 conversion" finds many possibilities, most of which do not depend on any particular platform.

I suggest that you do this search and pick the result that meets your needs.

One platform-independent solution for converting between encodings is Boost.Locale. A complete explanation of what it can do for you is beyond what would fit in a Stack Overflow answer <humor>even if I use the margins</humor>.

For additional reading: this page provides useful background information for understanding string encoding issues.