UTF-8-compliant IOstreams

2.7k Views Asked by Nordlöw At 25 October 2011 at 12:07

Does GCC's standard library, Boost, or any other library implement iostream-compliant versions of ifstream or ofstream that support conversion between UTF-8-encoded (file) streams and a std::vector<wchar_t> or std::wstring?


Kerrek SB On 25 October 2011 at 12:15 BEST ANSWER

Your question doesn't quite work as posed. UTF-8 is a specific encoding, while wchar_t is a data type. Moreover, the standard intends wchar_t to represent the system's character set, but the details are left entirely to the platform, and the standard imposes no requirements on the encoding.

Therefore, the first thing to ask for is conversion between the system's narrow, multibyte encoding and the system's fixed-width wide encoding, with the result held in a wide string. This functionality is provided by std::mbstowcs and std::wcstombs. There is also a locale facet that wraps this (std::codecvt<wchar_t, char, std::mbstate_t>), but that's a bit of a niche area of the library.
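To make the mbstowcs/wcstombs route concrete, here is a minimal sketch. The helper names to_wide and to_narrow are made up for illustration, and both functions convert to and from whatever encoding the current C locale uses, which is not necessarily UTF-8.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: current locale's multibyte encoding -> wchar_t.
std::wstring to_wide(const std::string& narrow)
{
    // Every wide character consumes at least one byte, so size() + 1
    // (for the terminator) is a safe upper bound.
    std::vector<wchar_t> buf(narrow.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    return std::wstring(buf.data(), n);
}

// Hypothetical helper: wchar_t -> current locale's multibyte encoding.
std::string to_narrow(const std::wstring& wide)
{
    // MB_CUR_MAX is the longest multibyte character in the current locale.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");
    return std::string(buf.data(), n);
}
```

A program would normally call std::setlocale(LC_ALL, "") once at startup so that "the system's encoding" means the user's environment rather than the default "C" locale. Strings with embedded null characters are cut short by these functions.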

If you want to convert between the opaque "system's encoding" prescribed by the standard and a definite encoding prescribed by your serialized data source/sink, you need an extra library. I'd recommend POSIX's iconv(), which is widely available. (The Windows API takes a different approach and offers its own special functions for conversion.)
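As a rough sketch of the iconv() route: the helper name utf8_to_wide is made up, and the "WCHAR_T" encoding name is an assumption. glibc and GNU libiconv accept it, but POSIX does not mandate any particular encoding names, so other platforms may need a different one.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 string into the platform's wchar_t
// encoding. "WCHAR_T" is a glibc / GNU libiconv name, not a POSIX guarantee.
std::wstring utf8_to_wide(const std::string& utf8)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8"); // (to, from)
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    std::string in_copy = utf8;                // iconv wants a non-const char*
    char* in = &in_copy[0];
    std::size_t in_left = in_copy.size();

    // One wchar_t per input byte is an upper bound on the output length.
    std::vector<wchar_t> out(utf8.size() + 1);
    char* out_ptr = reinterpret_cast<char*>(out.data());
    std::size_t out_left = out.size() * sizeof(wchar_t);

    std::size_t rc = iconv(cd, &in, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv failed (malformed input?)");

    // out_left counts the bytes that were NOT used in the output buffer.
    std::size_t written = out.size() - out_left / sizeof(wchar_t);
    return std::wstring(out.data(), written);
}
```

On some systems iconv lives in a separate library and needs -liconv at link time. Converting a whole file this way means reading the bytes first and converting in chunks; it does not plug into iostreams by itself.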

C++11 alleviates the issue slightly by adding an explicit family of UTF-encoded string types and literals, and presumably also transcoding facilities among those (though I've never seen them implemented by anyone).
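For reference, a sketch of what those C++11 transcoding facilities look like where they are implemented. It assumes a standard library that ships <codecvt>; std::wstring_convert and std::codecvt_utf8 were later deprecated in C++17, so newer compilers may warn. The helper names are made up.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8); // throws std::range_error on malformed input
}

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

Unlike wchar_t, char32_t has a definite meaning on every platform, so the result does not depend on the locale or the OS.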

Here are my standard responses from past posts on the subject: Q1, Q2, Q3. C++11 will be a joy once it's fully available :-)

Cubbi On 25 October 2011 at 13:20

The C++11 solution is to wrap the UTF-8 stream in an appropriate wbuffer_convert:

#include <codecvt>
#include <fstream>
#include <locale>   // std::wbuffer_convert lives here
#include <string>

int main()
{
    std::ifstream utf8file("test.txt"); // if the file holds UTF-8 data
    // Decode UTF-8 bytes from the file's buffer into wchar_t on the fly
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> conv(utf8file.rdbuf());
    std::wistream ucsbuf(&conv);
    std::wstring line;
    std::getline(ucsbuf, line); // then line holds UCS-2 or UCS-4, depending on the OS
}

This works with Visual Studio 2010 and with clang++/libc++, but, unfortunately, not yet with GCC (libstdc++).

Until this becomes widespread, third-party libraries are indeed the best solution.
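Since the question also asks about ofstream, the same wrapper should work in the output direction. The following is a sketch under the same assumption (a standard library that provides <codecvt> and std::wbuffer_convert); the function name write_utf8_line is made up.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <ostream>
#include <string>

// Hypothetical helper: write one wide-character line to a file as UTF-8.
void write_utf8_line(const std::string& path, const std::wstring& text)
{
    std::ofstream utf8file(path.c_str(), std::ios::binary);
    // Encode wchar_t to UTF-8 bytes on the way into the file's buffer
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> conv(utf8file.rdbuf());
    std::wostream out(&conv);
    out << text << L'\n';
    // Flush explicitly: the converting buffer is not required to flush
    // pending output when it is destroyed.
    out.flush();
}
```

The explicit flush is the part that is easy to forget. Without it, buffered characters may never reach the underlying file buffer.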
