I am trying to read a UTF-8 encoded file into a UTF-32 (UCS-4) string. Basically, I want a fixed-size character type inside the application.
Here I want to make sure the translation is done as part of the stream processing (because that is what the locale is supposed to be used for). Other questions have been posted about doing the translation on the string, but that is wasteful: you have to do a translation phase in memory and then a second pass to send it to the stream. By doing it with the locale in the stream you only need a single pass, and there is no requirement for a copy to be made (assuming you want to keep the original).
This is what I tried.
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::locale converter(std::locale(), new std::codecvt_utf8<char32_t>);
    std::basic_ifstream<char32_t> iFile;
    iFile.imbue(converter);
    iFile.open("test.data");

    std::u32string line;
    while(std::getline(iFile, line))
    {
    }
}
Since these are all standard types, I was surprised by this compilation error:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/istream:275:41:
error: no matching function for call to 'use_facet'
const ctype<_CharT>& __ct = use_facet<ctype<_CharT> >(__is.getloc());
^~~~~~~~~~~~~~~~~~~~~~~~~
Compiled with:
g++ -std=c++14 test.cpp
It seems that char32_t is not what I wanted. The error shows that the stream needs a std::ctype<char32_t> facet, and the standard library only provides std::ctype for char and wchar_t. Simply moving to wchar_t worked for me. I suspect that this only works the way I want on Linux-like systems, and that on Windows the conversion will be to UTF-16 (UCS-2) (but I can't test that).

A comment above suggested this would be slower than reading the data and then translating it. So I did some tests:
// read1.cpp Translation in stream using codecvt and Locale
// read2.cpp Translation using codecvt after reading.
// read3.cpp Using UTF-8
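The sources of these test programs are not reproduced here. As a rough sketch only, the in-stream approach with wchar_t might look like the following. The helper name read_utf8_lines is hypothetical, and note that <codecvt> is deprecated since C++17, so newer compilers may warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper (not from the original tests): read a UTF-8 file
// line by line, letting the stream's locale decode the bytes as it reads.
std::vector<std::wstring> read_utf8_lines(const std::string& path)
{
    std::wifstream file;
    // Imbue before open so the conversion facet is in place for the first read.
    file.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    file.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(file, line))
    {
        lines.push_back(line);
    }
    return lines;
}
```

Where wchar_t is 32 bits wide (typical on Linux), each element of the returned strings holds one code point. Where it is 16 bits wide, codecvt_utf8<wchar_t> can only represent characters in the Basic Multilingual Plane.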
The test file was 58 MB of Japanese Unicode text.
Doing the translation in the stream is faster, but not significantly so (note that it was a lot of data). So choose the one that is easiest to read.
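For comparison when judging readability, the translate-after-reading style could be sketched as below. This is an assumption about what such code might look like, not the code that was timed; read_then_convert is a hypothetical name, and std::wstring_convert is also deprecated since C++17.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original tests): read the raw bytes
// first, then decode the UTF-8 in a second pass over memory.
std::wstring read_then_convert(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    std::ostringstream bytes;
    bytes << file.rdbuf();  // first pass: copy the raw UTF-8 bytes into memory

    // Second pass: decode; from_bytes throws std::range_error on invalid input.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
    return utf8.from_bytes(bytes.str());
}
```

This version holds both the encoded bytes and the decoded string in memory at once, which is the extra copy the in-stream approach avoids.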