char8_t and utf8everywhere: How to convert to const char* APIs without invoking undefined behaviour?


This is a follow-up to the question "Is C++20 'char8_t' the same as our old 'char'?", which is now some years old.

I would like to know what the recommended way to handle conversion between char8_t and char is right now. boost::nowide (1.80.0) does not yet understand char8_t, nor (AFAIK) does boost::locale.

As Tom Honermann noted:

reinterpret_cast<const char   *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text");   // Undefined behavior.

So: how do I interact with APIs that only accept const char* or const wchar_t* (think Win32 API) if my application's default string type is std::u8string? The recommendation seems to be https://utf8everywhere.org/.

If I convert between std::u8string and std::string with

std::u8string convert(std::string str)
{
    return std::u8string(reinterpret_cast<const char8_t*>(str.data()), str.size());
}
std::string convert(std::u8string str)
{
    return std::string(reinterpret_cast<const char*>(str.data()), str.size());
}

This would invoke the same UB that Tom Honermann mentioned. I would use these functions whenever I talk to the Win32 API or any other API that takes a const char* or hands one back. I could route all conversions through boost::nowide, but in the end I get a const char* back from boost::nowide::narrow() that I still need to cast.

Is the current recommendation to just stay with char and ignore char8_t?


There are 2 best solutions below

On BEST ANSWER

This would invoke the same UB that Tom Honermann mentioned.

As pointed out in the post you referred to, UB only happens when you cast from a char* to a char8_t*. The other direction is fine.

If you are given a char* that holds UTF-8 data (and you want to avoid the UB of a plain pointer cast), you can use std::ranges::transform to convert each char to a char8_t by value:

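#include <algorithm>
#include <string>

std::u8string convert(std::string str)
{
    // basic_string has no size-only constructor, so pass a fill character.
    std::u8string ret(str.size(), u8'\0');
    // Convert each char by value; no pointer cast, so no aliasing problem.
    std::ranges::transform(str, ret.begin(), [](char c) { return char8_t(c); });
    return ret;
}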

C++23's ranges::to will make using a named return variable unnecessary.

For wchar_t interfaces (which you shouldn't need, since Windows nowadays supports UTF-8 through its narrow-character interfaces), you'll have to do an actual UTF-8 to UTF-16 conversion, which you would have had to do anyway.

On

Personally, I think all the char8_t stuff in C++ is practically unusable!

Given the current standard and the current state of OS support, I would recommend avoiding it if possible.

But that is not all. There is more to criticize:

Unfortunately, the C++ standard deprecates its own conversion support before it offers a replacement. For example, std::filesystem::u8path, which builds a path from a UTF-8 encoded std::string (not a std::u8string), is deprecated. As a result, even using a UTF-8 encoded std::string is a pain, because you must always convert it to a std::u8string and back again.

Now to your question: it depends on what you want to do. If you want a std::string that is UTF-8 encoded but you only have a std::u8string, you can simply do the following (no reinterpret_cast needed):

std::string convert( std::u8string str )
{
    return std::string(str.begin(), str.end());
}

Here I would personally expect the standard to offer a std::string constructor that moves from a std::u8string. Without one, you always have to make a copy, with an extra allocation, of data that does not change at all. Unfortunately the standard does not offer such simple things and forces users into inconvenient and expensive workarounds.

The same is true in the other direction: if you have a std::string and have verified 100% that it is valid UTF-8, you can convert it directly:

std::u8string convert( std::string str )
{
    return std::u8string( str.begin(), str.end() );
}

While writing this long answer I realized that things are even worse than I thought when it comes to conversion! If you need to do a real conversion of the encoding, it turns out that std::u8string is not supported at all.

The only possible way (as far as my research goes) is to use std::string as the data holder for the conversion, since the available routines work on char and NOT on char8_t!

So, for the conversion from std::string to std::u8string you must do the following:

  1. Use std::mbrtoc16 or std::mbrtoc32 to convert the narrow chars to either UTF-16 or UTF-32.
  2. Use std::codecvt_utf8 (or std::codecvt_utf8_utf16 if you went through UTF-16) to produce a UTF-8 encoded std::string.
  3. Finally use the routine above to convert from UTF-8 encoded std::string to std::u8string.

For the other way round from std::u8string to std::string you must do the following:

  1. Use the routine above to create a UTF-8 encoded std::string.
  2. Use std::codecvt_utf8 (or std::codecvt_utf8_utf16 for UTF-16) to create a UTF-16/32 string.
  3. Finally, use std::c16rtomb or std::c32rtomb to produce a narrow-encoded std::string.

But guess what? The codecvt routines are deprecated without a replacement...

So, personally, I would recommend using the Windows API for the conversion and using std::string only (or std::wstring on Windows). Usually only on Windows is std::string / char encoded with a Windows code page; everywhere else you can normally expect UTF-8 (except maybe on mainframes and some very rare old systems).

The conclusion can only be: Don't mess around with char8_t and std::u8string at all. It is practically unusable.