Will we have a size_t strlen(const char8_t*) in a future C++ version?


char8_t in C++20 fixes some problems of char, so I was considering using char8_t instead of char for UTF-8 text (e.g. text from the command line). But then I noticed that the standard does not specify strlen for use with char8_t; in fact, none of the functions in the cstring library are specified for it. Can I expect this to happen in a future standard update? Or was char8_t never intended to replace char in the way I had in mind?

4

There are 4 best solutions below

0
On BEST ANSWER

I'm the author of the P0482 and P1423 char8_t proposals.

The intent of those proposals was to introduce the char8_t type with the same level of support present for char16_t and char32_t and then to follow up with additional functionality later. These proposals were adopted late in the C++20 development cycle (at the San Diego and Cologne meetings respectively), so there wasn't an opportunity to deliver additional features for C++20.

One of the directives for SG16 as described in P1238 is to standardize new encoding aware text container and view types. Work is progressing in this area and we hope to deliver it for C++23. It is hoped that these new containers and views will supplant much raw string handling in C++.

With regard to strlen specifically, strlen is a C API. N2231 is a proposal to add char8_t support to C (again, at the same level as the existing support for char16_t and char32_t). That proposal has not yet been accepted by WG14. Assuming it is eventually accepted, then it would make sense to follow up with additional char8_t-based C string management functions (perhaps enhancing support for char16_t and char32_t as well).

At present, I'm working on completing an implementation of N2231 in gcc and glibc. Once that is complete, I intend to submit a revision of N2231 to WG14.

You can help! SG16 is an open group. Please feel free to subscribe to our mailing list, join us on Slack, share your ideas, needs, and wants, and write proposals for new functionality (we can help with how to do that).

1
On

A 0 byte still acts as the null terminator in UTF-8 strings, so technically nothing (except the lack of an appropriate overload) prevents you from using strlen to count the number of bytes(!) in a UTF-8 sequence. If you want the number of characters (code points), you need a separate function.
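To illustrate the difference between bytes and characters, here is a minimal sketch. It works on plain const char* so that std::strlen applies directly, and it assumes the input is valid, null-terminated UTF-8. The name utf8_code_points is made up for this example and is not a standard function.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of any standard library): counts UTF-8
// code points by skipping continuation bytes, which have the bit
// pattern 10xxxxxx. Assumes valid, null-terminated UTF-8 input.
std::size_t utf8_code_points(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        // Every byte that is not a continuation byte starts a code point.
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9" "llo" (the word "héllo", with é encoded as two bytes), std::strlen reports 6 bytes while utf8_code_points reports 5 code points. Note that a code point is still not necessarily what a user perceives as one character; combining sequences would need real Unicode segmentation.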

0
On

char8_t is intended for UTF-8-encoded strings. As such, APIs that consume them will be assumed by users to be Unicode aware on some level. Quite a lot of the contents of the <cstring> header would be inappropriate for char8_t, as their behavior is very much not in line with Unicode (would strcmp do proper Unicode collation?).

If you want access to functions that work similarly to the <cstring> functions, then you'll find that std::char_traits<char8_t> contains some useful ones, in particular length (exactly like strlen) and compare (explicitly lexicographical). Most of the rest of <cstring> can be handled adequately through C++ algorithms.

0
On

These new character types are intended to be used with the C++ string template std::basic_string, for example through std::u8string. So the best option in your case is to use C++ strings.

As for future support of char8_t in the cstring library, I suppose this question is better directed at the future C standard. I'm afraid such an update will be neither easy nor likely: C does not have overloaded functions, so it would require new functions like c8slen in addition to strlen and wcslen.