I work on Linux. I have to read from the console into a char16_t buffer. Currently my code looks like this:
char tempBuf[1024] = {0};
int readBytes = read(STDIN_FILENO, tempBuf, 1024);
char16_t* buf = convertToChar16(tempBuf, readBytes);
Inside the convert function I use the mbrtoc16 standard library function to convert each character separately. Is this the only way to read from the console into a char16_t buffer? Do you know of any alternative solution?
Multi-byte Characters
The main thing to be careful of when reading into a fixed-length buffer is accidentally truncating a "multi-byte character" in your "multi-byte string".
What is a multi-byte character, you ask? In my environment they're UTF-8 characters. For example, if I run
echo $LANG
I get en_US.UTF-8. These are exactly what they sound like: characters that can be stored over multiple bytes. Anything outside the 7-bit ASCII set is stored in two or more bytes that follow each other sequentially. If you read only part of a multi-byte character (truncating it), you end up with garbage on both sides of the read.
So let's look at a concrete example.
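As a small aside, the truncation check itself can be sketched in isolation. This is only a minimal sketch, assuming the input is UTF-8; the helper name utf8_complete_prefix_len is hypothetical and not part of any library. It looks at the last lead byte in the buffer and reports how many leading bytes can be decoded without splitting a character.

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): returns the number of
 * leading bytes in buf[0..len) that end on a UTF-8 character boundary.
 * If the buffer ends in the middle of a multi-byte character, the bytes
 * of that partial character are excluded from the count.
 * Assumes UTF-8 input; malformed input is passed through unchanged so
 * that the real decoder can report it. */
size_t utf8_complete_prefix_len(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = len;
    size_t back = 0;

    /* Step back over at most three trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (p[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;                 /* no lead byte found, nothing to hold back */

    unsigned char lead = p[i - 1];
    size_t need;
    if (lead < 0x80)                need = 1;   /* 7-bit ASCII */
    else if ((lead & 0xE0) == 0xC0) need = 2;   /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) need = 3;   /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) need = 4;   /* 11110xxx */
    else
        return len;                 /* not a valid lead byte */

    size_t have = len - (i - 1);    /* bytes available from the lead byte on */
    return (have < need) ? (i - 1) : len;
}
```

For a buffer holding "caf" followed by only the first byte of "é" (0xC3), the helper reports 3, so the trailing 0xC3 would have to be kept back until its partner byte arrives.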
Example Code
In the complete runnable file below, I purposefully shorten the buffer to only 5 bytes so that it can hold exactly one full 4-byte UTF-8 multi-byte character and a null terminator.
Running an example
Taking the above code, I can deliberately construct an input that I know will truncate a character, like so, to see what is going on.
So what happened?
In the example above, I purposefully positioned the UTF-8 character "é", which expands to the two bytes 0xC3, 0xA9, such that it would get cut off by your read call. I then used ungetc to put 0xC3 back into stdin, and read it again with its partner 0xA9. Only when they're next to each other do they make any sense. You see a 0x0a following it, which we know and love as '\n', because the read captured my return as well.