Why in C++ are some characters in a multibyte UTF-8 string represented by negative numbers?


I compile and run the following C++ source code on Windows 10 and on Ubuntu (via WSL 2):

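#include <cstddef>
#include <cstring>
#include <iostream>

int main()
{
    // UTF-8 encoded literal (the source file is saved as UTF-8 without BOM).
    char str[] = "Hello, привет, 😎!";
    const std::size_t len = std::strlen(str);  // number of bytes, not characters

    std::cout << str << "\n\n";

    // Each byte as a decimal int.
    for (std::size_t i = 0; i < len; i++) {
        std::cout << static_cast<int>(str[i]) << ' ';
    }
    std::cout << "\n\n";

    // Each byte as a hex int (std::hex stays in effect for later output).
    std::cout << std::hex;
    for (std::size_t i = 0; i < len; i++) {
        std::cout << static_cast<int>(str[i]) << ' ';
    }
    std::cout << "\n\n";

    // Each byte in hex, keeping only the low 8 bits.
    for (std::size_t i = 0; i < len; i++) {
        std::cout << (str[i] & 0xff) << ' ';
    }
    std::cout << '\n';

    return 0;
}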

I saved this source code in the file chars.cpp, encoded as UTF-8 without a BOM. On Windows 10, I compile it with the MSVC compiler (cl.exe) from Microsoft C++ Build Tools, using the command line cl /EHsc /utf-8 "chars.cpp". On Ubuntu (via WSL 2), I compile it with the g++ compiler from GCC, using the command line g++ /mnt/c/Users/Илья/source/repos/test/chars.cpp -o chars.

I got the following result (on Windows 10, the code page of the cmd.exe console must first be set with the chcp 65001 command; on Ubuntu (via WSL 2) this is not necessary):

Hello, привет, 😎!

72 101 108 108 111 44 32 -48 -65 -47 -128 -48 -72 -48 -78 -48 -75 -47 -126 44 32 -16 -97 -104 -114 33

48 65 6c 6c 6f 2c 20 ffffffd0 ffffffbf ffffffd1 ffffff80 ffffffd0 ffffffb8 ffffffd0 ffffffb2 ffffffd0 ffffffb5 ffffffd1 ffffff82 2c 20 fffffff0 ffffff9f ffffff98 ffffff8e 21

48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21

I'm curious why negative numbers are used to represent some characters. I looked for an explanation on cppreference.com and read two articles there:

https://en.cppreference.com/w/cpp/language/types, quote:

char - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type). Multibyte characters strings use this type to represent code units. For every value of type unsigned char in range [0, 255], converting the value to char and then back to unsigned char produces the original value. (since C++11) The signedness of char depends on the compiler and the target platform: the defaults for ARM and PowerPC are typically unsigned, the defaults for x86 and x64 are typically signed.

and

https://en.cppreference.com/w/cpp/string/multibyte

But I didn't find a direct explanation there.

My questions: for what purpose are some characters represented by negative numbers? Is this specified by the standard, or is it system-specific?

Answer by Ilya Chalov

Thanks to the people for the comments, I think I understood what was going on here. Please correct me if I'm wrong.

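As far as I understand, the C++ standard leaves the signedness of plain char to the implementation: char is a distinct type that has the same representation as either signed char or unsigned char, depending on the compiler and the target platform.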

On x86 and x64, the MSVC and g++ compilers treat char as signed by default. Thus, the char type in my program can represent values in the range -128..127. Consider the Cyrillic small letter 'п': its Unicode code point is U+043F, and its UTF-8 encoding is the 2 bytes d0 bf (hex), or 208 191 (dec).

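Since 208 and 191 do not fit into the range -128..127, the same bit patterns read as signed values come out as -48 and -65 (208 - 256, 191 - 256). The same happens to every byte of the string; bytes in the range 0..127 (the ASCII range) keep their values. When such a negative char is converted to int, it is sign-extended, which is why the hex output shows ffffffd0 rather than d0, and why masking with 0xff restores d0. So the negative numbers serve no purpose of their own: they are a side effect of viewing UTF-8 code units through a signed type.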

This default of the MSVC and g++ compilers can be changed with compiler options. The MSVC compiler has the /J option:

cl /EHsc /utf-8 /J "chars.cpp"

And for the g++ compiler there is an option -funsigned-char:

g++ /mnt/c/Users/Илья/source/repos/test/chars.cpp -o chars -funsigned-char

Compiled with these options and run again, the same source code gives a different result:

Hello, привет, 😎!

72 101 108 108 111 44 32 208 191 209 128 208 184 208 178 208 181 209 130 44 32 240 159 152 142 33

48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21

48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21

With these options, the compilers treat char as unsigned (range 0..255), so no negative numbers appear in the output.