I compile and run the following C++ source code on Windows 10 and on Ubuntu (via WSL 2):
#include <cstring>
#include <iostream>

int main()
{
    char str[] = "Hello, привет, 😎!";
    std::cout << str << "\n\n";
    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << (int) str[i] << ' ';
    }
    std::cout << "\n\n";
    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << std::hex << (int) str[i] << ' ';
    }
    std::cout << "\n\n";
    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << std::hex << (str[i] & 0xff) << ' ';
    }
    std::cout << '\n';
    return 0;
}
I save this source code in a file chars.cpp, UTF-8 encoded without a BOM. On Windows 10, I use the MSVC compiler (cl.exe) from the Microsoft C++ Build Tools, invoked from the command line as cl /EHsc /utf-8 "chars.cpp". On Ubuntu (via WSL 2), I use the g++ compiler from GCC: g++ /mnt/c/Users/Илья/source/repos/test/chars.cpp -o chars.
Both give the following result (on Windows 10 you first need to set the console code page in cmd.exe with the chcp 65001 command; on Ubuntu via WSL 2 this is not necessary):
Hello, привет, 😎!
72 101 108 108 111 44 32 -48 -65 -47 -128 -48 -72 -48 -78 -48 -75 -47 -126 44 32 -16 -97 -104 -114 33
48 65 6c 6c 6f 2c 20 ffffffd0 ffffffbf ffffffd1 ffffff80 ffffffd0 ffffffb8 ffffffd0 ffffffb2 ffffffd0 ffffffb5 ffffffd1 ffffff82 2c 20 fffffff0 ffffff9f ffffff98 ffffff8e 21
48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21
I'm curious why negative numbers are used to represent some characters. I tried to find an explanation on cppreference.com and read two articles there:
https://en.cppreference.com/w/cpp/language/types, quote:
char - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type). Multibyte characters strings use this type to represent code units. For every value of type unsigned char in range [0, 255], converting the value to char and then back to unsigned char produces the original value. (since C++11) The signedness of char depends on the compiler and the target platform: the defaults for ARM and PowerPC are typically unsigned, the defaults for x86 and x64 are typically signed.
and
https://en.cppreference.com/w/cpp/string/multibyte
But I didn't find a direct explanation there.
My questions: for what purpose are some characters represented by negative numbers? Is this mandated by the standard, or is it system-specific?
Thanks to everyone for the comments; I think I now understand what is going on here. Please correct me if I'm wrong.
As far as I understand, the C++ language standard allows the compiler to treat char as either signed char or unsigned char. The MSVC and g++ compilers treat char as signed char by default. Thus, the char type in my program can represent values in the range -128..127. Consider the Cyrillic small letter 'п' as an example: in the Unicode table it is U+043F; in UTF-8 encoding this is the 2 bytes d0 bf (hex), or 208 191 (dec). Since the numbers 208 and 191 do not fit into the range -128..127, they are converted to -48 and -65 (208 - 256 and 191 - 256). All characters are processed this way, so a character whose code falls into the range 0..127 (the ASCII table) is left unchanged.
This default behavior of MSVC and g++ can be changed with special switches (options): MSVC has the /J option, and g++ has -funsigned-char. As a result, the same source code, compiled and run with these options, gives a different result:
With the new options, the compilers treat char as unsigned char (range 0..255), so there are no negative numbers in the string's byte representation.