If I take the length of a string containing a character outside the 7-bit ASCII table, I get different results on Windows and Linux:
Windows: strlen("ö") = 1
Linux: strlen("ö") = 2
On the Windows machine the string is encoded in the "extended ASCII" Windows-1252 code page as the single byte 0xF6, whereas on the Linux machine it is encoded in UTF-8 as the two bytes 0xC3 0xB6. Since strlen counts bytes rather than characters, that gives a length of 2.
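A minimal sketch to reproduce the observation; the byte dump shows which bytes the compiler actually stored in the literal:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "ö";

    /* strlen counts bytes, not characters */
    printf("strlen(\"ö\") = %zu\n", strlen(s));

    /* dump the raw bytes the compiler stored in the literal */
    for (size_t i = 0; i < strlen(s); i++)
        printf("0x%02X ", (unsigned char)s[i]);
    printf("\n");
    return 0;
}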
Question:
Why does a C string get encoded differently on a Windows and a Linux machine?
The question came up in a discussion I had with a fellow forum member on Code Review (see this thread).
First, this is not a Windows/Linux (operating system) issue but a compiler one: compilers exist on Windows that encode string literals the same way gcc (common on Linux) does.
This is allowed by C, and the two compiler makers have chosen different implementations per their own programming goals, MS using CP-1252 and gcc/Linux using Unicode (UTF-8). @Danh: MS's selection pre-dates Unicode, so it is not surprising that various compiler makers employ different solutions.
"ö" is encoded per the compiler's extended execution character set, i.e. how the compiler maps extended source characters into the bytes of a string literal. I suspect MS is focused on maintaining their existing code base and encouraging other languages, whereas Linux was simply an earlier adopter of Unicode into C, even though MS has been an early Unicode influencer.
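As a sketch (assuming a reasonably recent gcc and MSVC; the file name encoding_demo.c is a hypothetical placeholder), the execution character set can also be chosen explicitly, which underlines that the encoding is a compiler setting rather than an OS property:

/* Possible build commands (flag availability depends on the compiler
 * version and, for gcc, on iconv support for the named charset):
 *
 *   gcc -fexec-charset=UTF-8  encoding_demo.c      -> UTF-8 literals (gcc default)
 *   gcc -fexec-charset=CP1252 encoding_demo.c      -> Windows-1252 literals
 *   cl  /execution-charset:utf-8 encoding_demo.c   -> MSVC with UTF-8 literals
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 1 byte under CP-1252, 2 bytes under UTF-8 */
    printf("%zu\n", strlen("ö"));
    return 0;
}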
As Unicode support grows, I expect UTF-8 to be the solution of the future.
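A related sketch, assuming a C11-capable compiler: u8 string literals are UTF-8 by definition, independent of the execution character set (in C23 their element type changes to char8_t, so a cast may then be needed):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* u8"" literals are UTF-8 regardless of the execution character set,
       so this prints 2 with both MSVC and gcc (C11/C17) */
    printf("%zu\n", strlen(u8"ö"));
    return 0;
}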