Most C compilers make plain char a signed type. Most C libraries define EOF as -1.
Despite being a long-time C programmer, I had never put these two facts together before, so in the interest of robust and international software I would ask for a bit of help in spelling out the implications.
Here is what I have discovered thus far:
- fgetc() and friends convert each character to unsigned char before returning it as an int, so that valid characters never clash with EOF.
- Therefore care needs to be taken when comparing the results, e.g. getchar() == (unsigned char) 'µ' (see the sketch after this list). Theoretically I believe that not even the basic character set is guaranteed to be positive.
- The <ctype.h> functions are designed to handle EOF and expect unsigned character values; any other negative input may cause out-of-bounds addressing.
- Most functions that take character parameters as int ignore EOF and will accept signed or unsigned characters interchangeably.
- String comparison (strcmp/strncmp/memcmp) interprets the characters being compared as unsigned char.
- It may not be possible to discriminate EOF from a proper character on systems where sizeof(int) == 1.
- The wide-character functions are not used for binary I/O, so WEOF may lie within the range of wchar_t; it only has to be distinct from every member of the extended character set.
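To make the first two bullets concrete, here is a minimal sketch of the pattern I now believe is needed. It assumes a single-byte execution character set such as Latin-1, where 'µ' is the byte 0xB5; under UTF-8 the comparison would have to deal with a multibyte sequence instead:

    #include <stdio.h>

    int main(void)
    {
        int c;            /* int, not char, so that EOF stays distinguishable */
        long count = 0;

        while ((c = getchar()) != EOF) {
            /* getchar() has already converted the byte to unsigned char, so a
               plain 'µ' literal (possibly negative with signed char) would
               never match; compare against the unsigned char value instead. */
            if (c == (unsigned char) '\xb5')   /* 'µ' in Latin-1 */
                count++;
        }
        printf("%ld\n", count);
        return 0;
    }

Storing the result in a char before testing against EOF would reintroduce the problem, because the truncation can map a real character (e.g. 0xFF) onto the value of EOF.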
Is this assessment correct and if so what other gotchas did I miss?
Full disclosure: I ran into an out-of-bounds indexing bug today when feeding non-ASCII characters to isspace(), and realizing how many similar bugs might be lurking in my old code both scared and annoyed me. Hence this frustrated question.
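For reference, the bug had roughly this shape, shown here with a hypothetical skip_spaces() helper (not my actual code): passing a possibly negative plain char straight to isspace() is undefined behaviour, and the conventional fix is to convert to unsigned char first.

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical helper, only to illustrate the pattern. */
    static const char *skip_spaces(const char *s)
    {
        /* Buggy: isspace(*s) is undefined when *s is negative and not equal
           to EOF, which is exactly what non-ASCII bytes give you when plain
           char is signed. */
        /* while (isspace(*s)) s++; */

        /* Fixed: convert to unsigned char before calling a <ctype.h> function. */
        while (isspace((unsigned char) *s))
            s++;
        return s;
    }

    int main(void)
    {
        /* "caf\xe9" contains a non-ASCII byte; the fixed version handles it. */
        printf("[%s]\n", skip_spaces("  \t caf\xe9"));
        return 0;
    }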
The basic execution character set is guaranteed to be nonnegative - the precise wording in C99 is: