Is converting from unsigned char to signed char and vice versa in C89 well defined?


Note: The suggested duplicate deals with unsigned int and signed int, not unsigned char and signed char. The suggested duplicate question deals with C11. This question is concerned with C89 only. Can this question be reopened?

My code:

#include <stdio.h>

int main()
{
    signed char c;
    unsigned char d;

    c = (signed char) -2;
    d = (unsigned char) c;
    printf("%d %d\n", c, d);

    d = (unsigned char) 254;
    c = (signed char) d;
    printf("%d %d\n", c, d);

    return 0;
}

Output:

$ clang -Wall -Wextra -pedantic -std=c89 foo.c && ./a.out
-2 254
-2 254

Is the output guaranteed to be -2 254 in a standard-conforming C89 compiler for both conversions shown above? Or is the output dependent on the implementation?


1
chux - Reinstate Monica On BEST ANSWER

Is converting from unsigned char to signed char and vice versa in C89 well defined?

Conversion to an unsigned type is well defined. Conversion to a signed type is implementation-defined when the value cannot be represented in that type.

Is the output guaranteed to be -2 254 in a standard-conforming C89 compiler for both conversions shown above?

No.

Or is the output dependent on the implementation?

Yes.


Not all implementations use an 8-bit char, and converting an out-of-range value to a signed type gives an implementation-defined result.

Spec details: see the "Conversions" section of C89. Its wording differs from recent C specs, but I have not found a significant difference.
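As a rough illustration of why the unsigned direction is well defined, the result of converting to unsigned char can be reproduced with ordinary arithmetic: reduce the value modulo UCHAR_MAX + 1. The helper name to_uchar_by_rule is made up for this sketch, and it assumes UCHAR_MAX < LONG_MAX so the modulus fits in a long.

```c
#include <limits.h>

/* Hypothetical helper (not from the question): computes what
   (unsigned char) v yields, using only modular arithmetic.
   Assumes UCHAR_MAX < LONG_MAX. */
static unsigned char to_uchar_by_rule(long v)
{
    long m = (long) UCHAR_MAX + 1;  /* one more than the largest unsigned char */
    long r = v % m;                 /* in C89 the sign of r may be negative for v < 0 */

    if (r < 0)
        r += m;                     /* bring the remainder into [0, UCHAR_MAX] */
    return (unsigned char) r;       /* in range, so the value is unchanged */
}
```

With an 8-bit char, to_uchar_by_rule(-2) is 254, the same value as (unsigned char) -2.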


When UCHAR_MAX <= INT_MAX, code could use the expression below and let the compiler emit optimized, well-defined code.

c = (signed char) (d > SCHAR_MAX ? d - UCHAR_MAX - 1 : d);

Likely needs some more thought to cover all cases.
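The expression above could be wrapped in a small function, roughly as follows. The name uchar_to_schar is made up here, and the sketch assumes UCHAR_MAX <= INT_MAX and SCHAR_MIN == -SCHAR_MAX - 1 (the usual two's complement range), so it does not cover every possible implementation.

```c
#include <limits.h>

/* Hypothetical wrapper around the expression above.
   Assumes UCHAR_MAX <= INT_MAX (so d promotes to int) and
   SCHAR_MIN == -SCHAR_MAX - 1. */
static signed char uchar_to_schar(unsigned char d)
{
    if (d > SCHAR_MAX) {
        /* d - UCHAR_MAX - 1 lies in [SCHAR_MIN, -1] under the assumptions,
           so no out-of-range conversion takes place. */
        return (signed char) (d - UCHAR_MAX - 1);
    }
    return (signed char) d;  /* already representable */
}
```

With an 8-bit char, uchar_to_schar(254) computes 254 - 255 - 1, which is -2, without relying on an implementation-defined conversion.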

39
Tom On

If I say anything wrong, please correct me.

Your question is tagged "undefined-behavior". I don't think that is right.

If you have any doubts about the program, I suggest looking at the disassembly code of the program. All your confusion may be easily resolved by examining it.

The output:

-2 254
-2 254

It is right, and it is certain behavior. This behavior is determined by the C language itself, that is, by the C language standard.

The output depends on how the programmer wants to interpret the stored value FE. If you read FF as an unsigned char, it is 255 (FFFF as an unsigned short is 65535, and FFFFFFFF as an unsigned int is 4294967295). If you read FF as a signed char, it is -1 (FFFF as a signed short is -1, and FFFFFFFF as a signed int is -1).

In the same way, FE read as an unsigned char is 254, and FE read as a signed char is -2. And so on.

When you ask a computer to store -2 or 254, the computer does not recognize positive or negative numbers. It only recognizes 0 (in circuitry, roughly "disconnected" or "broken") and 1 (roughly "closed" or "connected"). If you ask the computer to store -2, it stores FE somewhere in memory (variables c and d have character types, so each occupies 1 byte). As @David C. Rankin points out, this holds on computers that encode negative signed values in two's complement. Similarly, if you ask it to store 254, it also stores FE somewhere in memory.
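One way to look at the stored byte directly, rather than at a converted value, is to read the object through an unsigned char pointer. This is only a sketch: stored_byte is a made-up name, and the 0xFE result for -2 is what a two's complement machine with 8-bit char would give, not something every implementation must produce.

```c
/* Hypothetical helper: returns the byte that represents c in memory.
   Reading any object through an unsigned char pointer is permitted. */
static unsigned char stored_byte(signed char c)
{
    return *(unsigned char *) &c;
}
```

On such a machine, stored_byte((signed char) -2) is 0xFE, the same byte that holds the unsigned char value 254.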

See the code below:

#include <stdio.h>

int main()
{
    signed char c;
    unsigned char d;

    c = (signed char) 0xFE;
    d = (unsigned char) c;
    printf("%d %d\n", c, d);

    d = (unsigned char)0xFE;
    c = (signed char) d;
    printf("%d %d\n", c, d);

    return 0;
}

Running it with the command below:

clang -Wall -Wextra -pedantic -std=c89 foo.c && ./a.out

outputs:

-2 254
-2 254

Why is -2 254 printed twice?

There is no -2 or 254 in the code.

Only the number 0xFE appears:

c = (signed char) 0xFE;

d = (unsigned char)0xFE;

So where do -2 and 254 come from?

Simple explanation (a more detailed explanation follows below):


Variables c and d have character types, but %d prints an int (signed int). How should the compiler proceed? The answer is sign extension and zero extension: when passed to printf, each value is promoted to int.

So the value 0xFE stored in variable c is transformed to 0xFFFFFFFE by sign extension, and the value 0xFE stored in variable d is transformed to 0x000000FE by zero extension. With %d, 0xFFFFFFFE prints as -2 and 0x000000FE prints as 254. (If 0xFFFFFFFE is unfamiliar, keep reading; there is an explanation below.)
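The two kinds of extension can be mimicked in portable arithmetic, which may make the -2 and 254 easier to follow. The helper names sign_extend8 and zero_extend8 are invented for this sketch, and both assume the byte is 8 bits wide.

```c
/* Hypothetical helpers; both treat the low 8 bits of byte as the input. */

/* Interpret the 8 bits as two's complement and widen to int. */
static int sign_extend8(unsigned int byte)
{
    /* Flip the sign bit, then subtract its weight:
       0xFE -> 0x7E (126), and 126 - 128 = -2. */
    return (int) ((byte & 0xFFu) ^ 0x80u) - 0x80;
}

/* Interpret the 8 bits as unsigned and widen to int. */
static int zero_extend8(unsigned int byte)
{
    return (int) (byte & 0xFFu);
}
```

Here sign_extend8(0xFE) gives -2 and zero_extend8(0xFE) gives 254, matching the two numbers in the output.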

Or consider code like the following:

#include <stdio.h>

int main()
{
    signed char c;
    unsigned char d;

    c = (signed char) 254;
    d = (unsigned char) c;
    printf("%d %d\n", c, d);

    d = (unsigned char)254;
    c = (signed char) d;
    printf("%d %d\n", c, d);

    return 0;
}

Running it with the command below:

clang -Wall -Wextra -pedantic -std=c89 foo.c && ./a.out

outputs:

-2 254
-2 254

To better address your confusion, please take a look at the following code.

#include <stdio.h>

int main()
{
    signed char c;
    unsigned char d;

    c = (signed char) -2;
    d = (unsigned char) c;
    printf("%d %d %u %u\n", c, d, c, d);

    d = (unsigned char) 254;
    c = (signed char) d;
    printf("%d %d %u %u\n", c, d, c, d);

    return 0;
}

Running it with the command below:

clang -Wall -Wextra -pedantic -std=c89 foo.c && ./a.out

outputs:

-2 254 4294967294 254
-2 254 4294967294 254

Or running it with the command below:

gcc -g -o foo foo.c && ./foo

outputs:

-2 254 4294967294 254
-2 254 4294967294 254

Output is right.

More detailed explanation:


Variables c and d have character types, but %u prints an unsigned int. How should the compiler proceed? Again, the answer is sign extension and zero extension.

When we examine the disassembly, we do indeed find sign extension and zero extension.

Values are assigned to variables c and d as char (BYTE) values, but before the printf call that prints c and d, there are instructions like:

movzx  esi,BYTE PTR [rbp-0x1]
movsx  ecx,BYTE PTR [rbp-0x2]
movzx  edx,BYTE PTR [rbp-0x1]
movsx  eax,BYTE PTR [rbp-0x2]

movzx is zero extension, and movsx is sign extension. Registers such as esi, ecx, edx, and eax are the same size as int here (ecx occupies 4 bytes, and int also occupies 4 bytes).

So the value 0xFE stored in variable c is transformed to 0xFFFFFFFE (held in ecx or eax) by sign extension, and the value 0xFE stored in variable d is transformed to 0x000000FE (held in esi or edx) by zero extension. 0xFFFFFFFE prints as 4294967294 with %u and as -2 with %d; 0x000000FE prints as 254 with both %u and %d.
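The large %u number can also be obtained with an explicit conversion instead of a mismatched format specifier. In this sketch the name promoted_as_unsigned is made up; it promotes the signed char to int and then converts to unsigned int, which reduces the value modulo UINT_MAX + 1.

```c
#include <limits.h>

/* Hypothetical helper: the unsigned int that corresponds to the
   promoted value of c. */
static unsigned int promoted_as_unsigned(signed char c)
{
    int promoted = c;                /* integer promotion keeps the value */
    return (unsigned int) promoted;  /* well defined: reduced modulo UINT_MAX + 1 */
}
```

For c equal to -2 the result is UINT_MAX - 1, which is 4294967294 when unsigned int is 32 bits wide.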

As a 32-bit unsigned int, 4294967294 is 0xFFFFFFFE: every bit is set except the lowest one.

As a 32-bit two's complement int, the same bit pattern 0xFFFFFFFE is -2.

So you can see that printing the value of variable c or variable d with %d and with %u can yield different results, yet both refer to the same value stored in memory. The key point is how you choose to interpret the value of c or d.

0
supercat On

The authors of the Standard almost certainly expected that an implementation would implement conversions between signed and unsigned character types in such a manner that round-trip conversions between them would be value-preserving on any implementation which did not have a compelling reason for handling them in some other fashion, and almost certainly expected that implementations with such a reason, if they existed at all, would be quite rare. There was thus no need for the Committee to worry about whether an implementation that had a good reason for processing such conversions in an unusual manner should be required to process them in value-preserving fashion anyhow. If no implementation would actually have a good reason to deviate from the common behavior, nobody should care whether the Standard mandates the commonplace treatment, and if an implementation did have a good reason to deviate, people working with it would be better placed than the Committee to judge the pros and cons of such a deviation.
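Whether a particular implementation behaves in the commonplace way can be checked at run time. The following is only a sketch, with the made-up name roundtrip_preserves_values, and it assumes UCHAR_MAX < UINT_MAX so the loop counter can pass the last value.

```c
#include <limits.h>

/* Hypothetical check: returns 1 if unsigned char -> signed char ->
   unsigned char preserves every value on this implementation, else 0.
   Assumes UCHAR_MAX < UINT_MAX. */
static int roundtrip_preserves_values(void)
{
    unsigned int v;

    for (v = 0; v <= UCHAR_MAX; v++) {
        unsigned char d = (unsigned char) v;
        signed char c = (signed char) d;  /* implementation-defined when d > SCHAR_MAX */

        if ((unsigned char) c != d)       /* conversion back to unsigned is well defined */
            return 0;
    }
    return 1;
}
```

On mainstream two's complement compilers this is expected to return 1, but that expectation rests on the implementation's documented behavior rather than on a guarantee in C89.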