How are char arrays / strings stored in binary files?


When I compile this code using different compilers and inspect the output in a hex editor I am expecting to find the string "Nancy" somewhere.

#include <stdio.h>

int main()
{
    char temp[6] = "Nancy";
    printf("%s", temp);

    return 0;
}
  1. The output file for gcc -o main main.c looks like this:

    [hex dump screenshot not preserved]

  2. In the output for g++ -o main main.c, I can't seem to find "Nancy" anywhere.

  3. Compiling the same code in Visual Studio (MSVC 1929), I see the full string in a hex editor:

Why do I get some random bytes in the middle of the string in (1)?

There are 3 best solutions below

8
On

There is no single rule about how a compiler stores data in the output files it produces.

Data can be stored in a “constant” section.

Data can be built into the “immediate” operands of instructions, in which data is encoded in various fields of the bits that encode an instruction.

Data can be computed from other data by instructions generated by the compiler.

I suspect that where you see “Nanc” in one place and “y” in another, the compiler has used two load instructions (possibly written as “mov”): one carries the bytes forming “Nanc” as an immediate operand, and the other carries the bytes forming “y” plus a trailing null character. Other instructions store the loaded data on the stack and pass its address to printf.
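
To make the immediate-operand idea concrete, here is a minimal sketch of how a run of characters maps to the integer a compiler could embed in an instruction. It assumes a little-endian target such as x86, and pack_le is a hypothetical helper written for this illustration, not something the compiler or the question's code provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the question): pack up to four bytes into
   the integer whose little-endian encoding is exactly those bytes, i.e. the
   value a compiler could write as the immediate operand of a store. */
uint32_t pack_le(const char *bytes, size_t width)
{
    assert(bytes != NULL && width <= 4);
    uint32_t value = 0;
    for (size_t i = 0; i < width; i++) {
        /* Byte i lands at memory offset i, so it is shifted left by 8*i bits. */
        value |= (uint32_t)(unsigned char)bytes[i] << (8 * i);
    }
    return value;
}
```

Calling pack_le("Nanc", 4) gives 0x636E614E, and pack_le("y", 2) gives 0x79, because the second byte read is the literal's terminating null. Two stores of those values at adjacent addresses would reproduce the six bytes of the array.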

You have not provided enough information to diagnose the g++ case: You did not name the compiler or its version number or provide any part of the generated output.

0
On

Generally a compiled program is split into different types of "section". The assembler file will use directives to switch between them.

  • Code (".text")
  • Static read-only data (".section .rodata")
  • Initialised global or static variables (".data")
  • Uninitialised (or zero-initialized) global or static variables (".bss")

String literals in C can be used in two different ways.

  • As a pointer to constant data.
  • As an initialiser for an array.

If a string literal is used as a pointer then it is likely the compiler will place the string data in the read only data section.

If a string literal is used to initialise a global/static array then it is likely the compiler will place the array in the initialised data section (or the read-only data section if the array is declared as const).

However in your case the array you are initialising is an automatic local variable. So it can't be pre-initialised before program start. The compiler must include code to initialise it each time your function runs.

The compiler might choose to do that by storing the string in a read-only data location and then using a copy routine (either inlined or a call) to copy it to the local array. It may choose to simply generate instructions to set the elements of the array one by one. It may choose to generate instructions that set several array elements at the same time.
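
The cases described above can be sketched side by side in C. The variable and function names below are made up for illustration, and the section named in each comment is only the typical placement; the actual choice is up to the compiler and target.

```c
#include <assert.h>
#include <string.h>

/* Pointer use: the characters typically live in read-only data, and only
   the pointer variable itself sits in the initialised data section. */
const char *name_ptr = "Nancy";

/* Global array: typically emitted as six bytes in the initialised data
   section, so the text is contiguous in the file. */
char name_global[6] = "Nancy";

/* const global array: typically emitted into read-only data. */
const char name_const[6] = "Nancy";

/* Automatic array: its storage is created on every call, so the compiler
   must emit code that fills it in at run time. Whether that code copies
   from read-only data or stores immediates is the compiler's choice. */
void fill_from_local(char *out, size_t out_size)
{
    char temp[6] = "Nancy";
    assert(out != NULL && out_size >= sizeof temp);
    memcpy(out, temp, sizeof temp);
}
```

Compiling a file like this with gcc -S and looking at the section directives around each symbol is one way to check which placement a given compiler actually picked.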

In your example it looks like MSVC has chosen to use a copy routine, so the string appears contiguously in the file. gcc, on the other hand, has chosen to use a 4-byte move instruction followed by a 2-byte move instruction, both with immediate values as inputs, so the string is split into two parts.

P.S. I've noticed some people posting https://godbolt.org/ links on other answers to this question. The Compiler Explorer is a useful tool, but be aware that it hides the section-switching directives from the assembler output by default.

4
On

I reproduced it, using gcc 9.3.0 (Linux Mint 20.2), on x86-64 system (Intel

Result of hexdump -C:

[hexdump screenshot not preserved]

Note the byte sequence is the same.

So I ran gcc -S -c to get the assembly:

    .file   "teststr.c"
    .text
    .section    .rodata
.LC0:
    .string "%s"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    endbr64
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movq    %fs:40, %rax
    movq    %rax, -8(%rbp)
    xorl    %eax, %eax
    movl    $1668178254, -14(%rbp) # NOTE THIS PART HERE
    movw    $121, -10(%rbp)        # AND HERE
    leaq    -14(%rbp), %rax
    movq    %rax, %rsi
    leaq    .LC0(%rip), %rdi
    movl    $0, %eax
    call    printf@PLT
    movl    $0, %eax
    movq    -8(%rbp), %rdx
    xorq    %fs:40, %rdx
    je  .L3
    call    __stack_chk_fail@PLT
.L3:
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
    .section    .note.GNU-stack,"",@progbits
    .section    .note.gnu.property,"a"
    .align 8
    .long    1f - 0f
    .long    4f - 1f
    .long    5
0:
    .string  "GNU"
1:
    .align 8
    .long    0xc0000002
    .long    3f - 2f
2:
    .long    0x3
3:
    .align 8
4:

The highlighted value 1668178254 is hex 636E614E or "cnaN" (which, due to the endian reversal as x86 is a little-endian system, becomes "Nanc") in ASCII encoding, and 121 is hex 79, or "y".
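
The arithmetic above can be checked with a short sketch that splits the two immediates from the listing back into bytes, lowest byte first, as a little-endian store would. The helper names unpack_le and rebuild_temp are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the low `width` bytes of `value` to `out`
   in little-endian order (lowest byte first), as an x86 store would. */
void unpack_le(uint32_t value, size_t width, char *out)
{
    assert(out != NULL && width <= 4);
    for (size_t i = 0; i < width; i++) {
        out[i] = (char)((value >> (8 * i)) & 0xFFu);
    }
}

/* Rebuild the 6-byte array from the two immediates in the listing. */
void rebuild_temp(char temp[6])
{
    unpack_le(1668178254u, 4, temp);   /* 0x636E614E -> 'N' 'a' 'n' 'c' */
    unpack_le(121u, 2, temp + 4);      /* 0x0079     -> 'y' '\0'        */
}
```

After rebuild_temp fills a 6-byte buffer, the buffer holds the same bytes as the original char temp[6] = "Nancy", including the terminating null.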

So gcc uses two move instructions with immediate operands instead of copying from a byte string stored in a data section of the file, presumably because the string is short. The intervening "garbage" is (I believe) the start of the following movw instruction: its opcode and addressing bytes sit between the two immediates. This is likely a way to optimize the initialization, versus looping byte-by-byte through memory, even though no optimization flag was given to the compiler; the compiler is free to do what it wants in this regard. Microsoft's compiler seems to be more "pedantic" here: it apparently forgoes that optimization and keeps the string contiguous.
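
As a rough illustration of where the intervening bytes come from, the sketch below lays out what the two stores would look like as machine code. The byte values are hand-assembled under the assumption of the usual x86-64 encoding for these two instructions; they are not copied from the asker's binary, and code_bytes and find_byte are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encoding of the two stores from the listing (hand-assembled for
   illustration, not taken from the asker's binary):
     movl $0x636E614E, -14(%rbp)   ->  C7 45 F2 4E 61 6E 63
     movw $0x0079,     -10(%rbp)   ->  66 C7 45 F6 79 00      */
const unsigned char code_bytes[] = {
    0xC7, 0x45, 0xF2, 0x4E, 0x61, 0x6E, 0x63,
    0x66, 0xC7, 0x45, 0xF6, 0x79, 0x00
};
const size_t code_len = sizeof code_bytes;

/* Return the offset of the first occurrence of `byte` at or after `from`,
   or `len` if it does not occur. */
size_t find_byte(const unsigned char *buf, size_t len, size_t from,
                 unsigned char byte)
{
    assert(buf != NULL);
    for (size_t i = from; i < len; i++) {
        if (buf[i] == byte) {
            return i;
        }
    }
    return len;
}
```

With these bytes, 'N' is found at offset 3 and 'y' at offset 11. The four bytes between "Nanc" and "y" are 66 C7 45 F6, which under this assumed encoding are the operand-size prefix, opcode, ModRM byte and stack displacement of the movw instruction.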