clang vs gcc for copying 3 bytes on x86_64 - number of mov's


What should optimized compiled code for copying 3 bytes from one place to another, say, using memcpy(,,3) look like, in terms of assembly instructions?

Consider the following program:

#include <string.h>
int main() {
  int* p = (int*) 0x10;
  int x = 0;
  memcpy(&x, p, 4);
  x = x * (x > 1 ? 2 : 3);
  memcpy(p, &x, 4);  
  return 0;
}

It's a bit contrived and will cause a segmentation violation, but I need those instructions so that compiling with -O3 doesn't make all of it go away. When I compile this (GodBolt, GCC 6.3 -O3), I get:

main:
        mov     edx, DWORD PTR ds:16
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

great - a single mov of a DWORD (= 4 bytes) from memory to a register. Nice and optimized. Now let's change memcpy(&x, p, 4) into memcpy(&x, p, 3). The compilation result becomes:

main:
        mov     DWORD PTR [rsp-4], 0
        movzx   eax, WORD PTR ds:16
        mov     WORD PTR [rsp-4], ax
        movzx   eax, BYTE PTR ds:18
        mov     BYTE PTR [rsp-2], al
        mov     edx, DWORD PTR [rsp-4]
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

I'm not much of an expert on Intel X86_64 assembly (read: I can't even read it properly when it's complicated), so - I don't quite get this. I mean, I can see what's happening in the first 6 instructions, but why are so many of them necessary? Why aren't two moves sufficient? A mov WORD PTR into al and a mov BYTE PTR into ah?

... so, I came here to ask. As I was writing the question I noticed GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this:

main:                                   # @main
        movzx   eax, byte ptr [18]
        shl     eax, 16
        movzx   ecx, word ptr [16]
        or      ecx, eax
        cmp     ecx, 2
        sbb     eax, eax
        and     eax, 1
        or      eax, 2
        imul    eax, ecx
        mov     dword ptr [16], eax
        xor     eax, eax
        ret

which looks more like what I expected. What explains the difference?

Notes:

  • It's essentially the same behavior if I don't initialize x to 0.
  • Other GCC versions do about the same thing as GCC 6.3, but GCC 7 is down to 5 instead of 6 mov's.
  • Other versions of clang (starting from 3.4) do about the same thing.
  • The behavior is similar if we forgo memcpy'ing in favor of the following:

    #include <string.h>
    
    typedef struct {
      unsigned char data[3];
    }  uint24_t;
    
    int main() {
      uint24_t* p = (uint24_t*) 0x30;
      int x = 0;
      *((uint24_t*) &x) = *p;
      x = x * (x > 1 ? 2 : 3);
      *p = *((uint24_t*) &x);
      return 0;
    } 
    
  • If you want to contrast with what happens when the relevant code is in a function, have a look at this or the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.
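
    For instance, a minimal sketch of such a function version (not the exact code behind the Godbolt links, just an illustration) could be:

    #include <string.h>

    /* Same operations as in main above, but on a pointer parameter
       instead of a hard-coded address. */
    int f(unsigned char *p) {
      int x = 0;
      memcpy(&x, p, 3);            /* 3-byte load  */
      x = x * (x > 1 ? 2 : 3);
      memcpy(p, &x, 3);            /* 3-byte store */
      return x;
    }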

There are 3 answers below.

Accepted answer (score 10)

The size three is an ugly size and compilers are not perfect.

The compiler cannot invent an access to a memory location you haven't requested (reading a fourth byte could touch an unmapped page), so it needs two loads: a word and a byte.
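
To see why the extra byte matters, here is a minimal sketch of my own (assuming POSIX mmap/mprotect) where the 3-byte source ends exactly at a page boundary: the 3-byte memcpy is fine, but a 4-byte load would read into an unmapped page and fault.

    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
      long page = sysconf(_SC_PAGESIZE);
      /* Two adjacent pages; the second one is made inaccessible. */
      unsigned char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (base == MAP_FAILED) return 1;
      mprotect(base + page, page, PROT_NONE);

      /* Three readable bytes ending exactly at the page boundary. */
      unsigned char *src = base + page - 3;
      src[0] = 1; src[1] = 2; src[2] = 3;

      int x = 0;
      memcpy(&x, src, 3);      /* OK: touches only the 3 mapped bytes         */
      /* memcpy(&x, src, 4);      would read into the PROT_NONE page: SIGSEGV */
      return x & 0xff;
    }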

While it seems trivial to you, remember that you asked for memcpy(&x, p, 3), which is a copy from memory to memory.
Evidently GCC and older versions of Clang are not smart enough to figure out that there is no reason to go through a temporary in memory.

What GCC is doing with the first six instructions is basically constructing a DWORD at [rsp-4] out of the three bytes you requested:

mov     DWORD PTR [rsp-4], 0              ;DWORD is 0

movzx   eax, WORD PTR ds:16               ;EAX = byte 0 and byte 1
mov     WORD PTR [rsp-4], ax              ;DWORD has byte 0 and byte 1

movzx   eax, BYTE PTR ds:18               ;EAX = byte 2
mov     BYTE PTR [rsp-2], al              ;DWORD has byte 0, byte 1 and byte 2

mov     edx, DWORD PTR [rsp-4]            ;As previous from henceon
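
In C terms the same dance looks roughly like this (a sketch of my own, with a generic src pointer in place of the absolute address 16; the function name is just for illustration):

    #include <string.h>

    /* Roughly what GCC's six instructions do for the 3-byte copy:
       build the DWORD in a stack temporary, then reload it. */
    static int load3_like_gcc(const unsigned char *src) {
      unsigned int tmp = 0;                           /* mov DWORD PTR [rsp-4], 0   */
      memcpy(&tmp, src, 2);                           /* word load + word store     */
      memcpy((unsigned char *)&tmp + 2, src + 2, 1);  /* byte load + byte store     */
      return (int)tmp;                                /* mov edx, DWORD PTR [rsp-4] */
    }

    int main(void) {
      unsigned char buf[3] = {0x10, 0x00, 0x00};      /* little-endian 16 */
      return load3_like_gcc(buf) == 16 ? 0 : 1;
    }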

It is using movzx eax, ... to prevent a partial register stall.

The compilers already did a great job by eliding the call to memcpy, and as you said the example is "a bit contrived", hard to follow even for a human. The memcpy optimisations must work for any size, including sizes that don't fit in a register. It's not easy to get it right every time.

Considering that L1 access latencies have been lowered considerably in recent architectures, and that [rsp-4] is very likely to be in the cache, I'm not sure it's worth messing with the optimisation code in the GCC source.
It is surely worth filing a bug for a missed optimisation and seeing what the developers have to say.

Answer (score 0)

(Not a real answer, as I can't add anything to what others have already answered; just an example of how I would write such code by hand ... probably mostly out of my own curiosity.)

If the function is:

f(24b unsigned n):

  • f(0) → 0
  • f(1) → 3
  • f(n) → n*2, n > 1

(which is what your question looks like to me).

Then in hand-written assembly (NASM syntax) I would do this:

    mov     eax,[16]    ; reads 4 bytes from address 16

    ; f(n) starts here, n = low 24b of eax, modifies edx
    xor     edx,edx
    and     eax,0x00FFFFFF
    dec     eax
    setz    dl
    lea     eax,[edx+2*eax+2]
    ; output = low 24b of eax, b24..b31 undefined

    ; writes 3 bytes back to address 16
    mov     [16],ax
    shr     eax,16
    mov     [18],al
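
For reference, the same f in C (a quick sketch to check the branchless arithmetic above against):

    #include <stdio.h>

    /* f over a 24-bit unsigned n: f(0) = 0, f(1) = 3, f(n) = 2n for n > 1. */
    static unsigned f(unsigned n) {
      n &= 0x00FFFFFFu;               /* keep only the low 24 bits */
      return n * (n > 1 ? 2u : 3u);
    }

    int main(void) {
      printf("%u %u %u\n", f(0), f(1), f(5));   /* expected: 0 3 10 */
      return 0;
    }
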
Answer (score 1)

You should get much better code from copying 4 bytes and masking off the top one, e.g. with x & 0x00ffffff. That lets the compiler know it's allowed to read 4 bytes, not just the 3 that the C source reads.

Yes, that helps a ton: it saves gcc and clang from storing a 4B zero, then copying three bytes and reloading 4. They just load 4, mask, store, and use the value that's still in the register. Part of that might be from not knowing if *p aliases *q.

int foo(int *p, int *q) {
  //*p = 0;
  //memcpy(p, q, 3);
  *p = (*q)&0x00ffffff;
  return *p;
}

    mov     eax, DWORD PTR [rsi]     # load
    and     eax, 16777215            # mask
    mov     DWORD PTR [rdi], eax     # store
    ret                              # and leave it in eax as return value

Why aren't two moves sufficient? A mov WORD PTR into al followed by a mov BYTE PTR into ah?

AL and AH are 8-bit registers. You can't put a 16-bit word into AL. This is why your last clang-output block loads two separate registers and merges with a shift+or, in the case where it knows it's allowed to mess with all 4 bytes of x.

If you were merging two separate one-byte values, you could load them to AL and AH and then use AX, but that leads to a partial-register stall on Intel pre-Haswell.

You could do a word-load into AX (or preferably movzx into eax for various reasons including correctness and avoiding a false dependency on the old value of EAX), left-shift EAX, and then a byte-load into AL.

But compilers don't tend to do that, because partial-register stuff has been very bad juju for many years, and is only efficient on recent CPUs (Haswell, and maybe IvyBridge). It would cause serious stalls on Nehalem and Core2. (See Agner Fog's microarch pdf; search for partial-register or look for it in the index. See other links in the tag wiki.) Maybe in a few years, -mtune=haswell will enable partial-register tricks to save the OR instruction that clang uses to merge.
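
For reference, clang's shift+or merge from the question, spelled out in C (a sketch; clang does the two loads with movzx rather than byte-wise accesses, and the function name is just for illustration):

    #include <stdio.h>

    /* Build a 32-bit value from 3 bytes the way clang does: load the low
       word and the high byte separately, then merge with shift + or. */
    static unsigned load3_shift_or(const unsigned char *p) {
      unsigned lo = (unsigned)p[0] | ((unsigned)p[1] << 8);   /* movzx word */
      unsigned hi = (unsigned)p[2];                           /* movzx byte */
      return (hi << 16) | lo;                 /* shl 16 + or; top byte is 0 */
    }

    int main(void) {
      unsigned char b[3] = {0x78, 0x56, 0x34};
      printf("0x%06x\n", load3_shift_or(b));  /* prints 0x345678 */
      return 0;
    }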


Instead of writing such a contrived function:

Write functions that take args and return a value, so you don't have to make them super-weird to keep them from being optimized away. E.g. a function that takes two int* args and does a 3-byte memcpy between them.

See this on Godbolt (with gcc and clang), with colour highlighting:

void copy3(int *p, int *q) { memcpy(p, q, 3); }

 clang3.9 -O3 does exactly what you expected: a byte and a word copy.
    mov     al, byte ptr [rsi + 2]
    mov     byte ptr [rdi + 2], al
    movzx   eax, word ptr [rsi]
    mov     word ptr [rdi], ax
    ret

To get the silliness you managed to generate, zero the destination first and then read it back after a three-byte copy:

int foo(int *p, int *q) {
  *p = 0;
  memcpy(p, q, 3);
  return *p;
}

  clang3.9 -O3
    mov     dword ptr [rdi], 0       # *p = 0
    mov     al, byte ptr [rsi + 2]
    mov     byte ptr [rdi + 2], al   # byte copy
    movzx   eax, word ptr [rsi]
    mov     word ptr [rdi], ax       # word copy
    mov     eax, dword ptr [rdi]     # read the whole thing, causing a store-forwarding stall
    ret

gcc doesn't do any better (except on CPUs that don't rename partial regs, since it avoids a false dependency on the old value of EAX by using movzx for the byte copy as well).