Why is the "mov" with complex addressing faster than the corresponding "lea"?

I looked in the instruction tables and found that on Coffee Lake, a lea with 3 components (base + index + displacement) has an RThroughput (reciprocal throughput) of 1 cycle. That seemed very slow to me, so I guessed that a mov using the same complex addressing mode would have an RThroughput of 1 or more as well. To my surprise, the mov with complex addressing is actually faster than the lea, which confuses me a lot.

Below is the test code I used. My computer’s microarchitecture is Comet Lake, which is not much different from Coffee Lake.

; lea version: 8 * 10^9 cycles
mov ecx, 1000000000
xor rax, rax
sub rsp, 40
.align 32
loop:
    lea r8, [rsp + rax + 4]
    lea r9, [rsp + rax + 8]
    lea r10, [rsp + rax + 12]
    lea r11, [rsp + rax + 16]
    lea r12, [rsp + rax + 20]
    lea r13, [rsp + rax + 24]
    lea r14, [rsp + rax + 28]
    lea r15, [rsp + rax + 32]
    sub ecx, 1
    jnz loop
add rsp, 40

; mov version: 4 * 10^9 cycles
mov ecx, 1000000000
xor rax, rax
sub rsp, 40
.align 32
loop:
    mov r8d, DWORD PTR [rsp + rax + 4]
    mov r9d, DWORD PTR [rsp + rax + 8]
    mov r10d, DWORD PTR [rsp + rax + 12]
    mov r11d, DWORD PTR [rsp + rax + 16]
    mov r12d, DWORD PTR [rsp + rax + 20]
    mov r13d, DWORD PTR [rsp + rax + 24]
    mov r14d, DWORD PTR [rsp + rax + 28]
    mov r15d, DWORD PTR [rsp + rax + 32]
    sub ecx, 1
    jnz loop
add rsp, 40
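
To compare the two measurements with the RThroughput column in the instruction tables, the totals can be divided by the number of instructions executed. Here is a minimal sketch of that arithmetic in Python. It assumes the cycle counts in the comments above are totals for the whole run and that the sub/jnz pair is not the bottleneck; the names are only for illustration.

```python
# Convert the measured totals into cycles per instruction.
# Assumption: the cycle counts in the code comments are totals for the
# whole run, and the sub/jnz pair does not limit the loop.

ITERATIONS = 1_000_000_000   # value loaded into ecx
INSTRS_PER_ITER = 8          # eight lea (or eight mov) per loop iteration


def cycles_per_instruction(total_cycles: float) -> float:
    """Average cycles per lea/mov, i.e. the observed reciprocal throughput."""
    return total_cycles / (ITERATIONS * INSTRS_PER_ITER)


lea_rthroughput = cycles_per_instruction(8e9)  # lea loop
mov_rthroughput = cycles_per_instruction(4e9)  # mov loop

print(f"lea: {lea_rthroughput} cycles per instruction")
print(f"mov: {mov_rthroughput} cycles per instruction")
```

Under those assumptions the lea loop comes out at 1 cycle per instruction, matching the table, while the mov loop comes out at 0.5, i.e. two loads per cycle.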