I checked the instruction tables and found that on Coffee Lake, an lea with 3 components (base + index + displacement) has a reciprocal throughput (RThroughput) of 1 cycle. That seemed slow to me, so I guessed that a mov using the same complex addressing mode would have an RThroughput greater than 1. To my surprise, the mov with complex addressing is actually faster than the lea, which confuses me.
Below is the test code I used, with the measured cycle count noted in a comment above each loop. My CPU's microarchitecture is Comet Lake, which is not much different from Coffee Lake.
; lea version: the whole loop takes about 8 * 10^9 cycles
mov ecx, 1000000000
xor rax, rax
sub rsp, 40
.align 32
loop:
lea r8, [rsp + rax + 4]
lea r9, [rsp + rax + 8]
lea r10, [rsp + rax + 12]
lea r11, [rsp + rax + 16]
lea r12, [rsp + rax + 20]
lea r13, [rsp + rax + 24]
lea r14, [rsp + rax + 28]
lea r15, [rsp + rax + 32]
sub ecx, 1
jnz loop
add rsp, 40
; mov version: the whole loop takes about 4 * 10^9 cycles
mov ecx, 1000000000
xor rax, rax
sub rsp, 40
.align 32
loop:
mov r8d, DWORD PTR [rsp + rax + 4]
mov r9d, DWORD PTR [rsp + rax + 8]
mov r10d, DWORD PTR [rsp + rax + 12]
mov r11d, DWORD PTR [rsp + rax + 16]
mov r12d, DWORD PTR [rsp + rax + 20]
mov r13d, DWORD PTR [rsp + rax + 24]
mov r14d, DWORD PTR [rsp + rax + 28]
mov r15d, DWORD PTR [rsp + rax + 32]
sub ecx, 1
jnz loop
add rsp, 40
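To make the comparison concrete, here is a minimal Python sketch that converts the measured totals into cycles per instruction. It assumes the cycle counts in the comments are exact and that the sub/jnz loop overhead does not add to the bottleneck. The names ITERATIONS, INSTRUCTIONS_PER_ITERATION and cycles_per_instruction are only for illustration.

```python
# Convert the measured cycle totals into per-instruction reciprocal throughput.
# Assumption: loop overhead (sub/jnz) is not the bottleneck in either loop.

ITERATIONS = 1_000_000_000          # value loaded into ecx
INSTRUCTIONS_PER_ITERATION = 8      # eight lea (or eight mov) per iteration


def cycles_per_instruction(total_cycles):
    """Average cycles spent per lea/mov over the whole run."""
    return total_cycles / (ITERATIONS * INSTRUCTIONS_PER_ITERATION)


lea_rthroughput = cycles_per_instruction(8e9)   # lea loop total
mov_rthroughput = cycles_per_instruction(4e9)   # mov loop total

print(f"lea: {lea_rthroughput} cycles per instruction")
print(f"mov: {mov_rthroughput} cycles per instruction")
```

Under these assumptions the lea loop works out to 1 cycle per instruction, matching the table value, while the mov loop works out to 0.5 cycles per instruction, i.e. two loads per cycle.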