I'm trying to identify a performance baseline for memory-bound vectorized loops. I'm doing this on an Intel Broadwell chip with AVX2 instructions in a 32byte aligned environment.
A baseline loop uses 8 YMM registers at a time to load from one location and nontemporally store to another:
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignement
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
mov ebx, 111 ; IACA PREFIX
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovapd ymm0, ymmword ptr [ (32) + rdi +8*rax]
vmovapd ymm2, ymmword ptr [ (64) + rdi +8*rax]
vmovapd ymm4, ymmword ptr [ (96) + rdi +8*rax]
vmovapd ymm6, ymmword ptr [ (128) + rdi +8*rax]
vmovapd ymm8, ymmword ptr [ (160) + rdi +8*rax]
vmovapd ymm10, ymmword ptr [ (192) + rdi +8*rax]
vmovapd ymm12, ymmword ptr [ (224) + rdi +8*rax]
vmovapd ymm14, ymmword ptr [ (256) + rdi +8*rax]
vmovntpd ymmword ptr [ (32) + rsi +8*rax], ymm0
vmovntpd ymmword ptr [ (64) + rsi +8*rax], ymm2
vmovntpd ymmword ptr [ (96) + rsi +8*rax], ymm4
vmovntpd ymmword ptr [ (128) + rsi +8*rax], ymm6
vmovntpd ymmword ptr [ (160) + rsi +8*rax], ymm8
vmovntpd ymmword ptr [ (192) + rsi +8*rax], ymm10
vmovntpd ymmword ptr [ (224) + rsi +8*rax], ymm12
vmovntpd ymmword ptr [ (256) + rsi +8*rax], ymm14
add rax, (4*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
ret
I assemble it with YASM and then do a test with the Intel Architecture Code Analyzer (IACA), which tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: PORT2_AGU, PORT3_AGU, Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 8.0 4.0 | 8.0 4.0 | 8.0 | 0.5 | 0.5 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm0, ymmword ptr [rdi+rax*8+0x20]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm2, ymmword ptr [rdi+rax*8+0x40]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm4, ymmword ptr [rdi+rax*8+0x60]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm6, ymmword ptr [rdi+rax*8+0x80]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm8, ymmword ptr [rdi+rax*8+0xa0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm10, ymmword ptr [rdi+rax*8+0xc0]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm12, ymmword ptr [rdi+rax*8+0xe0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm14, ymmword ptr [rdi+rax*8+0x100]
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x20], ymm0
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x40], ymm2
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x60], ymm4
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x80], ymm6
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xa0], ymm8
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xc0], ymm10
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xe0], ymm12
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x100], ymm14
| 1 | | 0.5 | | | | 0.5 | | | | add rax, 0x20
| 1 | 0.5 | | | | | | 0.5 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff78
I was under the impression that I could get 2x loads at a time with broadwell with simultaneous loads on ports 2&3. Why isn't that happening?
Thanks
UPDATE
Per the suggestion below, pd's were replaced with ps's and the addresses were consolidated to a single register, the new code looks like:
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignement
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
xor rbx,rbx
xor rcx,rcx
or rbx, rdi
or rcx, rsi
mov ebx, 111 ; IACA PREFIX
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovaps ymm0, ymmword ptr [ (32) + rbx ]
vmovaps ymm2, ymmword ptr [ (64) + rbx ]
vmovaps ymm4, ymmword ptr [ (96) + rbx ]
vmovaps ymm6, ymmword ptr [ (128) + rbx ]
vmovaps ymm8, ymmword ptr [ (160) + rbx ]
vmovaps ymm10, ymmword ptr [ (192) + rbx ]
vmovaps ymm12, ymmword ptr [ (224) + rbx ]
vmovaps ymm14, ymmword ptr [ (256) + rbx ]
vmovntps ymmword ptr [ (32) + rcx], ymm0
vmovntps ymmword ptr [ (64) + rcx], ymm2
vmovntps ymmword ptr [ (96) + rcx], ymm4
vmovntps ymmword ptr [ (128) + rcx], ymm6
vmovntps ymmword ptr [ (160) + rcx], ymm8
vmovntps ymmword ptr [ (192) + rcx], ymm10
vmovntps ymmword ptr [ (224) + rcx], ymm12
vmovntps ymmword ptr [ (256) + rcx], ymm14
add rax, (4*8)
add rbx, (4*8*8)
add rcx, (4*8*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
ret
Then IACA tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 5.3 4.0 | 5.3 4.0 | 8.0 | 1.0 | 1.0 | 5.3 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm0, ymmword ptr [rbx+0x20]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm2, ymmword ptr [rbx+0x40]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm4, ymmword ptr [rbx+0x60]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm6, ymmword ptr [rbx+0x80]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm8, ymmword ptr [rbx+0xa0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm10, ymmword ptr [rbx+0xc0]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm12, ymmword ptr [rbx+0xe0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm14, ymmword ptr [rbx+0x100]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x20], ymm0
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x40], ymm2
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x60], ymm4
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x80], ymm6
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xa0], ymm8
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xc0], ymm10
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xe0], ymm12
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0x100], ymm14
| 1 | 1.0 | | | | | | | | | add rax, 0x20
| 1 | | 1.0 | | | | | | | | add rbx, 0x100
| 1 | | | | | | 1.0 | | | | add rcx, 0x100
| 1 | | | | | | | 1.0 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff7a
Which tells me that the stores can now use port 7 for addresses and the operations were stored. IACA tells me that the "Block throughput" is still 8 operations due to the extra operations to get the addresses onto a single register. Maybe I'm doing this wrong?
I still don't understand why the load operations cannot be fused
The store-AGU on port7 can only handle "simple" effective addresses, so your stores also need the AGU on the load ports. IACA does show your loads not actually competing with each other; it's the stores that are competing.
Note that there are only ~10 fill buffers per core for MOVNT stores, so those will very quickly fill up and be the bottleneck.
See also Micro fusion and addressing modes. Your stores could micro-fuse and take fewer fused-domain uops if you used a one-register addressing mode for them.
Also, I guess it doesn't matter for VEX-coded instructions, but the SSE
pd
versions take an extra byte of x86 machine code.clang
tends to usemovaps
for loads/stores because it's shorter, even on integer vectors. Every existing CPU runsmovaps
/movapd
the same. So I'd suggest just usingvmovaps
/vmovntps
. It won't make any difference at all, though. Just one less set bit in the VEX prefix.