Given the following C program (MSVC does not optimize away the "work" for me; for other compilers you may need to add an asm statement, as sketched after the listing):
#include <inttypes.h>
#include <stdlib.h>

#define SIZE 10000

typedef struct {
    int32_t a, b, c;
} Struct;

void do_work(Struct* data) {
    int32_t* a = malloc(sizeof(int32_t) * SIZE),
           * b = malloc(sizeof(int32_t) * SIZE),
           * c = malloc(sizeof(int32_t) * SIZE);
    int32_t* a_ptr = a, * b_ptr = b, * c_ptr = c;

    for (size_t i = 0; i < SIZE; i++, a_ptr++, b_ptr++, c_ptr++, data++) {
        *a_ptr = data->a;
        *b_ptr = data->b;
        *c_ptr = data->c;
    }

    free(a);
    free(b);
    free(c);
}

int main() {
    Struct* data = malloc(sizeof(Struct) * SIZE);
    for (size_t i = 0; i < SIZE; i++) {
        data[i].a = i;
        data[i].b = i;
        data[i].c = i;
    }
    for (int i = 0; i < 500000; i++) {
        do_work(data);
    }
    free(data);
}
Edit: Disassembly of do_work():
do_work PROC ; COMDAT
$LN12:
mov QWORD PTR [rsp+8], rbx
mov QWORD PTR [rsp+16], rbp
mov QWORD PTR [rsp+24], rsi
push rdi
sub rsp, 32 ; 00000020H
mov rbx, rcx
mov ecx, 40000 ; 00009c40H
call QWORD PTR __imp_malloc
mov ecx, 40000 ; 00009c40H
mov rsi, rax
call QWORD PTR __imp_malloc
mov ecx, 40000 ; 00009c40H
mov rbp, rax
call QWORD PTR __imp_malloc
mov r10, rsi
lea rcx, QWORD PTR [rbx+8]
sub r10, rax
mov r11, rbp
sub r11, rax
mov rdi, rax
mov r8, rax
mov r9d, 10000 ; 00002710H
npad 6
$LL4@do_work:
mov edx, DWORD PTR [rcx-8]
lea rcx, QWORD PTR [rcx+12]
mov DWORD PTR [r10+r8], edx
lea r8, QWORD PTR [r8+4]
mov eax, DWORD PTR [rcx-16]
mov DWORD PTR [r11+r8-4], eax
mov eax, DWORD PTR [rcx-12]
mov DWORD PTR [r8-4], eax
sub r9, 1
jne SHORT $LL4@do_work
mov rcx, rsi
call QWORD PTR __imp_free
mov rcx, rbp
call QWORD PTR __imp_free
mov rcx, rdi
mov rbx, QWORD PTR [rsp+48]
mov rbp, QWORD PTR [rsp+56]
mov rsi, QWORD PTR [rsp+64]
add rsp, 32 ; 00000020H
pop rdi
rex_jmp QWORD PTR __imp_free
do_work ENDP
Here's a godbolt: https://godbolt.org/z/nWTPfKPd6.
(I have a similar program in Rust that leads to the same conclusions.)
Intel VTune reports that this program is 63.1% memory bound and 52.4% store bound, with a store latency of 26%. It recommends searching for false sharing, but I fail to see how there could be false sharing here. There is no concurrency, all data is owned by one core, and the access patterns should be easy to predict and prefetch. I don't see why the CPU needs to stall on the stores here.
I thought that maybe the low and high bits of the three allocations' addresses are the same, causing them to map to the same cache sets, but I remember reading that modern CPUs don't simply drop some address bits to pick a cache set and instead do more complex calculations.
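One way to check this directly is to print the low 12 bits of each allocation's address; buffers that agree in those bits are 4 KiB apart modulo the page, which is the pattern that set indexing and 4K-aliasing checks care about. A minimal standalone sketch (separate from the benchmark above):

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE 10000

int main(void) {
    int32_t* a = malloc(sizeof(int32_t) * SIZE);
    int32_t* b = malloc(sizeof(int32_t) * SIZE);
    int32_t* c = malloc(sizeof(int32_t) * SIZE);

    /* Bits 0-11 are the offset within a 4 KiB page; on typical Intel
     * L1D caches the set index lives inside this range. Buffers that
     * agree here are candidates for set conflicts / 4K aliasing. */
    printf("a: 0x%03" PRIxPTR "\n", (uintptr_t)a & 0xFFF);
    printf("b: 0x%03" PRIxPTR "\n", (uintptr_t)b & 0xFFF);
    printf("c: 0x%03" PRIxPTR "\n", (uintptr_t)c & 0xFFF);

    free(a);
    free(b);
    free(c);
}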
Another thought was that maybe, after the allocations are freed, the CPU is still busy flushing the stores, and on the next run the allocator hands back the same (or a nearby) address, so the CPU has to wait before storing new data. So I tried not freeing the allocations, but that made the code much slower.
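A leak-free variant of that experiment would be to allocate the three buffers once in main and reuse them, so the store addresses are guaranteed identical on every one of the 500000 runs. A sketch, where do_work_into is a made-up name for the modified function:

#include <stdint.h>
#include <stddef.h>

#define SIZE 10000
typedef struct { int32_t a, b, c; } Struct;  /* as in the question */

/* Hypothetical variant of do_work: writes into caller-owned buffers
 * instead of allocating, so main can malloc a, b, and c once and
 * free them once at the end. */
void do_work_into(const Struct* data, int32_t* a, int32_t* b, int32_t* c) {
    for (size_t i = 0; i < SIZE; i++) {
        a[i] = data[i].a;
        b[i] = data[i].b;
        c[i] = data[i].c;
    }
}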
I'm on Windows 11, on a laptop with an Intel Core i9-13900HX: 32 logical cores, 8 Performance cores, and 16 Efficient cores.
Most likely the int32_t * SIZE in the malloc call. If you use a bit-shift like SIZE << 2 instead, your code should be much faster and more efficient.

OK, in all seriousness: a_ptr, b_ptr, and c_ptr point into three separate blocks of memory. In your loop you access one, then the other, then the other, then repeat. I hear that is, like, bad for cache or something, because you are jumping around in memory instead of accessing it sequentially. You could test this by copying one field at a time, as in the sketch below.

I am skeptical that, even if a human could predict that pattern, a CPU could.
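If the three interleaved store streams really are the issue, splitting the copy into one pass per destination array should show it. A sketch under the question's definitions (do_work_split is a made-up name, not the OP's function):

#include <stdint.h>
#include <stddef.h>

#define SIZE 10000
typedef struct { int32_t a, b, c; } Struct;  /* as in the question */

/* One pass per destination: at any moment there is a single sequential
 * store stream instead of three interleaved ones. The trade-off is
 * that the data array is read three times. */
void do_work_split(const Struct* data, int32_t* a, int32_t* b, int32_t* c) {
    for (size_t i = 0; i < SIZE; i++) a[i] = data[i].a;
    for (size_t i = 0; i < SIZE; i++) b[i] = data[i].b;
    for (size_t i = 0; i < SIZE; i++) c[i] = data[i].c;
}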