I am starting to use functions like _mm_clflush, _mm_clflushopt, and _mm_clwb.
Say I have defined a struct named mystruct whose size is 256 bytes, and my cache line size is 64 bytes. I want to flush the cache lines that contain the mystruct variable. Which of the following is the right way to do so?
_mm_clflush(&mystruct);
or
for (int i = 0; i < sizeof(mystruct)/64; i++) {
    _mm_clflush(((char *)&mystruct) + i*64);
}
The clflush CPU instruction doesn't know the size of your struct; it only flushes exactly one cache line, the one containing the byte pointed to by the pointer operand. (The C intrinsic exposes this as a const void*, but char* would also make sense, especially given the asm documentation which describes it as an 8-bit memory operand.)
You need 4 flushes 64 bytes apart, or maybe 5 if your struct isn't alignas(64) so it could have parts in 5 different lines. (You could unconditionally flush the last byte of the struct, instead of using more complex logic to check if it's in a cache line you haven't flushed yet, depending on relative cost of clflush vs. more logic and a possible branch mispredict.)
Your original loop did 4 flushes of 4 adjacent bytes at the start of your struct.
It's probably easiest to use pointer increments so the casting is not mixed up with the critical logic.
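A minimal sketch of that pointer-increment approach, under some assumptions: a 64-byte line size, a hypothetical helper name flush_object, and a pointer-plus-size interface instead of a specific struct type. It returns the number of flushes it issued purely so the behaviour is easy to check; that return value is not part of any real API.

```c
#include <assert.h>
#include <immintrin.h>   /* _mm_clflush */
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64      /* assumed cache-line size in bytes */

/* Hypothetical helper: flush every cache line overlapping [obj, obj + size).
 * Returns how many clflush instructions were issued. */
static size_t flush_object(const void *obj, size_t size)
{
    assert(size > 0);
    const char *p = (const char *)obj;
    const char *lastbyte = p + size - 1;
    size_t flushes = 0;

    /* One flush per 64-byte stride, starting at the first byte. */
    for (size_t off = 0; off < size; off += LINESIZE) {
        _mm_clflush(p + off);
        flushes++;
    }

    /* Address used by the final iteration of the loop above. */
    const char *flushed = p + ((size - 1) / LINESIZE) * LINESIZE;

    /* If the last byte lives in a different line than that address,
     * an unaligned object spilled into one more line: flush it too. */
    if ((((uintptr_t)flushed ^ (uintptr_t)lastbyte) & -(uintptr_t)LINESIZE) != 0) {
        _mm_clflush(lastbyte);
        flushes++;
    }
    return flushes;
}
```

For a 256-byte object that starts on a 64-byte boundary this issues 4 flushes; if the same object starts 8 bytes into a line, the trailing check fires and it issues 5.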
x^y is 1 in bit-positions where they differ. x & -LINESIZE discards the offset-within-line bits of the address, keeping only the line-number bits. So we can see if 2 addresses are in the same cache line or not with just XOR and TEST instructions. (Or clang optimizes that to a shorter cmp instruction.)
Or rewrite that into a single loop, using that if logic as the termination condition:
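A sketch of one way such a single loop could look, written as plain C with a pointer-and-size interface. The name flush_object_loop, the LINESIZE constant, and the returned flush count are illustrative assumptions, not the answer's original listing. The address is stepped as a uintptr_t so the loop never forms a pointer past the end of the object.

```c
#include <assert.h>
#include <immintrin.h>   /* _mm_clflush */
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64      /* assumed cache-line size in bytes */

/* Hypothetical helper: flush every cache line overlapping [obj, obj + size),
 * stopping once the line holding the last byte has been flushed.
 * Returns how many clflush instructions were issued. */
static size_t flush_object_loop(const void *obj, size_t size)
{
    assert(size > 0);
    uintptr_t addr = (uintptr_t)obj;
    const uintptr_t last = addr + size - 1;   /* address of the last byte */
    size_t flushes = 0;

    for (;;) {
        _mm_clflush((const void *)addr);
        flushes++;
        /* XOR exposes differing bits; the mask keeps only line-number bits.
         * Zero means addr and last are in the same cache line: done. */
        if (((addr ^ last) & -(uintptr_t)LINESIZE) == 0)
            break;
        addr += LINESIZE;
    }
    return flushes;
}
```

Because each step advances exactly one line and the last byte is never below the start address, the loop always terminates on the line containing the last byte, with no separate trailing check.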
I used a C++ struct foo &var reference so I could follow your &var syntax but still see how it compiles for a function taking a pointer arg. Adapting to C is straightforward.
Looping over every cache line of an arbitrary size unaligned struct
With GCC10.2 -O3 for x86-64, this compiles nicely (Godbolt)
GCC doesn't unroll, and doesn't optimize any better if you use alignas(64) struct foo{...}; unfortunately. You might use if (alignof(mystruct) >= 64) { ... } to check if special handling is needed to let GCC optimize better, otherwise just use end = p + sizeof(mystruct); or end = (const char*)(&mystruct+1) - 1; or similar.
(In C, #include <stdalign.h> to get the alignas() and alignof() macros, spelled like C++, instead of using the ISO C11 _Alignas and _Alignof keywords directly.)
Another alternative is this, but it's clunkier and takes more setup work.
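One variant that fits this description, sketched here as an assumption rather than as the answer's own code, rounds both ends of the object down to line boundaries up front and then runs a plain counted loop. The name flush_object_rounded, the LINESIZE constant, and the returned flush count are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>   /* _mm_clflush */
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64      /* assumed cache-line size in bytes */

/* Hypothetical helper: compute the first and last line start addresses
 * up front, then flush one address per line in between (inclusive).
 * Returns how many clflush instructions were issued. */
static size_t flush_object_rounded(const void *obj, size_t size)
{
    assert(size > 0);
    const uintptr_t mask = ~(uintptr_t)(LINESIZE - 1);
    uintptr_t line = (uintptr_t)obj & mask;                   /* start of first line */
    const uintptr_t end = ((uintptr_t)obj + size - 1) & mask; /* start of last line */
    size_t flushes = 0;

    for (; line <= end; line += LINESIZE) {
        /* Any address inside a line selects that whole line for clflush. */
        _mm_clflush((const void *)line);
        flushes++;
    }
    return flushes;
}
```

The extra setup is the two masking operations before the loop; in exchange the loop condition is a simple comparison of line start addresses.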
A struct that was 257 bytes would always touch exactly 5 cache lines, no checking needed. Or a 260-byte struct that's known to be aligned by 4. IDK if we can get GCC to optimize away the checks based on that.
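The arithmetic behind those fixed counts can be checked with a small helper that does no flushing at all. The names lines_touched and line_count_range are hypothetical, and a 64-byte line is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64      /* assumed cache-line size in bytes */

/* Number of cache lines overlapped by `size` bytes starting at `addr`. */
static size_t lines_touched(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first = addr / LINESIZE;
    uintptr_t last = (addr + size - 1) / LINESIZE;
    return (size_t)(last - first + 1);
}

/* Smallest and largest line count over every start offset within a line
 * that is a multiple of `align` (the guaranteed alignment of the object). */
static void line_count_range(size_t size, size_t align, size_t *min, size_t *max)
{
    assert(align > 0);
    *min = *max = lines_touched(0, size);
    for (uintptr_t off = align; off < LINESIZE; off += align) {
        size_t n = lines_touched(off, size);
        if (n < *min) *min = n;
        if (n > *max) *max = n;
    }
}
```

With these definitions, a 256-byte object gives a range of 4 to 5 lines at alignment 1 and exactly 4 at alignment 64, while 257 bytes at alignment 1 and 260 bytes at alignment 4 both give exactly 5. A 260-byte object with no alignment guarantee ranges from 5 to 6.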