I have a program which allocates some memory (200 million integers), does some quick compute and then writes the data to the allocated memory.
When run on a single thread the process takes about 1 second; when run across 10 threads it takes about 77 seconds. There is no sharing between the threads: each thread allocates its own data and writes to it.
The cache lines on my machine are 64 bytes. I write the data into a 16-integer array (allocated on the stack) and then flush this array to memory via memcpy.
My hypothesis is along these lines: the aggregate processing across CPUs is faster than memory, so memory speed is the bottleneck here.
I am deliberately writing to memory 64 bytes at a time (tmp_arr) so that cache lines should not be shared across threads. But I suspect the cache lines may be thrashing, since multiple threads are trying to write to memory very quickly (the CPU compute is small). When I wrote a single integer at a time to memory instead of an array of 16 integers via memcpy, the time was 250 seconds, so pushing 16 integers (64 bits) at a time via memcpy improved speed about 3x. But 77 seconds is nearly 77x the single-threaded time, so some thrashing must be going on.
Another observation is that single-threaded execution shows high stalled-cycles-backend (30%), whereas with ten threads we see high stalled-cycles-frontend (39%). Can someone please explain why there are backend stalls with a single thread but frontend stalls with multiple threads?
I also tried to use _mm_store_si128 (vectorized instructions) but that didn't improve speed.
Here are screenshots of the times taken along with hardware counters: one-thread timings, ten-thread timings, machine diagram.
I am running on Linux with an AMD Ryzen 9 3900XT 12-core processor and DDR4 memory operating at 3600 MT/s.
This code is written to understand how fast memory writes can be in a multithreaded application that works on large data sets.
Here is the code:
#include <benchmark/benchmark.h>
#include <cmath>   // log
#include <cstdlib> // rand
#include <cstring> // memcpy

//Had tried vectorized instructions but they had no noticeable performance improvement:
//__m128i tmp_arr ;
//__m128i b = _mm_set_epi32(1,2,3,4);
//_mm_store_si128( (__m128i*) (arr + (i*16)), tmp_arr);
void runLargeMemoryAlloOutside(benchmark::State & s) {
    long size = 200'000'000; // 200 million integers; each array is about 0.8 GB
    // alignas(64) on the pointer declaration below; adding or removing it
    // had no noticeable effect on performance.
    alignas(64) int * arr = new int [size];
    // Write into this temporary array on the stack, then flush it to arr via memcpy.
    int tmp_arr[16];
    for (auto _ : s) {
        for (long i = 0; i < size / 16; i++) {
            for (int j = 0; j < 16; j++) {
                // Very small compute per element.
                int tmp = rand();
                tmp = log(tmp);
                //arr[i] = tmp;
                tmp_arr[j] = tmp;
            }
            // Flush one 64-byte block (16 ints) to the big array.
            memcpy(arr + (i * 16), tmp_arr, sizeof(tmp_arr));
        }
    }
    delete [] arr;
}
//running on 10 threads via Google Benchmark
BENCHMARK(runLargeMemoryAlloOutside)->Threads(10);
BENCHMARK_MAIN();
The code is compiled with gcc (C++20) using the flags "-Ofast -O5".
I don't fully understand why the run time on 10 threads grows to 77x that of single-threaded execution, and I hope someone can shed some light on the increase, and on what else can be done to speed up memory writes. I was not expecting multithreaded execution to be this slow.
int tmp = rand(); - There's your problem. GNU/Linux (glibc) rand() shares one seed state program-wide, protected by a lock, so all your threads contend for that lock, making overall throughput much worse than with a single thread. Even uncontended locks are not free (that's the case where the cache line holding the lock stays in Modified state on the core doing the locking and unlocking), but contention for a lock is much slower.

Storing the result to memory vs. adding it to a per-thread temporary would run at about the same speed, because your loop body is far slower than memory bandwidth, especially with all threads competing for a lock.
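A minimal sketch of one way to avoid the shared rand() state, assuming a low-quality per-thread generator is good enough (the helper name is just for illustration): give each thread its own engine via thread_local, so there is no lock to contend on.

#include <random>

// Hypothetical drop-in for the rand() call inside the loop: each thread owns
// an independent engine (and seed), so threads never touch shared PRNG state.
static int thread_local_rand() {
    thread_local std::minstd_rand gen{std::random_device{}()};
    return static_cast<int>(gen());
}

With something like this, int tmp = thread_local_rand(); would replace int tmp = rand(); in the inner loop.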
Floating-point log is also slow-ish, but maybe you want the work per element to be slow enough that one thread can't come close to saturating memory bandwidth while all cores together can. If you want random numbers, a simple cheap PRNG like xorshift* can be good (https://en.wikipedia.org/wiki/Xorshift#xorshift*), or some of the later variations. That should be enough to stop the compiler from auto-vectorizing the random-number generation, if you're trying to make your code intentionally slow instead of e.g. just filling with arr[i] = i;. If you want fast random numbers, xorshift+ is easy to vectorize.
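For reference, a minimal xorshift64* sketch (constants from the Wikipedia page linked above; the seed shown is an arbitrary non-zero value, and the state must stay non-zero):

#include <cstdint>

// xorshift64*: cheap PRNG; one uint64_t of state per thread, must be non-zero.
inline std::uint64_t xorshift64star(std::uint64_t &state) {
    std::uint64_t x = state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    state = x;
    return x * 0x2545F4914F6CDD1DULL;
}

// Per-thread usage, e.g. in the benchmark loop:
// thread_local std::uint64_t rng_state = 0x9E3779B97F4A7C15ULL; // arbitrary non-zero seed
// int tmp = static_cast<int>(xorshift64star(rng_state) >> 33);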
Regarding alignas(64) int * arr = new int[size]: I think that just aligns the storage for the pointer object itself, not the allocated space. (alignas() does count in bytes, not bits, so 64 is correct for the size of a cache line. You wrote "bits" in the text of your question, but 16x 32-bit ints are 64 bytes.)
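If the goal is to have the allocation itself start on a cache-line boundary, a short sketch of two standard C++17 options (the function and type names here are just illustrative):

#include <cstdlib> // std::aligned_alloc, std::free

void alignedAllocationSketch() {
    long size = 200'000'000;

    // Option 1: std::aligned_alloc (the byte count must be a multiple of the
    // alignment; 200'000'000 * 4 bytes is a multiple of 64).
    int *arr = static_cast<int *>(std::aligned_alloc(64, size * sizeof(int)));
    // ... fill arr ...
    std::free(arr);

    // Option 2: make the element type itself cache-line sized and aligned;
    // in C++17, new/delete of an over-aligned type use aligned allocation.
    struct alignas(64) CacheLine { int v[16]; };
    CacheLine *blocks = new CacheLine[size / 16];
    // ... fill blocks[i].v[j] ...
    delete[] blocks;
}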
Check the compiler-generated asm to see what the compiler did: whether it's actually storing to a local buffer and then copying to the big array, or whether it optimizes away the memcpy. (If it copies, it probably inlines the memcpy since the size is small and fixed, e.g. as four xmm copies, since you didn't use -march=native or -march=x86-64-v3 to let it use AVX 32-byte vectors.)
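Since you already experimented with _mm_store_si128, here is a hedged sketch of what a 32-byte AVX2 version of the flush could look like (the helper name is made up; it assumes the code is built with -mavx2 or -march=x86-64-v3 and that arr is at least 32-byte aligned, which a 64-byte-aligned allocation provides):

#include <immintrin.h>

// Copies one 64-byte block (16 ints) with two 32-byte AVX2 stores.
// Usage in the loop would be: store_block_avx2(arr + i * 16, tmp_arr);
inline void store_block_avx2(int *dst, const int *tmp_arr) {
    __m256i lo = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(tmp_arr));
    __m256i hi = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(tmp_arr + 8));
    _mm256_store_si256(reinterpret_cast<__m256i *>(dst), lo);     // needs 32-byte-aligned dst
    _mm256_store_si256(reinterpret_cast<__m256i *>(dst + 8), hi);
}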