I am measuring memory throughput and runtimes using the _mm256_i32gather_epi32 intrinsic. Here is the loop I use for testing:
for (int i = 0; i < len; i += 8) {
    // Load 8 32-bit indices, then gather 8 32-bit values (scale = 4 bytes per index unit).
    const __m256i* indexes_2 = reinterpret_cast<const __m256i*>(indexes_ptr + i);
    __m256i index_reg = _mm256_loadu_si256(indexes_2);
    __m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
    sum = _mm256_add_epi32(sum, values);
}
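For reproduction, here is a minimal self-contained version of the benchmark. Treat it as a sketch rather than the exact harness: the sequential index fill, the steady_clock timing, and the throughput accounting (gathered bytes over wall time) are assumptions that may differ from how the numbers below were produced. Compile with -O2 -mavx2.

#include <immintrin.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // 256 MB of 32-bit values, as described below.
    const size_t data_len = 256ull * 1024 * 1024 / sizeof(int);
    std::vector<int> data(data_len, 1);
    std::vector<int> indexes(data_len);
    for (size_t i = 0; i < data_len; ++i)
        indexes[i] = static_cast<int>(i);   // sequential pattern; see the generator sketch below

    const int* data_ptr = data.data();
    const int* indexes_ptr = indexes.data();
    const int len = static_cast<int>(data_len);

    __m256i sum = _mm256_setzero_si256();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < len; i += 8) {
        __m256i index_reg = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(indexes_ptr + i));
        __m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
        sum = _mm256_add_epi32(sum, values);
    }
    auto t1 = std::chrono::steady_clock::now();

    // Reduce the accumulator so the compiler cannot discard the loop.
    alignas(32) int lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), sum);
    long long total = 0;
    for (int v : lanes) total += v;

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double mb = data_len * sizeof(int) / (1024.0 * 1024.0);
    std::printf("checksum %lld: %f s, %.4f MB/s\n", total, secs, mb / secs);
    return 0;
}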
I use the index array (pointed to by indexes_ptr) to control the access pattern into the data_ptr array. The data_ptr array is 256 MB in size, so every access misses the caches. Here are the possible patterns in indexes_ptr (a generator sketch follows the list):
- sequential - 0, 1, 2, 3, etc
- stride 4 - 0, 4, 8, 12, etc
- stride 16 - 0, 16, 32, 48, etc
- stride 32 - 0, 32, 64, 96, etc
- stride 64 - 0, 64, 128, 192, etc
- stride 128 - 0, 128, 256, 384, etc
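Concretely, the index array can be filled with a small helper along these lines. This is a sketch: the make_strided_indexes name and the modulo wrap-around are assumptions, not necessarily what my generator does.

#include <cstddef>

// Fill the index array with 0, stride, 2*stride, ... wrapping modulo
// data_len so every gathered element stays inside the data buffer.
// (Hypothetical helper; the original generator is not shown above.)
void make_strided_indexes(int* indexes, size_t count, size_t stride, size_t data_len) {
    for (size_t i = 0; i < count; ++i) {
        indexes[i] = static_cast<int>((i * stride) % data_len);
    }
}

With stride = 1 this produces the sequential pattern; stride = 4, 16, 32, 64 and 128 produce the strided patterns above.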
So each _mm256_i32gather_epi32 call loads 8 values of 4 bytes each. On my system the cache line size is 64 bytes, so (the sketch after this list double-checks these counts):
- sequential touches one cache line
- stride 4 touches two cache lines
- stride 16 touches eight cache lines
- stride 32 touches eight cache lines
- stride 64 touches eight cache lines
- stride 128 touches eight cache lines
My expectation was that stride 16, 32, 64 and 128 would have similar runtimes and memory throughput, since each gather touches the same number of cache lines. This is, however, not the case. Here are the numbers:
- sequential, 0.13 s, 16828.2607 MB/s
- stride 4, 0.07 s, 17246.1914 MB/s
- stride 16, 0.918406 s, 5205.1085 MB/s
- stride 32, 1.650566 s, 4756.5279 MB/s
- stride 64, 1.798604 s, 5440.2228 MB/s
- stride 128, 2.186620 s, 4672.1329 MB/s
Where does the difference between strides 16, 32, 64 and 128 come from, given that each gather instruction accesses exactly eight cache lines?