I am measuring memory throughput and runtimes using the _mm256_i32gather_epi32 intrinsic. Here is the loop I use for testing:
for (int i = 0; i < len; i += 8) {
    // Load 8 32-bit indices, then gather 8 32-bit values (scale = 4 bytes per index unit).
    const __m256i* indexes_2 = reinterpret_cast<const __m256i*>(indexes_ptr + i);
    __m256i index_reg = _mm256_loadu_si256(indexes_2);
    __m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
    sum = _mm256_add_epi32(sum, values);
}
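For reproduction, here is a minimal self-contained version of the benchmark. Treat it as a sketch rather than the exact harness: the sequential index fill, the steady_clock timing, and the throughput accounting (gathered bytes over wall time) are assumptions that may differ from how the numbers below were produced. Compile with -O2 -mavx2.

#include <immintrin.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // 256 MB of 32-bit values, as described below.
    const size_t data_len = 256ull * 1024 * 1024 / sizeof(int);
    std::vector<int> data(data_len, 1);
    std::vector<int> indexes(data_len);
    for (size_t i = 0; i < data_len; ++i)
        indexes[i] = static_cast<int>(i);   // sequential pattern; see the generator sketch below

    const int* data_ptr = data.data();
    const int* indexes_ptr = indexes.data();
    const int len = static_cast<int>(data_len);

    __m256i sum = _mm256_setzero_si256();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < len; i += 8) {
        __m256i index_reg = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(indexes_ptr + i));
        __m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
        sum = _mm256_add_epi32(sum, values);
    }
    auto t1 = std::chrono::steady_clock::now();

    // Reduce the accumulator so the compiler cannot discard the loop.
    alignas(32) int lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), sum);
    long long total = 0;
    for (int v : lanes) total += v;

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double mb = data_len * sizeof(int) / (1024.0 * 1024.0);
    std::printf("checksum %lld: %f s, %.4f MB/s\n", total, secs, mb / secs);
    return 0;
}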
I use the index array (pointed to by indexes_ptr) to control the access pattern into the data_ptr array. The data_ptr array is 256 MB in size, so every access misses the caches. Here are the possible patterns in indexes_ptr (a generator sketch follows the list):
- sequential - 0, 1, 2, 3, etc
- stride 4 - 0, 4, 8, 12, etc
- stride 16 - 0, 16, 32, 48, etc
- stride 32 - 0, 32, 64, 96, etc
- stride 64 - 0, 64, 128, 192, etc
- stride 128 - 0, 128, 256, 384, etc
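Concretely, the index array can be filled with a small helper along these lines. This is a sketch: the make_strided_indexes name and the modulo wrap-around are assumptions, not necessarily what my generator does.

#include <cstddef>

// Fill the index array with 0, stride, 2*stride, ... wrapping modulo
// data_len so every gathered element stays inside the data buffer.
// (Hypothetical helper; the original generator is not shown above.)
void make_strided_indexes(int* indexes, size_t count, size_t stride, size_t data_len) {
    for (size_t i = 0; i < count; ++i) {
        indexes[i] = static_cast<int>((i * stride) % data_len);
    }
}

With stride = 1 this produces the sequential pattern; stride = 4, 16, 32, 64 and 128 produce the strided patterns above.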
So each _mm256_i32gather_epi32 call loads 8 values of 4 bytes each. On my system the cache line size is 64 bytes, so (the sketch after this list double-checks these counts):
- sequential touches one cache line
- stride 4 touches two cache lines
- stride 16 touches eight cache lines
- stride 32 touches eight cache lines
- stride 64 touches eight cache lines
- stride 128 touches eight cache lines
My expectation was that stride 16, 32, 64 and 128 would have similar runtimes and memory throughput, since each gather touches the same number of cache lines. This is, however, not the case. Here are the numbers:
- sequential, 0.13 s, 16828.2607 MB/s
- stride 4, 0.07 s, 17246.1914 MB/s
- stride 16, 0.918406 s, 5205.1085 MB/s
- stride 32, 1.650566 s, 4756.5279 MB/s
- stride 64, 1.798604 s, 5440.2228 MB/s
- stride 128, 2.186620 s, 4672.1329 MB/s
Where does the difference between strides 16, 32, 64 and 128 come from, given that each gather instruction accesses exactly eight cache lines?