I want to XOR large shifted arrays; the following is a portable version of that function, shown for ease of explanation. How can I improve this computation? I have tried using AVX2 but didn't see much improvement. Currently, for the DB shown in the example below, it takes 50 ms to process everything, which is about 12 GB/s. I would appreciate any tips to improve the computation.
#include <iostream>
#include <cstdint>
uint64_t partition_size = 4096;
uint64_t entry_size = 32; // bytes
uint64_t DB_size = 16777216;
uint64_t *DB = new uint64_t[DB_size * entry_size / sizeof(uint64_t)]; // 4 uint64s per 32-byte entry
//partition_index will be a random multiple of partition_size, e.g. 0, 8192, 4096 etc
//random_offset will be a random number in [0, partition_size]
void xor_shifted_arrays(uint32_t partition_index, uint32_t random_offset, uint64_t *result)
{
    auto uint64_per_entry = entry_size / sizeof(uint64_t);
    int shift_offset;
    uint32_t shift;
    for (int i = 0; i < partition_size; i = i + 1)
    {
        shift = (i + random_offset) & (partition_size - 1);
        shift_offset = shift * uint64_per_entry;
        for (int j = 0; j < uint64_per_entry; j = j + 1) {
            result[shift_offset + j] = result[shift_offset + j] ^ DB[partition_index + j];
        }
        partition_index = partition_index + uint64_per_entry;
    }
}
Update: Here is a godbolt link: https://godbolt.org/z/j14av3fGq. I also ran this code on two instances:
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz running MacOS 13.6, 16 GB DDR4 RAM, compiler Apple clang version 15.0.0 (clang-1500.0.40.1)
AWS r7i.2xlarge Intel(R) Xeon(R) Platinum 8488C with Ubuntu, 64 GB DDR5 RAM, compiler g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Surprisingly, it is 2x slower on the Xeon!
I am using the compiler flag -O3.
Update 2: I feel this may be useful: the above code gets called from outer functions, something like this (not running code):
void outer_function() {
    // each result holds partition_size entries of entry_size bytes
    uint64_t *result1 = new uint64_t[partition_size * entry_size / sizeof(uint64_t)];
    uint64_t *result2 = new uint64_t[partition_size * entry_size / sizeof(uint64_t)];
    uint64_t number_partitions = 4096;
    for (int i = 0; i < number_partitions; i++) {
        xor_shifted_arrays(i * partition_size, some_rnd_gen(), result1);
    }
    for (int i = 0; i < number_partitions; i++) {
        xor_shifted_arrays(i * partition_size, some_rnd_gen(), result2);
    }
}
These are mostly generic tips that will help your compiler optimize the code (many of them were already mentioned in the comments):
- Use size_t for sizes and indexes (or ptrdiff_t if you want to allow negative indexes). This will save the compiler from dealing with potential wrap-around or overflow behavior.
- Declare constants like partition_size and entry_size as constexpr (or at least const). If the compiler knows their value, it can optimize multiplications or divisions with them, especially if they are powers of 2.
- Mark the result pointer with the __restrict keyword (only do this if you are sure that this array does not alias/overlap with other arrays). This often helps the compiler with unrolling and SIMD optimization.
- Compile with -march=native (or -march=target_architecture) to optimize for your local or some specific CPU architecture (additionally to -O3, of course).
- Instead of A = A + B, write A += B (same with ^/^=, etc.). This usually does not matter for the compiler.

Assuming the j=j+4 in your godbolt link was a typo and you meant j+=1 (or ++j), this results in very clean AVX2 code. Godbolt link: https://godbolt.org/z/Khs5aTPaK
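For illustration, here is a minimal sketch of how the function could look with the tips above applied (constexpr constants, size_t indexes, __restrict, compound assignment). It is based on the code from the question and is not necessarily identical to the version behind the Godbolt link:

#include <cstdint>
#include <cstddef>

// Compile-time constants: the compiler can fold the multiplications and
// knows that partition_size is a power of 2.
constexpr size_t partition_size = 4096;
constexpr size_t entry_size = 32; // bytes
constexpr size_t uint64_per_entry = entry_size / sizeof(uint64_t);

extern uint64_t *DB; // defined elsewhere, as in the question

// __restrict promises that result does not alias DB, which helps the
// compiler vectorize the inner loop without run-time alias checks.
void xor_shifted_arrays(size_t partition_index, size_t random_offset,
                        uint64_t *__restrict result)
{
    for (size_t i = 0; i < partition_size; ++i)
    {
        const size_t shift = (i + random_offset) & (partition_size - 1);
        const size_t shift_offset = shift * uint64_per_entry;
        for (size_t j = 0; j < uint64_per_entry; ++j)
            result[shift_offset + j] ^= DB[partition_index + j];
        partition_index += uint64_per_entry;
    }
}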
I don't see many further possible improvements for this code. You need to read and write every result entry once, together with the corresponding DB entry. If you have control over memory management, you could make sure that the pointers are aligned to page sizes (or at least to your SIMD width).

Also, if (for a different problem) some entries were read or modified multiple times, you could try to re-arrange your loops so that variables are read from or written to memory less often.
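For example, if you control the allocation of result and DB, std::aligned_alloc (C++17) can give you buffers with a chosen alignment. This is only a sketch; alloc_aligned_u64 is a made-up helper, and the 64-byte alignment (cache line / AVX-512 width) is an assumption; pass 4096 to align to a typical page size instead:

#include <cstdint>
#include <cstdlib>

// Allocate count uint64_t values aligned to alignment bytes.
// Note: the memory is not zero-initialized; free it with std::free, not delete[].
uint64_t *alloc_aligned_u64(std::size_t count, std::size_t alignment = 64)
{
    std::size_t bytes = count * sizeof(uint64_t);
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    bytes = (bytes + alignment - 1) / alignment * alignment;
    return static_cast<uint64_t *>(std::aligned_alloc(alignment, bytes));
}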