How to Improve XORing of large uint64 arrays?


I want to XOR large shifted arrays; the following is a portable version of that function, for ease of explanation. How can I improve this computation? I have tried using AVX2 but didn't see much improvement. Currently, for the DB shown in the example below, it takes 50 ms to process everything, which is 12 GB/s. I would appreciate any tips to improve the computation.

#include <cstdint>
#include <iostream>

uint64_t partition_size = 4096;
uint64_t entry_size = 32; // bytes
uint64_t DB_size = 16777216;
uint64_t *DB = new uint64_t[DB_size * entry_size / sizeof(uint64_t)];


//partition_index will be a random multiple of partition_size, e.g. 0, 8192, 4096 etc
//random_offset will be a random number in [0, partition_size]
void xor_shifted_arrays(uint32_t partition_index, uint32_t random_offset, uint64_t *result)
{
    auto uint64_per_entry = entry_size / sizeof(uint64_t);

    int shift_offset;
    uint32_t shift;
    
    for (int i = 0; i < partition_size  ; i = i + 1)
    {
        shift = (i + random_offset) & (partition_size - 1);
        shift_offset = shift * uint64_per_entry;
        
        for (int j = 0; j < uint64_per_entry; j=j+1){
            result[shift_offset + j] = result[shift_offset + j] ^ DB[partition_index + j];  
        }
        partition_index = partition_index + uint64_per_entry;
    }
}

Update: Here is a godbolt link: https://godbolt.org/z/j14av3fGq. I also ran this code on two instances:

Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz running MacOS 13.6, 16 GB DDR4 RAM, compiler Apple clang version 15.0.0 (clang-1500.0.40.1)

AWS r7i.2xlarge Intel(R) Xeon(R) Platinum 8488C with Ubuntu, 64 GB DDR5 RAM, compiler g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Surprisingly, it is 2x slower on the Xeon!

I am compiling with -O3.

Update 2: This may be useful context. The function above gets called from outer functions, something like this (not running code):

void outer_function(){
  uint64_t *result1 = new uint64_t[partition_size];
  uint64_t *result2 = new uint64_t[partition_size];
  uint64_t number_partitions = 4096;
  for (int i = 0; i < number_partitions; i++){
      xor_shifted_arrays(i * partition_size, some_rnd_gen(), result1);
  }
  for (int i = 0; i < number_partitions; i++){
      xor_shifted_arrays(i * partition_size, some_rnd_gen(), result2);
  }
}
There are 2 answers below.

Answer by chtz:

These are mostly generic tips which will help your compiler optimize the code (many of them were already mentioned in the comments):

  • For array indexes/sizes always use size_t (or ptrdiff_t if you want to allow negative indexes). This will save the compiler from dealing with potential wrap-around or overflow behavior.
  • Declare variables at the point you initialize them/in the scope you use them
  • Declare all variables which never change after initialization as const. If the compiler knows their value, it can optimize multiplications or divisions by them, especially if they are powers of 2.
  • Pass pointers to arrays you modify as function parameters with the __restrict keyword (only do this if you are sure that the array does not alias/overlap with other arrays). This often helps the compiler with unrolling and SIMD optimization.
  • Compile your code with -march=native (or -march=target_architecture) to optimize for your local or some specific CPU architecture (additionally to -O3, of course).
  • Just for readability: Instead of A = A + B, write A += B (same with ^/^=, etc.). This usually does not matter to the compiler.

Assuming the j=j+4 in your godbolt link was a typo and you meant j+=1 (or ++j), this results in very clean AVX2 code:

// assuming these are known at compile-time:
const size_t partition_size = 4096; 
const size_t entry_size = 256; // bits
const size_t DB_size = 16777216;

uint64_t *DB = new uint64_t[DB_size * entry_size/64];
const size_t uint64_per_entry = entry_size / 64; // 64 bits per uint64_t
//uint64_t *result = new uint64_t[partition_size]; // pass this as parameter

//partition_index will be a random multiple of partition_size, e.g. 0, 8192, 4096 etc
//random_offset will be a random number in [0, partition_size]
void xor_shifted_arrays(size_t partition_index, size_t random_offset, uint64_t *__restrict result)
{

    for (size_t i = 0; i < partition_size  ; i++)
    {
        const size_t shift = (i + random_offset) & (partition_size - 1);
        const size_t shift_offset = shift * uint64_per_entry;
        
        for (size_t j = 0; j < uint64_per_entry; j++){
            result[shift_offset + j] ^= DB[partition_index + j];    
        }
        partition_index += uint64_per_entry;
    }
}

Godbolt-Link: https://godbolt.org/z/Khs5aTPaK

I don't see many further possible improvements for this code. You need to read and write every result entry once, together with the corresponding DB entry. If you have control over memory management, you could make sure that pointers are aligned to the page size (or at least to your SIMD width).

Also, if (for a different problem) some entries were read or modified multiple times, you could try to re-arrange your loops so that variables are read from or written to memory less often.

Answer by huseyin tugrul buyukisik:

OpenCL can vectorize simple code easily and distributes the work to all cores. For example, the following code computes the XOR of 16 million 64-bit integers in 3 milliseconds on a Ryzen 7900 (12 cores), reaching ~44 GB/s bandwidth on dual-channel 4800 MHz DDR5 memory. (This is just a simple XOR of all elements, not exactly your algorithm, so it does not use caches efficiently; it simply streams the data.)

#include <cstdint>
#include <iostream>
#include <string>
// GPGPU::Computer, HostParameter, Bench: the answerer's own OpenCL wrapper library (header not shown)

int main()
{
    constexpr int N = 4096 * 4096;
    constexpr int numWorkers = 4096*32;
    GPGPU::Computer computer(GPGPU::Computer::DEVICE_CPUS);
    std::string nStr = std::string("constant int N = ") + std::to_string(N) + R"(;
    )";
    std::string nwStr = std::string("constant int stride = ") + std::to_string(numWorkers) + R"(;
    )";
    computer.compile(nStr+nwStr+R"(
    
        kernel void Xor(global unsigned long * dataIn, global unsigned long * dataOut)
        { 
            int id = get_global_id(0);
            unsigned long xorData = dataIn[id];
            for(int i=1;i<N/stride;i++)
            {
                xorData = xorData ^ dataIn[id + i*stride];
            }

            // reduced result
            dataOut[id]=xorData;
        })", "Xor");

    GPGPU::HostParameter dataIn = computer.createArrayInput<uint64_t>("dataIn",N);
    GPGPU::HostParameter dataOut = computer.createArrayOutput<uint64_t>("dataOut", numWorkers);
    auto parameters = dataIn.next(dataOut);

    uint64_t result = 0;
    size_t t;
    for (int rep = 0; rep < 20; rep++)
    {
        for (int init = 0; init < N; init++)
            dataIn.access<uint64_t>(init) = init;
        {
            GPGPU::Bench bench(&t);
            computer.compute(parameters, "Xor", 0, numWorkers, 256);

            // final pass on host-side with just numWorkers elements
            uint64_t xorData = dataOut.access<uint64_t>(0);
            for (int i = 1; i < numWorkers; i++)
            {
                xorData ^= dataOut.access<uint64_t>(i);
            }
            result = xorData;
        }
        std::cout << "computation took " << t / 1000000000.0 << "s" << std::endl;
        std::cout << "xor: " << result << std::endl;
        std::cout << "------------" << std::endl;
    }
    return 0;
}

output:

computation took 0.0079244s
xor: 0
------------
computation took 0.0030533s
xor: 0
------------
computation took 0.0030468s
xor: 0
------------
computation took 0.0031714s
xor: 0
------------
computation took 0.0030102s
xor: 0
------------
computation took 0.0030884s
xor: 0
------------
computation took 0.0029352s
xor: 0
------------
computation took 0.0029854s
xor: 0
------------
computation took 0.0029936s
xor: 0
------------
computation took 0.0029326s
xor: 0
------------
computation took 0.0030838s
xor: 0
------------
computation took 0.0031311s
xor: 0
------------
computation took 0.0030022s
xor: 0
------------
computation took 0.0031073s
xor: 0
------------
computation took 0.0029577s
xor: 0
------------
computation took 0.0030004s
xor: 0
------------
computation took 0.0031038s
xor: 0
------------
computation took 0.0029255s
xor: 0
------------
computation took 0.002938s
xor: 0
------------
computation took 0.0029927s
xor: 0
------------

I had to install Intel's OpenCL CPU runtime to be able to use the CPU. I also tried a GPU, but PCIe v4.0 is a bottleneck there (only around 32 GB/s, and the data sits outside the CPU cache).

To control caching on L2 & L3, or to select cores explicitly, you can use the device-fission feature of OpenCL. If you don't need OpenCL, you can still inspect the assembly output of the program binary to see how it vectorizes the code, and reuse that in your own hand-tuned AVX C++ code.