Optimized way to perform AVX2 VPXOR and popcount in minimum clock cycles


We have to perform a bitwise XOR on two arrays, each containing 5 elements of uint64_t (unsigned long long), and then count the 1 bits in the result (popcount). What is the fastest way to do this, in the fewest clock cycles, using the 256-bit wide YMM registers, AVX2 VPXOR, and popcount?

Right now we are doing this with the following code snippet:

for (j = 0; j < 5; j++) {
    /* XOR one pair of 64-bit words, then count the set bits */
    xorResult = cylinderArrayVectorA[j] ^ cylinderArrayVectorB[j];
    noOfOnes = _mm_popcnt_u64(xorResult);
    sumOfOnes += noOfOnes;
}

Array A and array B each hold 260 bits of data, stored in 5 x 64 = 320 bits. Is there a faster way to do the XOR and popcount with AVX2 than the loop above?
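For reference, here is a minimal sketch of one way the VPXOR part could look. AVX2 has no vector popcount instruction (that arrived with the AVX-512 VPOPCNTDQ extension), so this sketch XORs the first four words with a single VPXOR, extracts the lanes, and still uses the scalar POPCNT on each one; the fifth word is handled in scalar code. The function names hamming5_scalar, hamming5_avx2, and hamming5 are made up for illustration, and the code assumes GCC or Clang on x86-64. With only five words the plain loop may well be just as fast, so both versions should be measured.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX2_PATH 1
#endif

/* Baseline: the same loop as in the question, using the compiler builtin. */
static unsigned hamming5_scalar(const uint64_t *a, const uint64_t *b)
{
    unsigned sum = 0;
    for (int j = 0; j < 5; j++)
        sum += (unsigned)__builtin_popcountll(a[j] ^ b[j]);
    return sum;
}

#ifdef HAVE_AVX2_PATH
/* Hypothetical AVX2 variant: one VPXOR for words 0..3, scalar POPCNT after. */
__attribute__((target("avx2,popcnt")))
static unsigned hamming5_avx2(const uint64_t *a, const uint64_t *b)
{
    /* Unaligned loads, so the arrays need no special alignment. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vx = _mm256_xor_si256(va, vb); /* VPXOR on 4 x 64 bits */

    unsigned sum = 0;
    sum += (unsigned)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 0));
    sum += (unsigned)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 1));
    sum += (unsigned)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 2));
    sum += (unsigned)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 3));

    /* The fifth word does not fill a YMM register, so keep it scalar. */
    sum += (unsigned)_mm_popcnt_u64(a[4] ^ b[4]);
    return sum;
}
#endif

/* Picks the AVX2 path only when the running CPU reports support for it. */
static unsigned hamming5(const uint64_t *a, const uint64_t *b)
{
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("popcnt"))
        return hamming5_avx2(a, b);
#endif
    return hamming5_scalar(a, b);
}
```

Calling hamming5(cylinderArrayVectorA, cylinderArrayVectorB) would replace the loop and return the value accumulated in sumOfOnes. If many such 5-word pairs are processed in a batch, a vectorized popcount based on a nibble lookup with VPSHUFB is another option worth benchmarking, but that is outside this sketch.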
