How to optimize C code: finding the next set bit and summing the corresponding array elements

EDIT: Now I realize I didn't explain my algorithm well enough. I'll try again.

What I'm doing is very similar to a dot product of two vectors, but with one difference. I've got two vectors of the same length: one vector of bits and one vector of floats. I need to calculate the sum float[0]*bit[0] + float[1]*bit[1] + ... + float[N-1]*bit[N-1], BUT the difference from a classic dot product is that I need to skip a fixed number of elements after each set bit.

Example:

vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits   = {1, 0, 1, 0, 1 }
nSkip = 2

in this case sum is calculated as follows:

sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
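
For clarity, here is a minimal, unoptimized sketch that pins down exactly these semantics on plain (unpacked) arrays; the names floats, bits, n and nSkip are illustrative, not from my real code:

#include <stddef.h>

/* Reference version of the sum-with-skip described above.
   Deliberately naive; it only documents the intended result. */
static float sum_with_skip(const float *floats, const unsigned char *bits,
                           size_t n, size_t nSkip)
{
    float sum = 0.0f;
    size_t i = 0;
    while (i < n)
    {
        sum += floats[i] * bits[i];    /* bits[i] is 0 or 1 */
        if (bits[i])
            i += nSkip + 1;            /* skip the next nSkip elements */
        else
            i += 1;
    }
    return sum;
}

For the example above this returns 1.5 + 0 + 1.0 = 2.5.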

The following code is called many times with different data, so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some extent but still slow, and I don't know how to improve it further. The bit array is implemented as an array of 64-bit unsigned ints, which allows me to use _BitScanForward64 to find the next set bit.

code:

/* _BitScanForward64 requires <intrin.h> */
unsigned int i = 0;
float fSum = 0;
do
{
  unsigned int nAddr = i / 64;               /* word containing bit i        */
  unsigned int nShift = i & 63;              /* bit position within the word */
  unsigned __int64 v = bitarray[nAddr] >> nShift;
  unsigned long idx;
  if (!_BitScanForward64(&idx, v))
  {
    i += 64 - nShift;                        /* no set bit left in this word */
    continue;
  }
  i += idx;                                  /* absolute index of the set bit */
  fSum += floatarray[i];
  i += nSkip + 1;                            /* step past this element and the nSkip skipped ones */
} while (i < nEnd);

The profiler shows the 3 slowest hotspots:

1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v) 
3. fSum += floatarray[i]; (memory access)

But probably there is a different way of doing this. I was thinking about just resetting the nSkip bits after each set bit in the bit vector and then calculating a classical dot product. I haven't tried it yet, and honestly I don't believe it will be faster given the extra memory accesses.
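
In case it helps, a rough, untested sketch of that bit-clearing idea might look like this (build_skip_mask is a hypothetical helper name; nBits and nSkip are the same quantities as above):

#include <string.h>

/* Hypothetical helper: keep only the set bits that the skip logic would
   actually visit, clearing the nSkip positions that follow each kept bit.
   The cleaned mask could then be fed to a plain dot product. */
static void build_skip_mask(const unsigned __int64 *bits, unsigned __int64 *mask,
                            unsigned nBits, unsigned nSkip)
{
    unsigned i = 0;
    memset(mask, 0, ((nBits + 63) / 64) * sizeof(unsigned __int64));
    while (i < nBits)
    {
        if ((bits[i / 64] >> (i & 63)) & 1)
        {
            mask[i / 64] |= 1ULL << (i & 63);  /* keep this bit */
            i += nSkip + 1;                    /* drop its skip window */
        }
        else
        {
            i++;
        }
    }
}

Bits inside a skip window contribute nothing either way, so a classical dot product over mask should give the same sum; whether the extra pass and memory traffic pay off is exactly what I doubt.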

There are 2 answers below.

Answer 1:

I think you should read "how to ask questions" first. You will not gain many upvotes for this, since you are asking us to do the work for you instead of presenting a particular problem.

I cannot see why you are incrementing the same variable (i) in two places instead of one. I also think you should declare the variables only once, not in every iteration.

Answer 2:

You have too many operations inside the loop. You also have only one loop, so many of the operations that only need to happen once per flag word (the 64-bit unsigned integer) are happening 63 extra times.

Consider division an expensive operation and try not to do it too often when optimizing code for performance.

Memory access is also considered expensive in terms of how long it takes, so this should also be limited to required accesses only.

Tests that allow you to exit early are often useful (though sometimes the test itself is expensive relative to the operations you'd be avoiding, but that's probably not the case here).

Using nested loops should simplify this a lot. The outer loop should work at the 64-bit word level, and the inner loop should work at the bit level.


I have noticed a mistake in my earlier recommendations: since the division here is by 64, which is a power of 2, it is not actually an expensive operation. But we still need to move as many operations out of the loops as we can.

/* This is completely untested, but it incorporates the optimizations
   that I outlined as well as a few others.
   I process the arrays backwards, which eliminates comparisons of
   variables against other variables; those are much slower than
   comparisons against 0, which are essentially free on many processors
   when you have just operated on or loaded the value into a register.
   Going backwards at the bit level also allows for the possibility that
   the compiler will treat the test of the top bit as a test for
   negative, which is cheap and mostly free for all but the first time
   through the inner loop (for each time through the outer loop).
 */
double acc = 0.0;

unsigned i_end;
unsigned i_bit;      /* bit index within the current word */
int i_word;          /* word index; signed so the >= 0 test works */

if (nEnd == 0)
{
    return acc;      /* nothing to sum */
}
i_end  = nEnd - 1;
i_bit  = i_end % 64;
i_word = i_end / 64;

do
{
    /* Shift the word so that the first bit to test sits at the top;
       for every word after the first, i_bit is 63 and the shift is 0. */
    unsigned __int64 v = bitarray[i_word] << (63 - i_bit);
    unsigned i_upper = (unsigned)i_word << 6;   /* i_word * 64 */
    while (v)
    {
        if (v & 0x8000000000000000ULL)          /* top bit set? */
        {
            /* The following is semantically the same as
               unsigned i = i_bit + (i_word * 64); */
            unsigned i = i_bit | i_upper;
            acc += floatarray[i];
        }
        v <<= 1;
        i_bit--;
    }
    i_bit = 63;      /* every remaining word is processed in full */
    i_word--;
} while (i_word >= 0);