When to use parallel counting - MIT HAKMEM for bitcount when memory is an issue?


Bit counting can be done in several ways, e.g. with a set-bit iterator, an unset-bit iterator, pre-computed lookup tables, or parallel counting. As I have figured out by searching the web, the unset-bit iterator is fast when there are few unset bits, and the set-bit iterator the opposite. But when should you use parallel counting, MIT HAKMEM (seen below) in particular? It seems quite fast, although probably slower than lookup tables. Is it always better than the set/unset-bit iterators in terms of speed? Are there other concerns than speed and memory when choosing between them?

 int BitCount(unsigned int u) {
     unsigned int uCount;

     /* Octal masks: leave each 3-bit group holding its own bit count,
        then fold adjacent groups and reduce modulo 63. */
     uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
     return ((uCount + (uCount >> 3)) & 030707070707) % 63;
 }
There are 4 answers below.

BEST ANSWER

Why choose one bit-counting method over another? Well, it really depends on your machine and the problem you're trying to solve. Note that all the instruction counts I give below are for a basic RISC processor and might not translate well to a more complicated beast like x86.

The HAKMEM algorithm you quoted will execute in 13 instructions but is unlikely to be very fast due to the modulus operator. By eye-balling it, it does look like it has some pretty good instruction level parallelism which should help if your processor is capable of exploiting that.

The algorithm Bo Persson presented is quite fast (2 + 5*pop(x) instructions) but only if the word is sparsely populated. It can also be modified to work on densely populated words. It contains branches and doesn't have any significant instruction-level parallelism.

EDIT: The table lookup method can also be very fast but does make memory accesses. If the entire table is in the L1 cache then it's probably one of the fastest algorithms. If the table isn't in cache then it's almost certainly one of the slowest.
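For reference, the table-lookup variant might be sketched like this, using an 8-bit table (the function and table names are my own, and the table size/initialization are one common choice among several):

```c
#include <assert.h>

/* 256-entry table: bitTable[i] holds the number of set bits in byte i. */
static unsigned char bitTable[256];

static void initBitTable(void) {
    /* bitTable[0] is 0 (static zero-initialization); each entry adds the
       lowest bit of i to the count already computed for i/2. */
    for (int i = 1; i < 256; i++)
        bitTable[i] = (unsigned char)((i & 1) + bitTable[i / 2]);
}

/* Look up each of the four bytes of u and sum the results. */
int popLookup(unsigned int u) {
    return bitTable[u & 0xFF]
         + bitTable[(u >> 8) & 0xFF]
         + bitTable[(u >> 16) & 0xFF]
         + bitTable[(u >> 24) & 0xFF];
}
```

A 256-byte table fits easily in L1; a 16-bit-indexed table halves the lookups at the cost of 64 KB, which is exactly the cache trade-off described above.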

The algorithm below is a variation of the HAKMEM algorithm and is presented in the book Hacker's Delight (I highly recommend this book if you like this sort of thing). It executes in 19 instructions and is branch-free. It also doesn't use a division but does have a multiplication. It's also very economical in the way it uses registers by re-using the same mask as much as possible. Still no significant instruction-level parallelism here that I can see.

int pop(unsigned x) {
  unsigned n;

  /* Subtract shifted copies so each 4-bit nibble ends up
     holding the number of set bits in that nibble. */
  n = (x >> 1) & 0x77777777;
  x = x - n;
  n = (n >> 1) & 0x77777777;
  x = x - n;
  n = (n >> 1) & 0x77777777;
  x = x - n;
  /* Fold nibbles into bytes, then sum all four bytes with a
     multiply; the total lands in the top byte. */
  x = (x + (x >> 4)) & 0x0F0F0F0F;
  x = x * 0x01010101;
  return x >> 24;
}

The Hacker's Delight book also presents a couple of even more specialised algorithms for 7-, 8-, and 9-bit wide fields, or using floating-point operators. Note that most of the analysis I presented above was also partially taken from that book.

The fact is that there's a truckload of methods and the only way to be sure which one works best in your particular situation is to measure and compare. I do realise that this is a pretty canned answer, but the alternative is to know your processor and compiler inside out.
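To make "measure and compare" concrete, here is a minimal sketch (function names are mine) that cross-checks two of the methods from this page for agreement on pseudo-random inputs; a real comparison would wrap each loop in timing calls, e.g. with clock():

```c
#include <stdlib.h>

/* HAKMEM 169, as quoted in the question. */
static int bitCountHakmem(unsigned int u) {
    unsigned int c = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
    return ((c + (c >> 3)) & 030707070707) % 63;
}

/* Kernighan's loop, as in another answer on this page. */
static int bitCountKernighan(unsigned int x) {
    int count = 0;
    while (x) {
        count++;
        x &= x - 1;
    }
    return count;
}

/* Returns 1 if both methods agree on `iterations` pseudo-random inputs. */
int crossCheck(int iterations) {
    srand(12345); /* fixed seed for reproducibility */
    for (int i = 0; i < iterations; i++) {
        /* rand() may yield as few as 15 random bits, so combine two calls. */
        unsigned int x = ((unsigned int)rand() << 16) ^ (unsigned int)rand();
        if (bitCountHakmem(x) != bitCountKernighan(x))
            return 0;
    }
    return 1;
}
```

Verifying agreement first matters: a benchmark of a subtly wrong popcount tells you nothing useful.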

ANSWER

This is very simple, but amazingly fast if there are just a few bits set:

typedef unsigned long long U64; /* 64-bit word, as used on the wiki */

int popCount (U64 x) {
   int count = 0;
   while (x) {
       count++;
       x &= x - 1; // reset LS1B (clear the least significant set bit)
   }
   return count;
}

From the Chess Programming Wiki: Brian Kernighan's way.

ANSWER

If you have an x86 CPU that supports SSE4, you already have a single instruction to do all the work: POPCNT. See the Intel® SSE4 Programming Reference. (PDF, 698KB)
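A sketch of how to reach that instruction portably from C: GCC and Clang provide `__builtin_popcount`, which compiles to a single POPCNT when the target allows it (e.g. built with -mpopcnt or -msse4.2) and otherwise falls back to a software sequence:

```c
/* Wrapper around the GCC/Clang builtin. With -mpopcnt this is a
 * single POPCNT instruction; without it, the compiler emits a
 * branch-free software popcount. */
int popBuiltin(unsigned int u) {
    return __builtin_popcount(u);
}
```

Using the builtin rather than inline assembly keeps the code portable while still getting the hardware instruction where it exists.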

Another remark: parallel algorithms like HAKMEM 169 do not contain conditional branches. This is a huge advantage. Modern processors predict the direction of conditional branches but have trouble with random branching patterns, and mispredictions are very costly (a single one can cost more than the equivalent of 100 instructions). It is best to avoid loops with a variable trip count and/or conditional statements if performance is important.

I also second the recommendation of the book Hacker's Delight.

ANSWER

How about something like the following:

  1. From each raw word r, compute a pair-munged word as `w = r - ((r >> 1) & 0x55555555)`. Each pair of bits will hold the number of set bits in that pair (0-2).
  2. Next, from each pair-munged word, compute a quad-munged word `q = (w & 0x33333333) + ((w >> 2) & 0x33333333)`. Each quartet of set bits will hold the number of set bits in that quartet (0-4).
  3. Sum together the quad-munged words in groups of two, and from each sum `s` compute an octet-munged sum `o = (s & 0x0F0F0F0F) + ((s >> 4) & 0x0F0F0F0F)`. Each octet will hold the total number of set bits in the corresponding octet in both input words (0-16).
  4. Sum together octet-munged sums in groups of eight, and from each sum `s` compute a halfword-munged sum `h = (s & 0x00FF00FF) + ((s >> 8) & 0x00FF00FF)`. Each halfword will contain a total number of set bits in the corresponding halfword of all 16 input words (0-256).
  5. Sum together the halfword-munged sums in groups of 64, and from each sum `s` compute a total sum `t = (s & 0x0000FFFF) + ((s >> 16) & 0xFFFF)`. This will hold the number of set bits in 1024 input words (0-32768).

Note that two of the steps are performed for every word, one for every other word, one for every sixteen, and one for every 1024. If words are 64 bits instead of 32, an extra step would be necessary for the final sum, but the word-munged sums could be added in groups of 65,536 (representing 67,108,864 input words).
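The per-word stages above might be sketched in C as follows. This is a simplified grouping (names are my own): the octet-level sums from step 3 are folded straight into a running total rather than being accumulated in groups of eight and beyond, which keeps the sketch short at the cost of the later amortization steps:

```c
#include <stddef.h>
#include <stdint.h>

/* Count set bits across an array of 32-bit words, processing pairs. */
unsigned long bulkPop(const uint32_t *words, size_t n) {
    unsigned long total = 0;
    size_t i = 0;
    while (i + 1 < n) {
        uint32_t w0 = words[i], w1 = words[i + 1];
        /* Step 1: each 2-bit pair holds its bit count (0-2). */
        w0 = w0 - ((w0 >> 1) & 0x55555555u);
        w1 = w1 - ((w1 >> 1) & 0x55555555u);
        /* Step 2: each 4-bit quad holds its bit count (0-4). */
        w0 = (w0 & 0x33333333u) + ((w0 >> 2) & 0x33333333u);
        w1 = (w1 & 0x33333333u) + ((w1 >> 2) & 0x33333333u);
        /* Step 3: sum the pair of words (each quad <= 8, no carry-out),
           then form octet counts (0-16). */
        uint32_t s = w0 + w1;
        uint32_t o = (s & 0x0F0F0F0Fu) + ((s >> 4) & 0x0F0F0F0Fu);
        /* Simplification: sum the four octets via multiply and fold
           into the running total instead of further group accumulation. */
        total += (o * 0x01010101u) >> 24;
        i += 2;
    }
    /* Odd leftover word: ordinary per-word popcount. */
    if (i < n) {
        uint32_t x = words[i];
        x = x - ((x >> 1) & 0x55555555u);
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;
        total += (x * 0x01010101u) >> 24;
    }
    return total;
}
```

The point of the staged grouping in the answer above is that the cheap additions are shared across many words; this sketch shows only the first sharing level (pairs of words per octet-munge).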