I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes.
I know that AVX-512 has a VPOPCNTDQ instruction which counts the number of 1 bits in each consecutive 64 bits within a 512-bit vector, and IIANM it should be possible to issue one of these every cycle (if an appropriate SIMD vector register is available) - but I don't have any experience writing SIMD code (I'm more of a GPU guy). Also, I'm not 100% sure about compiler support for AVX-512 targets.
On most CPUs, still, AVX-512 is not (fully) supported; but AVX-2 is widely-available. I've not been able to find an less-than-512-bit vectorized instruction similar to VPOPCNTDQ, so even theoretically I'm not sure how to count bits fast with AVX-2 capable CPUs; maybe something like this exists and I just missed it somehow?
Anyway, I'd appreciate a short C/C++ function - either using some intristics-wrapper library or with inline assembly - for each of the two instruction sets. The signature is
uint64_t count_bits(void* ptr, size_t size);
Notes:
- Related to How to quickly count bits into separate bins in a series of ints on Sandy Bridge? , but not a dupe.
- We can assume the input is well-aligned, if that matters.
- Forget about multiple cores or sockets, I want code for a single (thread on a single) core.
AVX-2
@HadiBreis' comment links to an article on fast population-count with SSSE3, by Wojciech Muła; the article links to this GitHub repository; and the repository has the following AVX-2 implementation. It's based on a vectorized lookup instruction, and using a 16-value lookup table for the bit counts of nibbles.
AVX-512
The same repository also has a VPOPCNT-based AVX-512 implementation. Before listing the code for it, here's the simplified and more readable pseudocode:
And now for the real deal:
The
lookup8bit
is a popcnt lookup table for bytes rather than bits, and is defined here. edit: As commenters note, using an 8-bit lookup table at the end is not a very good idea and can be improved on.