I want to sum all indexes of set bits.
http://bitmath.blogspot.com/2023/01/weighted-popcnt.html?m=1 has an interesting implementation:
// sum of indexes of set bits
int A073642(uint64_t n)
{
return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
(__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
(__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
(__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
(__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
(__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}
(Godbolt: compiler-generated asm for x86-64-v3 (AVX2 like Haswell) for MSVC, GCC, and clang, which interestingly auto-vectorizes four of the popcounts.)
However, I am looking for a way to do it without using popcount so many times.
I try to do this in assembly. The popcount operation is quite fast, but it can be done in smaller number of instructions, because in every popcount we repeat the same phase (especially if hardware popcount isn't available, like on RISC-V perhaps, or x86 before Nehalem).
This is something like a "bit puzzle” and I should probably use some smart masks and only elementary instructions (arithmetic/logical operations, conditional move/set/jump) of assembly, but I don’t know how.
I can’t use AVX-512.
A naive but portable way to do this would be to use a look-up table. I tried to compare the performance with the function above. When compiled with -mpopcnt flag, the popcount version is understandably faster, about 2-3x at around 30 cycles. Without the popcount compiler flag, this function is something like twice as fast.