Popcount assembly / sum indexes of set bits


I want to sum the indexes of all set bits in a 64-bit integer (e.g. for n = 0b10010, bits 1 and 4 are set, so the result is 1 + 4 = 5).

http://bitmath.blogspot.com/2023/01/weighted-popcnt.html?m=1 has an interesting implementation:

// sum of indexes of set bits
int A073642(uint64_t n)
{
    return __popcnt64(n & 0xAAAAAAAAAAAAAAAA) +
          (__popcnt64(n & 0xCCCCCCCCCCCCCCCC) << 1) +
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
}

(Godbolt: compiler-generated asm for x86-64-v3 (AVX2 like Haswell) for MSVC, GCC, and clang, which interestingly auto-vectorizes four of the popcounts.)

However, I am looking for a way to do it without using popcount so many times.

I'm trying to do this in assembly. The popcount operation is quite fast, but it should be possible in a smaller number of instructions, because every one of those popcounts repeats the same work (especially if hardware popcount isn't available, as on some RISC-V targets, or on x86 before Nehalem).

This is something like a "bit puzzle" and I should probably use some clever masks and only elementary assembly instructions (arithmetic/logical operations, conditional move/set/jump), but I don't know how.

I can’t use AVX-512.
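
For reference, this is the plain scalar loop I'm effectively trying to speed up (a throwaway reference, just to pin down the semantics):

#include <stdint.h>

// naive reference: walk the bits and add the index of each set bit
static unsigned sum_set_bit_indexes(uint64_t n) {
    unsigned sum = 0;
    for (unsigned i = 0; i < 64; i++)
        if ((n >> i) & 1)
            sum += i;
    return sum;
}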

There are 3 best solutions below

Simon Goater

A naive but portable way to do this is to use a look-up table. I compared its performance with the function above: when compiled with the -mpopcnt flag, the popcount version is understandably faster, about 2-3x at around 30 cycles; without that flag, this function is something like twice as fast.

static inline uint32_t A073642_nopopcnt(uint64_t n) {
  // nibbles[i][j] = sum of (4*i + b) over the set bit positions b of the 4-bit value j,
  // i.e. the contribution of nibble i of n when that nibble equals j
  static const uint8_t nibbles[16][16] = {
{0, 0, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6},
{0, 4, 5, 9, 6, 10, 11, 15, 7, 11, 12, 16, 13, 17, 18, 22},
{0, 8, 9, 17, 10, 18, 19, 27, 11, 19, 20, 28, 21, 29, 30, 38},
{0, 12, 13, 25, 14, 26, 27, 39, 15, 27, 28, 40, 29, 41, 42, 54},
{0, 16, 17, 33, 18, 34, 35, 51, 19, 35, 36, 52, 37, 53, 54, 70},
{0, 20, 21, 41, 22, 42, 43, 63, 23, 43, 44, 64, 45, 65, 66, 86},
{0, 24, 25, 49, 26, 50, 51, 75, 27, 51, 52, 76, 53, 77, 78, 102},
{0, 28, 29, 57, 30, 58, 59, 87, 31, 59, 60, 88, 61, 89, 90, 118},
{0, 32, 33, 65, 34, 66, 67, 99, 35, 67, 68, 100, 69, 101, 102, 134},
{0, 36, 37, 73, 38, 74, 75, 111, 39, 75, 76, 112, 77, 113, 114, 150},
{0, 40, 41, 81, 42, 82, 83, 123, 43, 83, 84, 124, 85, 125, 126, 166},
{0, 44, 45, 89, 46, 90, 91, 135, 47, 91, 92, 136, 93, 137, 138, 182},
{0, 48, 49, 97, 50, 98, 99, 147, 51, 99, 100, 148, 101, 149, 150, 198},
{0, 52, 53, 105, 54, 106, 107, 159, 55, 107, 108, 160, 109, 161, 162, 214},
{0, 56, 57, 113, 58, 114, 115, 171, 59, 115, 116, 172, 117, 173, 174, 230},
{0, 60, 61, 121, 62, 122, 123, 183, 63, 123, 124, 184, 125, 185, 186, 246}};
  uint32_t i, sumsetbitpos = 0;
  for (i=0;i<16;i++) {
    sumsetbitpos += nibbles[i][(n >> (i << 2)) & 0xf];
  }
  return sumsetbitpos;
}
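
For what it's worth, the table doesn't have to be typed out by hand; a quick generator sketch (assuming the same layout, where nibbles[i][j] is the sum of 4*i + b over the set bit positions b of j):

#include <stdio.h>

// print the nibbles[16][16] table used above
int main(void) {
    for (unsigned i = 0; i < 16; i++) {
        printf("{");
        for (unsigned j = 0; j < 16; j++) {
            unsigned sum = 0;
            for (unsigned b = 0; b < 4; b++)
                if (j & (1u << b))
                    sum += 4*i + b;            // bit b of nibble i has index 4*i + b
            printf("%u%s", sum, j == 15 ? "" : ", ");
        }
        printf("},\n");
    }
}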

Peter Cordes

If you're doing total += A073642(arr[i]) over a whole array (like 256 bytes or preferably larger), AVX2 with Harley-Seal would be very good. The bitwise boolean work stays the same cost; only the popcount part gets more expensive (to do the weighting).

If you want separate weighted-popcount results for multiple input numbers (not summing them), you can't use Harley-Seal, but SIMD is still effective. This answer shows a way to use SIMD for just a single number (which may be about equal in speed to 6x scalar popcnt), but it wouldn't be much more expensive to get 2 results for 2 inputs, or with AVX2, 4 results for 4 inputs. You might need 2 different vector constants for the weights of odd vs. even nibbles.


The SSSE3 / AVX2 pshufb parallel-lookup popcount strategy can be adapted. It might not actually be better than scalar for a single input. (Especially not if you have hardware popcnt support, which has 1/clock throughput on Intel, 3 or 4 per clock on AMD Zen 2 or Zen 3 respectively: https://uops.info/ )

On a CPU with fast SSSE3 pshufb but without hardware popcnt (e.g. 2nd-gen Core 2, aka Penryn), this might even be good. All other CPUs either have fast scalar popcnt or don't have SSSE3, or have it but 128-bit shuffles like pshufb are slower (like on 1st-gen Core, Conroe / Merom).

Instead of doing 6 separate popcounts on 6 AND results, you can do two sets of lookups: one for the sub-nibble masks, giving a result from 0 to 6 within each nibble for cnt(x & 0xa) + 2*cnt(x & 0xc), and another that just counts bits normally within each nibble. But instead of adding all the nibbles into one popcount for the whole integer, you keep those per-nibble counts separate and combine them in different ways, so you don't have to redo the splitting and lookup part.

#include <stdint.h>
#include <immintrin.h>

// https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-sse-lookup.cpp  was a starting point for this code.
uint64_t weighted_popcnt_SSE_lookup(uint64_t n) {

    size_t i = 0;

    const __m128i flat_lookup = _mm_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );
    const __m128i weighted_lookup = _mm_setr_epi8(
        0, 0, 1, 1, 2, 2, 3, 3,
        3, 3, 4, 4, 5, 5, 6, 6
    );

    const __m128i low_mask = _mm_set1_epi8(0x0f);

    // normally we'd split odd/even nibbles from 128 bits of data.
    // but we're using only a single 64-bit scalar input.
//    __m128i v = _mm_set_epi64x(n>>4, n);  // high nibbles in the high half, low nibbles in the low half of a vector.
  //                                   // movq xmm0, rdi ; movdqa ; psrlw ; punpcklqdq

    __m128i v = _mm_cvtsi64_si128(n);                 // movq
    v = _mm_unpacklo_epi8( v, _mm_srli_epi16(v,4) );  // movdqa ; psrlw ; punpcklbw
               // nibbles unpacked in order: byte 3 of v is n>>(3*4) & 0xf

    __m128i vmasked = _mm_and_si128(v, low_mask);  // pand
    __m128i counts01 = _mm_shuffle_epi8(weighted_lookup, vmasked);  // 0 to 6 according to popcnt(v & 0xa) + 2*popcnt(v & 0xc) in each nibble
    __m128i nibble_counts = _mm_shuffle_epi8(flat_lookup, vmasked);  // SSSE3 pshufb

/*
Each nibble has a different weight, so scale each byte differently when or before summing
          (__popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2) +
          (__popcnt64(n & 0xFF00FF00FF00FF00) << 3) +
          (__popcnt64(n & 0xFFFF0000FFFF0000) << 4) +
          (__popcnt64(n & 0xFFFFFFFF00000000) << 5);
                                     ^    ^
                                     |    \ 2nd nibble counted once
                                  this nibble scaled by <<3 and <<4.
                  Or <<1 and <<2 if we factor out the <<2 common to all nibble-granularity chunks
*/

   // oh duh, these are just the integers from 15 down to 0.
   __m128i scale_factors = _mm_set_epi8(0b1111<<2, 0b1110<<2, 0b1101<<2, 0b1100<<2, 0b1011<<2, 0b1010<<2, 0b1001<<2, 0b1000<<2,
                                        0b0111<<2, 0b0110<<2, 0b0101<<2, 0b0100<<2, 0b0011<<2, 0b0010<<2, 0b0001<<2, 0b0000<<2);

   // neither input has the MSB set in any byte, so it doesn't matter which one is the unsigned or signed input.
   __m128i weighted_nibblecounts = _mm_maddubs_epi16(nibble_counts, scale_factors);  // SSSE3 pmaddubsw
   // Even if we didn't scale by the <<2 here, max element value is 15*15 + 14*15 = 435 so wouldn't fit in a byte.
   // Unfortunately byte multiply is only available with horizontal summing of pairs.


    counts01 = _mm_sad_epu8(counts01, _mm_setzero_si128()); // psadbw  hsum unsigned bytes into qword halves

    weighted_nibblecounts = _mm_madd_epi16(weighted_nibblecounts, _mm_set1_epi32(0x00010001));  // pmaddwd sum pairs of words to 32-bit dwords.  1 uop but worse latency than shift/add

    // sum the dword elements and counts01
    __m128i high = _mm_shuffle_epi32(weighted_nibblecounts, _MM_SHUFFLE(3,3, 1,1));  // pshufd or movhlps

    weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, counts01);  // add to the bottom dword of each qword, in parallel with pshufd latency
    weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, high);

    high = _mm_shuffle_epi32(weighted_nibblecounts, _MM_SHUFFLE(3,2, 3,2));  // pshufd, movhlps, or punpckhqdq
    weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, high);

    unsigned result = _mm_cvtsi128_si32(weighted_nibblecounts);  // movd  extract the low dword
    return result;


   // A different nibble interleave order, like 15,0, 14,1 etc. would make the max sum 225 (15*15), allowing psadbw for hsumming if we factor out the <<2
   // weighted_nibblecounts = _mm_sad_epu8(weighted_nibblecounts, _mm_setzero_si128());  // psadbw
}

Compiled (Godbolt) with clang -O3 -march=penryn (where it might be the best option):

weighted_popcnt_SSE_lookup:             # @weighted_popcnt_SSE_lookup
        movq    xmm0, rdi
        movdqa  xmm1, xmm0
        psrlw   xmm1, 4
        punpcklbw       xmm0, xmm1              # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_1] # xmm1 = [0,0,1,1,2,2,3,3,3,3,4,4,5,5,6,6]
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_2] # xmm2 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
        pshufb  xmm2, xmm0
        pmaddubsw       xmm2, xmmword ptr [rip + .LCPI0_3]
        pshufb  xmm1, xmm0
        pxor    xmm0, xmm0
        pmaddwd xmm2, xmmword ptr [rip + .LCPI0_4]
        psadbw  xmm0, xmm1
        pshufd  xmm1, xmm2, 245                 # xmm1 = xmm2[1,1,3,3]
        paddd   xmm2, xmm0
        paddd   xmm2, xmm1
        pshufd  xmm0, xmm2, 238                 # xmm0 = xmm2[2,3,2,3]
        paddd   xmm0, xmm2
        movd    eax, xmm0
           # 19 single-uop instructions, 17 not counting loading constants
           # Or 16 not counting pxor-zeroing, but that gets overwritten without AVX non-destructive destination encodings.
        ret

For Conroe, the other CPU with SSSE3 but not popcnt, you'd choose different shuffles. 64-bit granularity punpcklqdq is more efficient than punpcklbw although pshufb is still multiple uops. e.g. you'd _mm_set_epi64x(n>>4, n) with movq / movdqa / psrlw / punpcklqdq so the high nibbles of each pair would be in the high half of the vector, not alternating high/low in place-value order. And in the horizontal sum, you'd use movhlps into an old XMM register instead of pshufd to copy-and-shuffle, because 128-bit-wide shuffles with granularity narrower than 64-bit are slower. (See Fastest way to do horizontal SSE vector sum (or other reduction) for details). The Conroe-friendly version would probably not be any worse on Penryn.
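
In intrinsics, that alternate nibble split might look something like this (an untested sketch of the layout described above; the lookup tables and scale factors would need reordering to match, since low nibbles end up in the low half and high nibbles in the high half):

#include <stdint.h>
#include <immintrin.h>

// Conroe-friendly split: one nibble per byte, grouped by half instead of interleaved
static inline __m128i split_nibbles_conroe(uint64_t n) {
    __m128i v  = _mm_cvtsi64_si128(n);            // movq
    __m128i hi = _mm_srli_epi16(v, 4);            // psrlw (the compiler adds the movdqa copy)
    v = _mm_unpacklo_epi64(v, hi);                // punpcklqdq: low nibbles low, high nibbles high
    return _mm_and_si128(v, _mm_set1_epi8(0x0f)); // pand: isolate one nibble per byte
}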

TODO: Try an extra pshufb to put bytes into 15,0,14,1,13,2,... order so we can psadbw after pmaddubsw instead of needing to successively widen elements from word to dword to qword. Probably still have to psadbw the counts01 vector separately, since there's no earlier point where we can scale the whole-nibble counts by <<2 before their psadbw without overflowing a byte and losing the ability to use psadbw.

If you were doing this over an array, you'd want to work on 128 bits of input data at once, splitting an input vector to odd/even nibbles rather than loading 64 bits at a time and unpacking. And keep the per-nibble counts separate until they're close to overflowing a byte, like Mula's popcnt_AVX2_lookup which uses vpaddb over a few iterations, deferring vpsadbw until later. (Same technique as in How to count character occurrences using SIMD, except you're adding values from 0..4 instead of 0 or 1.) And don't do the full hsum down to one scalar until after a loop; e.g. you want paddd or paddq in the outer loop to accumulate vectors of counts. So again, the same techniques as Mula's popcnt_AVX2_lookup or popcnt_SSE_lookup.

This needed 5 vector constants, vs. the 6x popcnt version needing 6 scalar integer constants in registers (actually just 5, since the final mask can instead be shr rax, 32 / popcnt eax, eax). It should be possible to chain LEAs to shift-and-add popcnt results, like lea eax, [rax*2 + rdx], like Horner's Rule for polynomials. e.g. the <<2 and <<3 counts can combine that way, so can the <<4 and <<5 counts. Then lea eax, [rax + rax*4] to combine those pairs.

With masks already in registers (already inverted for use with andn to copy-and-AND, if BMI1 is available), a scalar popcnt version really doesn't take many instructions, assuming you have hardware popcnt. Probably just 5x andn, 1x shr, 6x popcnt, and 5x lea. So that's 17 instructions, the same number of front-end uops as the SIMD version.
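
In C, that scalar-popcnt version might look something like this (a sketch with a made-up name, using GNU builtins; whether the compiler actually emits andn / lea as described depends on the target options):

#include <stdint.h>

// weights 1,2,4,8,16,32 applied to the six masked popcounts, combined pairwise
// so the compiler can use lea-style shift-and-add
static unsigned A073642_hwpopcnt(uint64_t n) {
    unsigned c0 = __builtin_popcountll(n & 0xAAAAAAAAAAAAAAAAull);
    unsigned c1 = __builtin_popcountll(n & 0xCCCCCCCCCCCCCCCCull);
    unsigned c2 = __builtin_popcountll(n & 0xF0F0F0F0F0F0F0F0ull);
    unsigned c3 = __builtin_popcountll(n & 0xFF00FF00FF00FF00ull);
    unsigned c4 = __builtin_popcountll(n & 0xFFFF0000FFFF0000ull);
    unsigned c5 = __builtin_popcountll(n >> 32);   // last mask replaced by a shift
    unsigned lo  = c0 + 2*c1;                      // weights 1 and 2
    unsigned mid = c2 + 2*c3;                      // weights 4 and 8, divided by 4
    unsigned hi  = c4 + 2*c5;                      // weights 16 and 32, divided by 16
    return lo + 4*mid + 16*hi;
}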


Footnote 1: SIMD popcount and Harley-Seal

See http://0x80.pl/articles/sse-popcount.html / https://github.com/WojciechMula/sse-popcount for details of the strategy, which they call sse-lookup or avx2-lookup. Along with others such as harley-seal which accumulate weighted counters across vectors of input data to amortize the lookup costs.

If you're summing this weighted-popcount across multiple inputs, perhaps Harley-Seal could be even more useful than for normal popcnt. It should be a nearly drop-in replacement since Harley-Seal already works by keeping counts for each bit-position separate in its temporaries. The vertical bitwise operations work like a 4-bit binary carry-save adder, with 1 bit in each vector.

So actually 128 or 256 of these counters in parallel, a separate one for each bit position. Inside the loop you do total += 16 * popcount(sixteens); after processing 16 vectors of input, keeping the ones, twos, fours, and eights vectors rolling across iterations. See https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-sse-harley-seal.cpp
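
For reference, the carry-save-adder step at the heart of Harley-Seal looks like this as a scalar sketch (the SIMD versions apply the same boolean operations to whole vectors):

#include <stdint.h>

// one CSA step: per bit position, a+b+c produces a "sum" bit in *lo
// and a "carry" bit (worth 2) in *hi
static inline void csa64(uint64_t *hi, uint64_t *lo, uint64_t a, uint64_t b, uint64_t c) {
    uint64_t u = a ^ b;
    *hi = (a & b) | (u & c);
    *lo = u ^ c;
}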

A weighted sum only has to change the actual popcount step to weighted_popcount; the bitwise operations stay the same cost. With AVX2 and normal popcount on Skylake, Harley-Seal is only a bit faster than vpshufb lookup over decent-sized arrays (like 2048 bytes). On Intel CPUs, AVX2 lookup and Harley-Seal are almost twice as fast as scalar popcnt. That might not be the case for AMD Zen where scalar popcnt has 3 to 4 per clock throughput (and loads have 3/clock throughput on Zen 3 and later). (https://uops.info/)

Making the popcount step more complicated (and slower) only affects a small part of the cost of Harley-Seal, but almost all of the work in a pure lookup version that doesn't use it, so Harley-Seal becomes even more attractive here.


Other possible ideas:

If you were summing over many input n values, positional popcount could set you up to apply the weights at the end. https://github.com/lemire/FastFlagStats and https://github.com/mklarqvist/positional-popcount

But it's probably more work to keep all 64 counts fully separate until the end; better to apply these techniques only partially and combine according to your weights at some intermediate point. I expect Harley-Seal is the best bet.

If AVX-512 were available (especially Ice Lake's AVX512VPOPCNTDQ or AVX512BITALG), you could use vpopcntb instead of vpshufb, using similar ideas to what I did for SSSE3.


Clang can optimize the 6x popcnt version into asm that does two scalar popcounts, and one AVX2 popcount of the input broadcasted and masked 4 different ways, in different elements of a YMM vector. In general it can vectorize sums of popcounts, including this case of the sum of 6 popcounts.
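
For example, summing it over an array gives clang an easy target for that auto-vectorization (a sketch; A073642 rewritten with the GNU builtin so it compiles with gcc/clang):

#include <stddef.h>
#include <stdint.h>

// the question's function, using __builtin_popcountll instead of the MSVC intrinsic
static unsigned A073642(uint64_t n) {
    return  __builtin_popcountll(n & 0xAAAAAAAAAAAAAAAAull) +
           (__builtin_popcountll(n & 0xCCCCCCCCCCCCCCCCull) << 1) +
           (__builtin_popcountll(n & 0xF0F0F0F0F0F0F0F0ull) << 2) +
           (__builtin_popcountll(n & 0xFF00FF00FF00FF00ull) << 3) +
           (__builtin_popcountll(n & 0xFFFF0000FFFF0000ull) << 4) +
           (__builtin_popcountll(n & 0xFFFFFFFF00000000ull) << 5);
}

// summing over an array: clang can vectorize the masked popcounts with AVX2
uint64_t A073642_sum(const uint64_t *a, size_t len) {
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += A073642(a[i]);
    return total;
}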

Peter Cordes

TL:DR: a classic bithack, using two multiplier constants, one for a flat sum of weights-within-bytes, and one for a weighted sum of flat counts-within-bytes, applying the right weight for each 8-bit group.

This is still slower on a machine with fast hardware popcount. It appears my SSSE3 version for a single number is fastest, especially without inlining and hoisting the constants out of the loop. (And without auto-vectorizing a bithack over multiple numbers, which is quite effective with AVX2.) But this is vastly better than naively calling a bithack popcount 6 times: only about twice the throughput cost of one flat popcount done with bithacks.


If you have hardware support for popcount, like the Intel / MSVC __popcnt64 intrinsic you show in your example, one x86 popcnt instruction costs the same as a scalar integer multiply on Intel (1/clock throughput, 3 cycle latency), or the same as an integer add on Zen (4/clock throughput with 1 cycle latency on Zen 3 and later, 3/clock before that - https://uops.info). So most of the cost is in the masks (10-byte mov r64, imm64), and in applying them and adding the results (with lea to shift-and-add in a single cheap instruction). So probably 5x andn + 1x shr to feed 6x popcnt, and 5x lea to accumulate the results.

Anything you do with only traditional bit-manipulation operations (shifts, bitwise booleans, add/sub) is going to be sub-optimal on modern x86, or on any ISA with hardware popcount or with SIMD that's usable without big stalls to get data between it and the GP-integer domain. So this is mostly relevant for something like RISC-V without extension B's cpop instruction, or baseline x86-64 code built without SSSE3 or popcnt.

Using pure bithacks instead of hardware popcount, you're right that there's some scope for overlapping work between the popcounts, instead of starting each one from scratch with a different input. I'm posting this answer partly because you seemed curious about doing this with purely basic bit-manipulation. Maybe there's room for more optimization, but it doesn't look good compared to hardware popcnt.


Consider the standard SWAR bithack as described in Count the number of set bits in a 32-bit integer and https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

// standard SWAR popcount for uint64_t n
     uint64_t ones = (n >> 1) & 0x5555555555555555;   // 0xaa>>1
        // ones is actually just the high half of each pair, the <<0 weighted bits
     uint64_t pairs = n - ones;   // optimization over (n&0x555...) + ones
     uint64_t quads = (pairs & 0x3333333333333333) + ((pairs >> 2) & 0x3333333333333333);
     uint64_t octs = (quads + (quads >> 4)) & 0x0F0F0F0F0F0F0F0F;  // fields are wide enough relative to their values that we can defer masking until after adding.
     return (octs * 0x0101010101010101uLL) >> (64-8);     // horizontal sum of bytes

 // weighted might need another shift/add step to widen to 16-bit before multiply,
 // since the weighted-sum result won't fit in a byte.

There are obvious similarities to the masking you're doing to feed each separate popcount. With the temporaries involving popcount within 2-bit, 4-bit, and 8-bit chunks of the input, we can accumulate weighted sums alongside the flat sums. At each step, we can add pairs of weighted elements like we're doing for the flat elements, and add in the next higher weight.

Since the narrowest elements have the smallest weight, it turns out the elements are always wide enough at every step to hold the weighted sum without overflow. For example, a 4-bit group with count(n & 0xa) + 2*count(n & 0xc) can have value 0..6, which fits in 3 bits so the high bit of each group is always zero. (This lets later steps defer masking until after adding, without carry-out corrupting the next group.)

At each step, we can shift and add the previous flat sums to get weighted sums within 4-bit groups, for example. The chain of flat sums is also needed: e.g. the 0xaa and 0xcc masks discard the low bit of each nibble, but those bits do get counted by the wider masks (0xF0..., 0xFF00..., and so on); only bit 0 of the whole number, whose weight is 0, is never counted by any mask.

Fortunately the shifting/masking to reduce the weighted sums to wider elements can use the same masks as the flat steps, saving instructions (especially for the case where this can't inline inside a loop, although you'd want to use SIMD there anyway.) It's the same shift/mask/add (or shift/add/mask) as for the flat popcount.

Unlike flat sums, applying weights along the way makes the values grow quite large, so we do have to keep doing some masking at each step. It also limits the options for summing with a multiplier constant like (x * 0x01010101...)>>(64-8), since that only works when the total fits in a chunk the size of one element. And the reduction has to continue past 8-bit chunks, because we still need to apply the weights for 16-bit and 32-bit groups.

Working version using just plain C integer bit-manipulation

#include <stdint.h>
// value-range 0 to 2016 fits in a 16-bit integer, so masking wider than that can be deferred
//inline
unsigned A073642_bithack(uint64_t n)
{
    uint64_t ones = (n >> 1) & 0x5555555555555555;   // 0xaa>>1
    uint64_t pairs = n - ones;  // standard popcount within pairs, not weighted

    uint64_t weighted_pairs = ((ones + (ones>>2)) & 0x3333333333333333)
        + ((pairs >> 2) & 0x3333333333333333) * 2;   // 0..6 range within 4-bit nibbles
             // aka  (pairs>>1) & (0x3333333333333333<<1).  But x86-64 can efficiently use LEA to shift-and-add in one insn so it's better to shift to match the same mask.
    uint64_t quads = (pairs & 0x3333333333333333) + ((pairs >> 2) & 0x3333333333333333);   // standard popcount within each nibble, 0..4
     
    // reduce weighted pairs (0xaa and 0xcc masks) and add __popcnt64(n & 0xF0F0F0F0F0F0F0F0) << 2)
    // resulting in 8 buckets of weighted sums, each a byte wide
    uint64_t weighted_quads = ((weighted_pairs + (weighted_pairs >> 4)) & 0x0F0F0F0F0F0F0F0F) 
              + (4*((quads >> 4) & 0x0F0F0F0F0F0F0F0F));  // 0 to 2*6 + 4*4 = 28 value-range
                // need some masking before adding.  4*quads can be up to 16 so can't use (quads >> (4-2)) & 0x0F0F0F0F0F0F0F0F
    uint64_t octs = (quads + (quads >> 4)) & 0x0F0F0F0F0F0F0F0F;  //  0..8   quad fields don't have their high bit set, so we can defer masking until after adding without overflow into next field

    // two separate sums of 8-bit groups, one of the first 3 weights
    unsigned sum_weighted_quads = (weighted_quads * 0x0101010101010101uLL)>>(64-8); // barely fits in 1 byte: max val 28 * 8 = 224
    // the second applying the last 3 relative weights to the flat sums
    unsigned sum_octweights = (octs * 0x0001020304050607uLL) >> (64-8);  // separate weight for each byte group, with <<3 factored out so the max sum is also 224 = 32*(1<<0 + 1<<1 + 1<<2)
    return sum_weighted_quads + 8 * sum_octweights; // apply the <<3 factored out of byte weights
}
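
A quick way to sanity-check the bithack against a brute-force version (a throwaway test sketch, appended to the file above):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// brute-force reference: add up the index of every set bit
static unsigned ref_sum(uint64_t n) {
    unsigned sum = 0;
    for (unsigned i = 0; i < 64; i++)
        if ((n >> i) & 1)
            sum += i;
    return sum;
}

int main(void) {
    for (int t = 0; t < 1000000; t++) {
        uint64_t n = ((uint64_t)rand() << 40) ^ ((uint64_t)rand() << 20) ^ (uint64_t)rand();
        if (A073642_bithack(n) != ref_sum(n)) {
            printf("mismatch for %016llx\n", (unsigned long long)n);
            return 1;
        }
    }
    puts("ok");
    return 0;
}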

I finally came up with a good cleanup strategy for octs and weighted_quads, with just 2 multipliers. An earlier version of that which widened to 16-bit chunks is in the revision history. The Godbolt link below has two #if alternate cleanups that keep widening with shifts / masks, perhaps good on a machine with quite slow multiply.

AArch64 can do this efficiently (27 instructions vs. 36 for x86-64) with its repeated-bit-pattern immediates for bitwise boolean operations allowing and-immediate for those 64-bit repeating constants. And a shifted source operand for some instructions like add x1, x1, x1, lsr 4. (On AArch64, the versions with more shift/add and fewer multiplies didn't cost many more instructions. But the change that introduced it saved 8 x86-64 instructions vs. saving 4 on AArch64.)

BMI1 andn will help on x86-64, allowing copy-and-AND instead of two separate instructions (mov / and). Put ~0x3333... in a register and use that inverted constant as the first operand, so you're actually ANDing with 0x3333... like we need to.

BMI2 rorx is a handy copy-and-shift if you're going to mask the result anyway. Normally you'd never use this SWAR bithack for popcount if BMI1/BMI2 were available, because that would imply popcnt was available. But here it's not a straight popcount, so we can take full advantage: e.g. the first step can be rorx rcx, rdi, 1 / and rcx, rdx instead of mov / shr / and, since the bit rotated into the top gets masked off anyway.

Unfortunately neither GCC nor clang use these tricks, so their asm output looks like: (Godbolt with test harness)

# GCC
A073642_bithack:
        mov     rcx, rdi
        movabs  rdx, 6148914691236517205   # 0x5555555555555555
        mov     rax, rdi
        movabs  rdi, 3689348814741910323   # 0x3333333333333333
        shr     rcx               # better: rorx rcx, rdi, 1 since we're masking the result
        and     rcx, rdx
        sub     rax, rcx
        mov     rdx, rcx
        mov     rsi, rax
        shr     rdx, 2
        and     rax, rdi          # copy-and-AND with `andn` with an inverted mask could avoid the mov rsi, rax to preserve a copy.
        add     rdx, rcx
        shr     rsi, 2
        and     rsi, rdi
        and     rdx, rdi
        lea     r8, [rdx+rsi*2]
        add     rax, rsi
        movabs  rsi, 1085102592571150095  # 0x0F0F0F0F0F0F0F0F
        mov     rdx, rax
        mov     rcx, r8
        shr     rdx, 4
        shr     rcx, 4
        add     rcx, r8
        mov     rdi, rdx
        add     rdx, rax
        movabs  rax, 283686952306183      # 0x0001020304050607
        and     rdi, rsi
        and     rcx, rsi
        and     rdx, rsi
        lea     rcx, [rcx+rdi*4]
        imul    rdx, rax
        movabs  rdi, 72340172838076673    # 0x0101010101010101
        imul    rcx, rdi
        shr     rdx, 56
        shr     rcx, 56
        lea     eax, [rcx+rdx*8]
        ret

This is 36 instructions not counting the ret (GCC for x86-64), vs. 19 for a classic flat popcount like you might use without HW popcount (or like GCC uses for __builtin_popcountll without -mpopcnt or -march=native). 5 of the 36 instructions are movabs r64, imm64, which can be hoisted out of a loop if we're doing this repeatedly. (But ideally use SIMD for that.)

Using (octs * 0x0001020304050607uLL) >> (64-8) to do a weighted sum of eight 8-bit chunks saved many shifting and masking instructions.

The alternate cleanup versions (included in the Godbolt link) are about 51 or about 56 instructions, doing more shift / mask / add steps with 0 or 1 multiply instead of 2. I haven't thrown this into uiCA, LLVM-MCA, or IACA to analyze for throughput, but these are all single-uop instructions and a lot of them can run on any port. There are a lot of shifts, though, so we might have p0/p6 bottlenecks on Intel.

By comparison, a classic bithack popcount (-mno-popcnt to defeat optimizing this idiom into the HW instruction) compiles to 19 x86-64 instructions including four mov r64, imm64 (3 masks and a multiplier). A073642_popcnt (your function) using hardware popcnt is 26 instructions (including the mask setup).

So it's not great, probably also worse than my SSSE3 SIMD answer even for a single input.

Clang does worse with some versions, inventing another constant to mask with (0x1C1C1C1C1C1C1C1C, from "optimizing" 4*((quads >> 4) & 0x0F0F0F0F0F0F0F0F) into a shift-by-2 and mask instead of using the 0x0F mask we already needed, and using LEA to shift-and-add for nearly the same cost as add). This is bad on x86-64, where it takes a 10-byte instruction to put that constant in a register, unlike AArch64 where it can be used as an immediate for an and instruction. Clang also sometimes invented 0x1111... = 0x5555... & 0x3333... when I wrote an early step a different way, masking ones and ones>>2 each separately with 0x3333... before adding.

SIMD shifts would make some of the masking unnecessary, and make hsumming easier. e.g. vpsrlw xmm1, xmm0, 8 would implicitly mask with 0x00ff00ff00ff00ff since bits get discarded at the bottom of each 16-bit element. It would also make horizontal sum of bytes without overflow cheap ([v]psadbw against zero), letting us add two things before one hsum, instead of needing to keep them separate to avoid the sums overflowing a byte. [v]pmaddubsw could apply separate weights to each byte before that, without overflowing the total if we factor out *8 from it like we do here.


PS: weighted sum within a 4-bit nibble looks like this:

0000 : 0 + 0
0001 : 0 + 0
0010 : 1 + 0    = 1
0011 : 1 + 0

0100 : 0 + 1*2  = 2
0101 : 0 + 1*2
0110 : 1 + 1*2  = 3
0111 : 1 + 1*2

1000 : 1 + 1*2  = 3
1001 : 1 + 1*2
1010 : 2 + 1*2  = 4
1011 : 2 + 1*2

1100 : 1 + 2*2  = 5
1101 : 1 + 2*2
1110 : 2 + 2*2  = 6
1111 : 2 + 2*2

Initial design notes:

If we can generate those 0..6 values instead of the usual 0..4 integer values in each 4-bit SWAR accumulator, we can carry on with the rest of the popcount pattern just once to get popcnt(n & 0xAA...AA) + (popcnt(n & 0xCC...CC) << 1).

If we can add in the popcnt(n & 0xF0F0F0F0F0F0F0F0)<<2 weights at some earlier stage, as soon as they'll fit without overflowing, then the later stages of that don't need to be done separately either. 4<<2 is 16, so it won't fit in a nibble but will fit in a byte. (Even when added to 2 * 6 = 12 from the sum of weights from bit-within-nibble place values for two nibbles of all-ones.)

To some degree we actually need to keep things separate because the higher weights need combinations of just the flat popcounts for nibbles, bytes, words, etc. without contributions from the small-granularity weighted sums. Still, it looks like we can use the temporaries from the flat popcnt to feed the weighted sums along the way.