I'm on an Apple M1 processor.
What I'm trying to do is efficiently count the 1 bits in a large char array in Rust. I looked up the ARM NEON instructions, and I think I can do it via a `cnt` instruction (which counts 1s per 8-bit block) and then an `addv` to add each of the 8-bit blocks.
To start with, I figured I would feed in a 64-bit uint:
```rust
fn aarch64_hamming_weight(inp: u64) -> u64 {
    let mut outp: u64;
    unsafe {
        asm!(
            "dup v0.2d, {input}",
            "cnt v0.16b, v0.16b",
            "addv b1, v0.16b", // this adds all the bytes of the vector, how do i get the result?!
            //"fmov {output2}, b1",
            "fmov {output}, v0.d[1]",
            input = in(reg) inp,
            output = out(reg) outp,
        )
    }
    return outp;
}
```
This nearly works, but if the number is wider than 8 bits the result isn't quite right: 225 (binary count 4) gives 4, but 425 (binary count 5) gives 260.
So:
- I need a way to get the addv's output
- I need a way for it to work with an arbitrary char array (which would need a loop, pulling in 128 bits at a time)
When I started writing an answer, I thought it was going to show that inline asm was totally pointless, and that rustc was going to do a good job vectorizing `u8.count_ones()`. That is unfortunately not the case, but you should only use asm if you're going to write a whole loop, not one `u64` at a time. (Possibly still inline asm, but writing a whole function might be reasonable.)

If Rust has SIMD intrinsics for ARM/AArch64, that would be a much better choice than inline asm, and definitely worth using.
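Rust does in fact expose NEON intrinsics in `std::arch::aarch64` on stable. As a hedged sketch (the function name, structure, and scalar fallback are mine, not from the question), a version using `vcntq_u8` (the `cnt` instruction) and `vaddvq_u8` (the `addv` the question asks about) might look like this. Note it still reduces every vector of counts to scalar, which the rest of this answer argues against for peak throughput:

```rust
// Hedged sketch: popcount of a byte slice. On AArch64 this uses the
// cnt + addv instructions from the question, via intrinsics instead of
// inline asm; elsewhere it falls back to scalar count_ones().
pub fn popcount_bytes(data: &[u8]) -> u64 {
    #[cfg(target_arch = "aarch64")]
    unsafe {
        use std::arch::aarch64::{vaddvq_u8, vcntq_u8, vld1q_u8};
        let chunks = data.chunks_exact(16);
        let rest = chunks.remainder();
        let mut total: u64 = 0;
        for chunk in chunks {
            let v = vld1q_u8(chunk.as_ptr()); // load 16 bytes
            let counts = vcntq_u8(v);         // cnt: per-byte popcount
            // addv: horizontal byte sum; at most 16 * 8 = 128, fits in a u8
            total += vaddvq_u8(counts) as u64;
        }
        return total + rest.iter().map(|&b| b.count_ones() as u64).sum::<u64>();
    }
    // Scalar fallback on other architectures.
    #[cfg(not(target_arch = "aarch64"))]
    {
        data.iter().map(|&b| b.count_ones() as u64).sum()
    }
}
```

This also answers the question's `addv` problem directly: `vaddvq_u8` gives you the horizontal sum as a scalar, so there's no need to `fmov` a raw vector lane back out (which is why the original code returned 260 for 425: it was reading the per-byte `cnt` results, `0x0104`, instead of the reduced sum).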
For counting a whole array, you don't want to reduce every `cnt` result to scalar separately, especially not all the way to a general-purpose integer register, and especially not for a single `u64` at a time instead of a full 128 bits. Add into a vector of counts that you only reduce at the end.

clang/LLVM knows how to auto-vectorize popcount (`__builtin_popcount()` in C) over an array, so I was hoping LLVM-based rustc would do OK for AArch64. It is somewhat OK for `u64`, but terrible for `u8`, unfortunately. If you can safely point a `u64` slice at your data (or a reference, or however Rust does things), then you can get a good fraction of the benefit portably, with no brittle inline asm, and in a way that future compiler versions can hopefully improve on. I compiled on https://godbolt.org/z/7qP7cK6bv with `-O --target aarch64-unknown-linux-gnu` (with the default `-C opt-level=3`).

You'd think `sum: u32` would help, requiring less widening inside the loop. (If you have huge arrays that could overflow that, use an outer loop.) But actually, rustc still widens to 64-bit, I think, and then does even more work to truncate those counts to 32-bit. Almost certainly a missed optimization. (Perhaps a strategy built around x86 `psadbw`, which sums bytes into u64 chunks in one step; LLVM auto-vectorizes popcount with `pshufb` and that for AVX2 on x86.)

And you'd think doing the same thing for a `u8` array should vectorize to the same code, with just some extra scalar cleanup, right? Well, it should, but actually it still only unrolls by 4 the way LLVM likes to do, except that's 4 `u8` elements, and the inner loop becomes total garbage. So it's vectorizing `(v as u64).count_ones()`, using canned recipes that it doesn't know how to optimize further. (E.g. the `uaddlp` widening is pointless if the `cnt` results are zero except in the low byte of each 64-bit chunk; it could just use vertical `add` with 64-bit chunks.)

Compare against what you get from a C compiler for the manually vectorized ARM code from https://github.com/WojciechMula/sse-popcount/. If the ARM NEON code there is as well tuned as the x86 code in the same repo, that's what you should be aiming for in terms of final asm output from your compiler, however you get there.
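For concreteness, the portable `u64` version discussed above is tiny in source form; this sketch (the function name is mine) is the kind of thing that gets the somewhat-OK auto-vectorized asm, as opposed to the garbage from the `u8` loop:

```rust
// Portable popcount over a u64 slice: rustc/LLVM auto-vectorizes this
// reasonably well for AArch64 (per-element cnt plus widening adds),
// with no target-specific code at all.
pub fn popcount_u64_slice(data: &[u64]) -> u64 {
    data.iter().map(|&x| x.count_ones() as u64).sum()
}
```

The point of comparing this against the hand-tuned NEON code is that the source stays trivial and future compilers can improve the codegen for free.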
I'd guess that an inner loop that only widens the per-byte counts to 16-bit would be good, running for as many iterations as are possible without overflowing a `u16` with `+= 16` (from seeing all-ones in the pair of bytes that feed into it), i.e. 65535/16 rounded down = 4095 vectors before widening out to 64-bit chunks.

Mula's version does only 31 inner-loop iterations, accumulating into 8-bit elements with `vaddq_u8`. But `uadalp v0.8h, v2.16b` horizontal-pair accumulation of u8 bytes into u16 halfword elements wasn't available for 32-bit NEON, only for AArch64 ASIMD.

The key point is to do the minimum amount of work in the innermost loop, ideally using only one cheap instruction per vector of `cnt` results. If you can get some widening for free as you accumulate (or cheaply enough not to be a bottleneck), that lets execution stay in the inner loop much longer without possible overflow. (It also means the later horizontal reduction step in the outer loop is cheaper.)

`uadalp` has somewhat different performance from pure-vertical `add` on some CPUs, but is very likely worth using. The Cortex-A57 optimization guide says it has 4 (1) cycle latency and 1/clock throughput. The (1) part is the accumulate latency for the destination operand: it allows late forwarding from a previous operation of the same type, after the horizontal-pair addition of the source elements has already happened. So in a loop doing `sum += pairs` with `uadalp`, the loop-carried dependency chain is only 1 cycle long. This is ideal.

Regular `add` on integer vectors is 3-cycle latency with 2/clock throughput, so it has better throughput but creates a 3-cycle loop-carried dependency chain. (And it doesn't get a step of horizontal accumulation done, and will overflow much sooner because it's only using 8-bit sums.)

On Cortex-A57, `cnt` is also only 1/clock throughput, so `uadalp`'s 1/clock throughput isn't an overall bottleneck, unless they compete for the same execution port. `uadalp` runs on F1; SIMD-integer `add` runs on F0/F1; `cnt` runs on F0/F1. So either way, addition operations need to steal some `cnt` throughput, and the question becomes whether `cnt` can get scheduled efficiently, mostly to port F0, when F1 has a bunch of future `uadalp` operations queued up. (Queued on data that isn't ready yet; out-of-order exec looks ahead. On most CPUs, operations are scheduled to ports as they rename/allocate from the front-end into the back-end. The CPU doesn't know what order data will become ready, but it can see the queue lengths for the different ports at that point.)

It would be possible to do (pseudocode):
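The pseudocode block itself is missing here; reconstructing from the following paragraph's description (one vertical `add` of two `cnt` results, then a single `uadalp` into the accumulator), it was presumably something along these lines. This is my untested sketch using the instruction mnemonics as function names:

```
c0   = cnt(load(p))        // per-byte popcounts of the first 16-byte vector
c1   = cnt(load(p + 16))   // per-byte popcounts of the second vector
both = add(c0, c1)         // vertical u8 add; at most 8 + 8 = 16 per byte, no overflow
acc  = uadalp(acc, both)   // one pairwise-widening accumulate into u16 elements
```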
That means only one `uadalp` per loop iteration, in case that's a throughput bottleneck, while still using only `uadalp` into the accumulator, which keeps the loop-carried dependency chain through the accumulator short. (This assumes other AArch64 CPUs also do early forwarding for accumulate instructions.)

It does make the short, independent dependency chain within one iteration slightly longer (from the load to the input of the accumulation), instead of keeping it as two separate dep chains. Low-end ARM CPUs generally have very limited out-of-order execution capability (small scheduler buffers) for finding instruction-level parallelism and overlapping work across loop iterations, so keeping the non-loop-carried dep chains short makes that easier for the CPU. And long dep chains totally suck on an in-order AArch64 like Cortex-A53 unless you unroll a lot.
into the accumulator which keeps the loop-carried dependency chain through the accumulator short. (Assuming other AArch64 CPUs also do early forwarding for accumulate instructions).And it makes the short independent dependency chain within one iteration slightly longer (from load to input of accumulation), instead of keeping it as two separate dep chains.) Low-end ARM CPUs generally have very limited out-of-order execution capabilities (small scheduler buffers) to find instruction-level parallelism and overlap work across loop iterations. Keeping the non-loop-carried dep chains short makes that easier for the CPU. And it totally sucks for an in-order AArch64 like Cortex-A53 unless you're unrolling a lot.