There are similar older questions, but they use intrinsics and old instruction sets. I have a function f written with the C++ vector class library (https://www.agner.org/optimize/#vectorclass):
int f(const Vec32uc &a) {
Vec32uc b{horizontal_max(a)};
return horizontal_find_first(a == b);
}
and compiled on https://godbolt.org/z/Gfjo7zo1c with -O3 -march=alderlake. The result looks a bit long:
f(Vec32uc const&):
vmovdqa ymm1, YMMWORD PTR [rdi]
vextracti128 xmm0, ymm1, 0x1
vpmaxub xmm2, xmm1, xmm0
vpunpckhqdq xmm0, xmm2, xmm2
vpmaxub xmm0, xmm0, xmm2
vpsrldq xmm2, xmm0, 4
vpmaxub xmm0, xmm0, xmm2
vpsrldq xmm2, xmm0, 2
vpmaxub xmm0, xmm0, xmm2
vpsrldq xmm2, xmm0, 1
vpmaxub xmm0, xmm0, xmm2
vpbroadcastb ymm0, xmm0
vpcmpeqb ymm1, ymm1, ymm0
vpmovmskb eax, ymm1
test eax, eax
je .L3
bsfl eax, eax
.L1:
vzeroupper
ret
.L3:
mov eax, -1
jmp .L1
compared to intrinsic solutions, e.g. Find index of maximum element in x86 SIMD vector, but I may be wrong. Is there a more efficient implementation using a SIMD library and the modern x86-64 instruction set?
SSE4.1 phminposuw
phminposuw is a horizontal min of 8x 16-bit elements. 8 elements otherwise takes 3 shuffle/min steps, 6 instructions, so if we can use fewer uops than that to massage our data into an input for _mm_minpos_epu16, we come out ahead. It's single-uop with 3 to 4 cycle latency across Zen / Skylake / Gracemont, so it's also a latency win vs. 6 shuffle/min ops (https://uops.info/). It's not available for wider vectors even with AVX-512, so we do need two reduction steps to get our 32 elements down to 8, but we can get them zero-extended to 16-bit for free by being clever about it. (It does also find the position of the min, but only within our already-reduced input, which isn't useful here.)
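For reference, a minimal illustration (my own example, not code from the answer) of what the intrinsic does on its own for 8 u16 elements:

#include <immintrin.h>

// phminposuw: the minimum u16 value comes back in element 0 of the result,
// its index in element 1.
int min_index_u16x8(__m128i v) {
    __m128i r = _mm_minpos_epu16(v);   // SSE4.1
    return _mm_extract_epi16(r, 1);    // index (0..7) of the minimum element
}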
Bitwise NOT turns UINT8_MIN (0) into UINT8_MAX (255) and vice versa. It's the same thing as 255-x, which more obviously does what we want: a 1:1 transform that reverses an unsigned <= compare, without wrapping for any uint8_t value. (Thanks to @harold and @Homer512 in comments for fixing my initial idea.)
@chtz had a good idea: unpack the data with indices, so the index is part of the max or min u16 element we find, skipping the broadcast/compare/movemask/bit-scan. This is orthogonal to using phminposuw, and works for any ISA and element size as long as we have a vertical min or max for elements that are twice as wide. The first reduction step changes to doing something to both inputs, widening elements, instead of just shuffling the high half down to the bottom. The biggest benefit is critical-path latency, but this does also help throughput slightly.
In case of a tie, when the original 8-bit values are equal, the low half (the index) is the tie-break. If you want the first occurrence of the max element, min with inverted elements and normal indices works: the smaller index will be the min. Or max with normal elements and inverted indices, which you flip back after the reduction.
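A minimal intrinsics sketch of that variant (inverted data in the high byte of each u16, normal index in the low byte), requiring AVX2 + SSE4.1. This is my own reconstruction and may differ in details from the Godbolt version linked below; the function name is mine. A VCL Vec32uc should convert to __m256i implicitly, so this can serve as the body of f:

#include <immintrin.h>

int index_of_max_u8(__m256i v) {
    const __m256i idx = _mm256_setr_epi8(
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
        16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31);
    __m256i inv = _mm256_xor_si256(v, _mm256_set1_epi8(-1));   // ~v: the max becomes the min
    // Build u16 candidates = (inverted data << 8) | index.  vpunpckl/hbw is in-lane,
    // but since byte p of idx holds the value p, every data byte still pairs with its own index.
    __m256i lo = _mm256_unpacklo_epi8(idx, inv);
    __m256i hi = _mm256_unpackhi_epi8(idx, inv);
    __m256i m  = _mm256_min_epu16(lo, hi);                      // 16 candidates
    __m128i m8 = _mm_min_epu16(_mm256_castsi256_si128(m),
                               _mm256_extracti128_si256(m, 1)); // 8 candidates
    __m128i r  = _mm_minpos_epu16(m8);   // SSE4.1 phminposuw: min u16 in element 0
    return _mm_cvtsi128_si32(r) & 0xFF;  // its low byte is the index of the max (first occurrence on ties)
}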
Godbolt with a test loop (uiCA) to see how they inline and hoist constants, i.e. to get a loop I can copy/paste into https://uica.uops.info/
I was going to write the indices as normal numbers and use unpack(~indices, v). But MSVC didn't constant-propagate through the bitwise NOT, doing it at run-time! Clang is too clever for its own good: it sees that some of the indices input elements aren't used by _mm256_unpacklo_epi8, so it zeros them out. Same for the unpack hi. But that means it needs to load 2 different vector constants instead of having unpack read parts of the same one! (TODO: report on https://github.com/llvm/llvm-project/issues/)
There are multiple strategies for unpacking:
- chtz's suggestion of vpunpckl/hbw, as above. The in-lane behaviour is funky for 256-bit shuffles, but they do pair corresponding elements of the two input vectors. This way needs fewer vector constants (just one non-trivial one, except when clang pessimizes it) and fewer vector ALU uops, but more of them are shuffles, so on Skylake and earlier, port 5 could be a bottleneck depending on the surrounding code. Not a problem as cleanup for your loop in Why performance of this function is so slow on Intel i3-N305 compared to AMD Ryzen 7 3800X? (where chtz's strategy could be expanded to your whole loop, so the cleanup is just a max reduction of u16 elements, including the inversion for phminposuw). Ice Lake and later, and Gracemont, can run vpunpck integer shuffles on more than one port, so back-end port throughput bottlenecks aren't a problem.
- Odd/even: shift the even elements left to the top of a u16, and mask to isolate the odds. (Then XOR to both flip the data and apply the index into the other half.) Fewer of the uops are shuffles, but 1 more total uop than unpack. And more of the instructions are 256-bit YMM, worse on Zen 1 or your Gracemont. (A sketch of this variant follows after this list.)
- With AVX-512 (or AVX10 256-bit), vpternlogd is useful, but not much else.
- Compared to not using chtz's trick at all (bcast/cmp/movemask/scan): no vector constants needed at all. Perhaps good as a one-off, like cleanup for an infrequently-run loop. It's 10 uops to get a scalar integer result (starting from data in a YMM vector) vs. 9 for vpunpckl/hbw on an Intel P-core. (7 uops for that to get the index in the bottom byte of an XMM register, or 8 total (including a vpand or vpmovzxbq) to get it zero-extended into a qword at the bottom, where you could vpaddd or vpaddq to accumulate it into an XMM total which you retrieve after your outer loop.)
shift/XOR strategy, fewer shuffles but more uops and constants, possible use-cases on Skylake
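A hedged sketch of that shift/XOR (odd/even) variant, as promised in the list above; again my own reconstruction, so the constants and exact instruction choice may differ from the linked version. It builds the same (inverted data << 8) | index candidates without any byte shuffles, then reduces exactly as in the unpack sketch:

#include <immintrin.h>

int index_of_max_u8_shiftxor(__m256i v) {
    // XOR constants: 0xFF in the high byte flips the data, the low byte supplies the
    // element's index (0,2,...,30 for the even bytes; 1,3,...,31 for the odd bytes).
    const __m256i k_even = _mm256_setr_epi16(
        (short)0xFF00, (short)0xFF02, (short)0xFF04, (short)0xFF06,
        (short)0xFF08, (short)0xFF0A, (short)0xFF0C, (short)0xFF0E,
        (short)0xFF10, (short)0xFF12, (short)0xFF14, (short)0xFF16,
        (short)0xFF18, (short)0xFF1A, (short)0xFF1C, (short)0xFF1E);
    const __m256i k_odd = _mm256_setr_epi16(
        (short)0xFF01, (short)0xFF03, (short)0xFF05, (short)0xFF07,
        (short)0xFF09, (short)0xFF0B, (short)0xFF0D, (short)0xFF0F,
        (short)0xFF11, (short)0xFF13, (short)0xFF15, (short)0xFF17,
        (short)0xFF19, (short)0xFF1B, (short)0xFF1D, (short)0xFF1F);

    __m256i ev = _mm256_slli_epi16(v, 8);                               // even bytes to the top of each u16, low byte = 0
    __m256i od = _mm256_and_si256(v, _mm256_set1_epi16((short)0xFF00)); // odd bytes are already the top; clear the low byte
    __m256i a  = _mm256_xor_si256(ev, k_even);  // (inverted even byte << 8) | its index
    __m256i b  = _mm256_xor_si256(od, k_odd);   // (inverted odd byte << 8) | its index
    __m256i m  = _mm256_min_epu16(a, b);        // 16 u16 candidates
    __m128i m8 = _mm_min_epu16(_mm256_castsi256_si128(m),
                               _mm256_extracti128_si256(m, 1));
    __m128i r  = _mm_minpos_epu16(m8);          // SSE4.1 phminposuw
    return _mm_cvtsi128_si32(r) & 0xFF;         // low byte of the min = index of the max
}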
Earlier version, with bcast / cmp / movemask / tzcnt cleanup
AVX2 basically implies BMI1/2, so requiring it for tzcnt too won't exclude any AVX2 CPUs. Compile with GCC/Clang -march=x86-64-v3 (https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels) or MSVC -arch:AVX2, which actually implies BMI2 as well. Or use std::countr_zero from C++20 #include <bit>, but make sure to enable BMI so compilers can use tzcnt.
I changed the arg type to take a vector by value (in a YMM register). This means the non-inline version needs to shuffle instead of loading the 128-bit halves separately to reduce to 128 bits. This is generally a good convention since callers might pass vectors that aren't already in memory, although you want small functions like this to inline.
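As a concrete illustration of that std::countr_zero cleanup (a sketch, not VCL's actual horizontal_find_first): built with -march=x86-64-v3 this should compile to vpcmpeqb / vpmovmskb / tzcnt.

#include <bit>
#include <immintrin.h>

// Index of the first byte of v equal to the value broadcast in b.
// Assumes a match exists; countr_zero(0) would return 32.
inline int first_equal_byte(__m256i v, __m256i b) {
    unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, b));
    return std::countr_zero(mask);
}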
Godbolt
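A hedged reconstruction of that version (my code, which may differ in detail from what's on Godbolt; the function name is mine): invert, min-reduce to 8 zero-extended u16 elements, phminposuw, then broadcast / compare / movemask / countr_zero.

#include <bit>
#include <immintrin.h>

int index_of_max_u8_bcast(__m256i v) {
    __m256i vinv32 = _mm256_xor_si256(v, _mm256_set1_epi8(-1));       // ~v: the max becomes the min
    __m128i m16 = _mm_min_epu8(_mm256_castsi256_si128(vinv32),
                               _mm256_extracti128_si256(vinv32, 1));  // 32 -> 16 bytes
    // 16 -> 8: min each byte pair.  The high byte of each u16 becomes min(x, 0) = 0,
    // so the 8 results come out already zero-extended to 16 bits, ready for phminposuw.
    __m128i m8 = _mm_min_epu8(m16, _mm_srli_epi16(m16, 8));
    __m128i mp = _mm_minpos_epu16(m8);          // min u16 (= ~max, zero-extended) in the low word
    __m256i b  = _mm256_broadcastb_epi8(mp);    // vpbroadcastb of the low byte, no extra shuffle needed
    unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(vinv32, b));
    return std::countr_zero(mask);              // a match is guaranteed to exist
}

Matching against the already-inverted vinv32 avoids a second NOT, and the set1_epi8(-1) all-ones constant is the hoistable vpcmpeqd ones idiom discussed below.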
According to https://uica.uops.info/, it's 10 uops for the front-end on Skylake (not counting the vpcmpeqd ones idiom that can be hoisted out of loops, and not counting the ret), vs. the VCL version (using tzcnt instead of branching around bsf) being 14 uops.
It has 15 cycle latency from input to the final ymm0 result. (Plus maybe another 4 or 6 cycles for movemask + tzcnt, which is the same for both versions if we improve the VCL cleanup, or even with bsf on Intel.) The VCL version is 16 cycle latency from ymm0 input to the vpcmpeqb result, so my version is better by 1 cycle for latency. (Probably by 2 cycles on Zen-family, where phminposuw is only 3c latency instead of 4 on Intel.)
My version is 10 uops (not counting vpcmpeqd, assuming it was hoisted out of a loop) vs. 14 for the VCL version (assuming it's changed to use tzcnt). 6 of the uops for the VCL version can only run on port 5 on Skylake, so a loop doing only this would bottleneck on that for throughput. On Ice Lake they're fairly evenly distributed since the shuffles can run on port 1 or 5.
Machine-code size for my version is smaller by 13 bytes (counting the vpcmpeqd set1(-1) since it's part of static code size); smaller is generally good for I-cache density. (The VCL shuffles could have avoided immediate operands for some of them, saving a couple of bytes.)
This worked out better than I thought it might. I was worried we were going to have to invert again after phminposuw to recover the actual max element, but matching against the transformed vinv32 avoids that. (If you do want the value as well, uint8_t min = ~_mm_cvtsi128_si32(minpos_result); is movd + not, and maybe a movzx if you need it as 32-bit.) And phminposuw leaves the value at the bottom of the register, position above that, so a broadcast to 256 worked without an extra shift or shuffle.
Also, at first I thought we might need an AND or something to zero-extend to 16-bit, or that I'd have to shift left so the value we wanted was in the most-significant half of the u16 elements (so low garbage wouldn't matter). But again, then we'd have needed to broadcast a byte that wasn't the lowest, which would have required a shift + vpbroadcast, or AVX-512 vpermb.
VCL uses bsf!?!?
The most obvious thing to improve on is VCL's insane use of inline asm for a legacy bsf instruction (slow on AMD), instead of using GNU C __builtin_ctz on non-MSVC. That can compile to tzcnt on CPUs that have it (all CPUs with AVX2 have BMI1/2). Using tzcnt is fine even on CPUs that don't have it, since unlike lzcnt, the tzcnt result is compatible with bsf for all cases where __builtin_ctz has a well-defined result (when the input is non-zero), and tzcnt is rep bsf, which runs as bsf on older CPUs. (There's some possible justification for using inline asm for bsr in bit_scan_reverse, but it's probably still better to use 63 - or 31 - __builtin_clz there.)
And we know there will be a match, so horizontal_find_first's if (bits == 0) return -1; is totally useless even if we were using bsf instead of tzcnt. (tzcnt produces 32 or 64 in that case, the operand-size.)
Avoiding the broadcast: stay 256-bit the whole time?
This isn't better for u8 or u16 elements. (Also, chtz's idea of unpacking with indices made it a moot point.)
If you weren't using vphminposuw, you might avoid the vpbroadcastb by doing all your shuffles at 256-bit, so you finish with the max already broadcast, e.g. vperm2i128 instead of vextracti128. But I'm not sure how you swap bytes within words or dwords without needing a control vector for vpshufb. Perhaps vpalignr ymm, same, same, imm8 with 8 / 4 / 2 / 1? It's port 5 only even on Ice Lake, and higher latency (2 cycles) on Zen 4 vs. vpshufd. And the first two shuffles (by 8 and 4) can be done with vpshufd, which can run on port 1 or 5 on Ice Lake and later, for a better balance of uops across execution ports, which is generally good unless the surrounding code under-uses port 5.
But this strategy is worse for throughput on Zen 1 and Alder Lake E-cores (Gracemont), since all the 256-bit ops will be 2 uops. They're wide enough that it's OK for latency; you don't get resource conflicts.
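For what it's worth, a sketch of that all-256-bit reduction under those assumptions (vpermq for the lane swap, vpshufd for the 8- and 4-byte steps, vpalignr rotates for the 2- and 1-byte steps, which only works because all copies within a lane are already equal by then). This is my illustration of the idea, not code from the answer, and as noted above it isn't actually better for u8 elements:

#include <bit>
#include <immintrin.h>

// Every byte of m ends up holding the global max, so no vpbroadcastb is needed.
int index_of_max_u8_all256(__m256i v) {
    __m256i m = v;
    m = _mm256_max_epu8(m, _mm256_permute4x64_epi64(m, _MM_SHUFFLE(1, 0, 3, 2))); // swap 128-bit halves (vpermq)
    m = _mm256_max_epu8(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(1, 0, 3, 2)));     // swap qwords within each lane
    m = _mm256_max_epu8(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(2, 3, 0, 1)));     // swap adjacent dwords
    m = _mm256_max_epu8(m, _mm256_alignr_epi8(m, m, 2));  // rotate each lane by 2: swap words (dwords already equal)
    m = _mm256_max_epu8(m, _mm256_alignr_epi8(m, m, 1));  // rotate each lane by 1: swap bytes (words already equal)
    unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, m));
    return std::countr_zero(mask);   // a match always exists
}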