I've run the same binaries compiled with gcc-13 (https://godbolt.org/z/qq5WrE8qx) on an Intel i3-N305 (3.8 GHz) PC and an AMD Ryzen 7 3800X (3.9 GHz) PC. The code uses the VCL library (https://github.com/vectorclass/version2):
int loop_vc_nested(const array<uint8_t, H*W> &img, const array<Vec32uc, 8> &idx) {
  int sum = 0;
  Vec32uc vMax, iMax, vCurr, iCurr;

  for (int i = 0; i < H*W; i += W) {
    iMax.load(&idx[0]);
    vMax.load(&img[i]);

    for (int j = 1; j < 8; j++) {
      iCurr.load(&idx[j]);
      vCurr.load(&img[i+j*32]);
      iMax = select(vCurr > vMax, iCurr, iMax);
      vMax = max(vMax, vCurr);
    }

    Vec32uc vMaxAll{horizontal_max(vMax)};
    sum += iMax[horizontal_find_first(vMax == vMaxAll)];
  }

  return sum;
}
The full benchmark source is here: https://github.com/pauljurczak/simd-benchmarks/blob/main/main-5-vcl-eve.cpp. Here are the timings:
Ubuntu 22.04.3 LTS on AMD Ryzen 7 3800X 8-Core Processor
gcc v13.1 __cplusplus=202100
loop_vc_nested(): 3.597 3.777 [us] 108834
Ubuntu 23.10 on Intel(R) Core(TM) i3-N305
gcc v13.1 __cplusplus=202100
loop_vc_nested(): 11.804 11.922 [us] 108834
There is an unexpected slowdown of 3.2x. AFAIK, these CPUs have similar SIMD capabilities for a single-threaded program, and their performance on the 7-zip benchmark is very close. Why such a big gap?
Here is an output from perf. AMD Ryzen 7 3800X:
3,841.61 msec task-clock # 1.000 CPUs utilized
20 context-switches # 5.206 /sec
0 cpu-migrations # 0.000 /sec
2,191 page-faults # 570.333 /sec
14,909,837,582 cycles # 3.881 GHz (83.34%)
3,509,824 stalled-cycles-frontend # 0.02% frontend cycles idle (83.34%)
9,865,497,290 stalled-cycles-backend # 66.17% backend cycles idle (83.34%)
42,856,816,868 instructions # 2.87 insn per cycle
# 0.23 stalled cycles per insn (83.34%)
1,718,672,677 branches # 447.383 M/sec (83.34%)
2,409,251 branch-misses # 0.14% of all branches (83.29%)
Intel i3-N305:
12,015.18 msec task-clock # 1.000 CPUs utilized
57 context-switches # 4.744 /sec
0 cpu-migrations # 0.000 /sec
2,196 page-faults # 182.769 /sec
45,432,594,158 cycles # 3.781 GHz (74.97%)
42,847,054,707 instructions # 0.94 insn per cycle (87.48%)
1,714,003,765 branches # 142.653 M/sec (87.48%)
4,254,872 branch-misses # 0.25% of all branches (87.51%)
TopdownL1 # 0.2 % tma_bad_speculation
# 45.5 % tma_retiring (87.52%)
# 53.8 % tma_backend_bound
# 53.8 % tma_backend_bound_aux
# 0.5 % tma_frontend_bound (87.52%)
Compiler options: -O3 -Wno-narrowing -ffast-math -fno-trapping-math -fno-math-errno -ffinite-math-only -march=alderlake
Additional cache use information from perf stat -d on i3-N305:
15,615,324,576 L1-dcache-loads # 1.294 G/sec (54.50%)
<not supported> L1-dcache-load-misses
60,909 LLC-loads # 5.048 K/sec (54.50%)
5,231 LLC-load-misses # 8.59% of all L1-icache accesses (54.50%)
I installed the newest Intel C++ compiler in order to get -march=gracemont working. Performance did not improve, since the Intel compiler is based on clang, which performs worse than gcc on this benchmark. Here are the timings:
Ubuntu 23.10 on Intel(R) Core(TM) i3-N305
clang v17.0.0 (icx 2024.0.2.20231213) C++
loop_vc_nested(): 12.311 12.397 [us] 108834 # -march=native
loop_vc_nested(): 12.773 12.847 [us] 108834 # -march=alderlake
loop_vc_nested(): 12.418 12.519 [us] 108834 # -march=gracemont
loop_vc_unrolled(): 10.388 12.406 [us] 108834 # -march=gracemont
loop_vc_nested_noselect_2chains(): 6.686 10.454 [us] 109599 # -march=gracemont
The AVX encoding of vpblendvb has 4 operands (3 sources and a separate destination), and is multi-uop even on Intel P-cores (unlike the legacy-SSE 128-bit encoding), but is single-uop on Zen. A different algorithm can avoid it.

Alder Lake E-cores (Gracemont) are 5-wide out-of-order with reasonable out-of-order exec capability, but they're not great at 256-bit SIMD in general, and they choke badly on 8-uop vpblendvb ymm in particular, including what looks like a front-end bottleneck. Your inner loop uses it as every 4th instruction in a dependency chain (short enough for OoO exec to maybe partly hide, so we might just be seeing the effects of the back-end throughput or front-end bottleneck).

Your implementation strategy / algorithm is something Zen 2 is great at but which is a stumbling block for Gracemont, amplifying the difference between 256-bit and 128-bit SIMD execution units.
Your i3-N305 is Alder Lake-N series. Like earlier Celeron / Pentium CPUs with an N in their model number, the cores are low-power Silvermont-family; in this case Gracemont, the E-cores found in full Alder Lake chips (which are significantly beefier than Tremont, or especially earlier generations like Goldmont Plus). And it has AVX2+FMA, which I guess is what justifies selling it as an i3.
https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/ is a good deep dive on the CPU microarchitecture, with some comparisons to Zen 2, and microbenchmarks of cache bandwidth and latency (as part of an i9-12900K; IDK if the interconnect or L3 would be different in an i3-N series, but your benchmark fits in its 2M L2 cache, and with a single core active, read bandwidth from L2 is about the same as from L1d for sequential access). It doesn't mention how the decoders handle instructions that are more than 3 uops, but it does have a diagram showing the pair of 3-wide decode clusters. (If it's like previous Intel, any instruction of more than 1 uop can only decode in the first decoder of a cluster, so that probably limits front-end throughput to two YMM vector instructions per clock even if they're the minimum 2 uops.)
Your Ryzen 3800X is a Zen 2, a full-fledged big core with good 256-bit SIMD load and ALU throughput (up from 128-bit in Zen 1, Ryzen 1xxx and 2xxx series). And single-uop vpblendvb.

The most important factors are:
Vector ALU and memory ports are 128-bit wide, and every 256-bit instruction decodes to (at least) 2 uops, except a few like vextracti128 and vpmovmskb. (So it's like Zen 1 and Bulldozer-family.) So uops per clock is about twice the IPC when running code that's mostly vector instructions with a bit of scalar overhead, and 2/clock load bandwidth only goes half as far when each load is only 128-bit.
That select compiles to a vpblendvb. Unfortunately that's very slow on Gracemont, see https://uops.info/ - VEX encodings of variable blends are 4 uops per 128-bit lane, so the YMM version is 8 uops with a measured throughput of one per 3.86 cycles. (Or 3.2 cycles with a memory source instead of a register, surprisingly.) The Zen family runs the 4-operand vpblendvb as a single uop (with a choice of ports, even).

The legacy-SSE encoding only has 3 operands, one of them implicitly XMM0, and Gracemont runs that as a single uop. Even Alder Lake P-cores run vpblendvb x/ymm as 3 uops, up from 2 in Ice Lake, while SSE4.1 pblendvb xmm, xmm is single-uop on modern Intel P-cores, too.
Gracemont vpblendvb ymm also has 6 to 7 cycle latency, or 5c for the XMM version (vs. 2 to 3 on P-cores), depending on whether the data or the control input is the critical path, vs. 1 cycle on Zen. That's even worse than its throughput, even with the front-end bottleneck. Out-of-order exec buffers (scheduler and ROB) are probably big enough to hide this over a chain of 7 of them, since you start a new dep chain every 256 bytes, but it's not great and would be a bottleneck in a loop that ran more iterations.

It seems Intel goofed when designing the AVX1 encoding of it (with a 4th register number in an immediate byte!) while Sandybridge-family was still being designed, not anticipating that their later CPUs would be able to handle 3-operand instructions as a single uop (motivated by FMA in Haswell, but benefiting others in Broadwell and later), and that mov-elimination would remove the back-end execution-port cost of copying a register if needed (unlike here) when the original value is still wanted after an instruction that modifies an R+W destination in-place. FMA3 and later 3-input instructions like AVX-512 vpternlogd and vpermi/t2d have an R+W source/destination as the first operand. (kmask inputs to AVX-512 instructions are a separate forwarding network and a separate domain to track dependencies in, so they don't count.)

8 uops inherently contributes to low IPC for the same uops/clock throughput, but probably also stalls the front-end some, reducing uops/clock. Even Gracemont's 4-uop vpblendvb xmm has about the same bad throughput when running just that back-to-back, which is consistent with some kind of decode stall or having to switch to a microcode ROM for >3-uop instructions.
You could try to blend manually with _mm256_and_si256 / _mm256_andnot_si256 / _mm256_or_si256, which would be 6 uops but avoids front-end stalls, for a total throughput cost of 1.33 cycles on the vector ALU ports. But clang will "optimize" those intrinsics back into a vpblendvb since it knows the blend control is a compare result, with all bits of each byte matching the sign bit.
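A manual blend is just three bitwise ops. As a minimal sketch with raw AVX2 intrinsics (the helper name is mine; as noted, clang may fuse this right back into vpblendvb):

#include <immintrin.h>

// (b & mask) | (a & ~mask): take b where mask is set, else a.
// mask is assumed to be a byte-compare result, i.e. 0x00 or 0xFF in every byte.
static inline __m256i blend_epi8_manual(__m256i a, __m256i b, __m256i mask)
{
    return _mm256_or_si256(_mm256_and_si256(mask, b),      // vpand
                           _mm256_andnot_si256(mask, a));  // vpandn, then vpor
}

VCL's Vec32uc should convert implicitly to and from __m256i, so something like this could stand in for the select, at least with compilers that don't re-recognize the idiom.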
Clang trunk's -mtune=gracemont or -march=gracemont doesn't know it's slow on that uarch, or at least doesn't split select into those. MSVC, or classic ICC, are a lot more literal about intrinsics. GCC does optimize some, but in this case it does use actual vpand / vpandn / vpor instructions (https://godbolt.org/z/3fc1jo9r4), so you could make a version that's worse on Ryzen, less bad on Gracemont, but not optimal anywhere. I think it's still worse on Gracemont than the noselect version below.

Your original is fairly good for Ryzen, but there's room for improvement in the cleanup, and maybe in scanning backwards to avoid inverting the compare to feed the blend. Or the branchy strategy might be best if an instance of the max element is often seen within the first 64 bytes so it's predictable: just load + 7x vpmaxub ymm, mem, then reduce and scan.

Avoiding variable-blend
Your actual problem could be done other ways, for example unpacking your data with indices as chtz suggested in "Looking for an efficient function to find an index of max element in SIMD vector using a library", so the max u16 element contains both the data and the index. (And instead of loading, the index can come from idx = _mm256_add_epi8(idx, _mm256_set1_epi8(32));. Or maybe that inner loop over 256 bytes can get fully unrolled so you have 8 registers holding index data.)

Since you'd probably want to use that improved reduction anyway, unpacking even earlier saves some cleanup work, and your loop is only 8 vectors.
For a sum of indices, I guess it's important that you get the first occurrence of a match? So you'd want to invert your indices, so that the max of data:index packed as a u16 picks the earlier index when it's a tie-break between equal data. That's what we want anyway for a cleanup that's going to use vphminposuw.

This is what it might look like, without being clever about the indices, so it might be taking the last one:
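Roughly, as an illustrative sketch with raw AVX2 intrinsics (untested, mine rather than the exact code; it returns the byte offset 0..255 of a max element within one 256-byte row, taking a later element on ties):

#include <immintrin.h>
#include <cstdint>

static inline int max_index_row256(const uint8_t *row)
{
    __m256i idx = _mm256_setr_epi8(              // byte index of each element, 0..31
        0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
        16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31);
    __m256i max_lo = _mm256_setzero_si256();
    __m256i max_hi = _mm256_setzero_si256();

    for (int j = 0; j < 8; j++) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(row + 32*j));
        // Pack each element as (data << 8) | index, so vpmaxuw compares the data
        // byte first and only uses the index byte as a tie-break.
        max_lo = _mm256_max_epu16(max_lo, _mm256_unpacklo_epi8(idx, v));
        max_hi = _mm256_max_epu16(max_hi, _mm256_unpackhi_epi8(idx, v));
        idx = _mm256_add_epi8(idx, _mm256_set1_epi8(32));   // indices of the next 32 bytes
    }

    // Horizontal max of the u16 candidates; the low byte of the winner is its index.
    __m256i m = _mm256_max_epu16(max_lo, max_hi);
    __m128i m128 = _mm_max_epu16(_mm256_castsi256_si128(m),
                                 _mm256_extracti128_si256(m, 1));
    m128 = _mm_max_epu16(m128, _mm_shuffle_epi32(m128, _MM_SHUFFLE(1,0,3,2)));
    m128 = _mm_max_epu16(m128, _mm_shuffle_epi32(m128, _MM_SHUFFLE(2,3,0,1)));
    m128 = _mm_max_epu16(m128, _mm_shufflelo_epi16(m128, _MM_SHUFFLE(2,3,0,1)));
    return _mm_cvtsi128_si32(m128) & 0xFF;
}

The unpack against idx is what lets vpmaxuw do both jobs at once: the data in the high byte decides the max, and the index rides along in the low byte.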
Instead of loading indices, you could maybe just compute them with _mm256_sub_epi8(idx, _mm256_set1_epi8(-1)) (or add to go in descending order down from 255), although compilers will probably constant-propagate through that and make 8 vectors of constants, and the RIP-relative addressing mode to load those is larger code-size than [rsi+disp8] for the first 5 loads, but that's just the startup code. After the compiler's done unrolling, you definitely want it to have 8 vectors of indices that it generates once ahead of the loop.

Godbolt: GCC -O3 -march=alderlake fully unrolls, loading all 8 index vectors before the outer loop and using them from registers. (Same in the original version.) In the inner loop, notice that it uses the same memory source operand twice to save front-end bandwidth at the cost of more back-end uops. This is actually ok on Gracemont as well as Alder Lake; vpunpckl/hbw is 2 front-end uops with or without a memory source operand, with 1.0 vs. 0.66 cycle throughput, but with separate loads I think the front-end would be a worse bottleneck, depending how fast it can decode 2-uop instructions. And the vpmaxuw per unpack is extra vector ALU work to keep ports busy so it doesn't bottleneck on loads.

Clang -mtune=gracemont chooses differently, but it doesn't load twice even when tuning for Alder Lake / Ice Lake.

https://uica.uops.info/ predicts Ice Lake could run it at 14 cycles per iteration, vs. 17 for the vpblendvb version. And that's nearly bottlenecked on vector ALU ports, so Alder Lake would be even worse with the vpblendvb version. I haven't analyzed it by hand for Gracemont, or tried LLVM-MCA, which might have a Gracemont model.
I also haven't looked at optimizing it to use vphminposuw as part of the cleanup, which would save even more, helping pay for the extra shuffle work we're doing per vector.

Or consider a branchy strategy, like finding the max and then searching the array for the first match (compare/movemask aka to_bits(curr == bcast_max), and if non-zero, return tzcnt(mask)). You never need to load vectors of index data, and an early match reduces the amount of work. (But it can mispredict, which might be much worse; still worth a try. But usefully microbenchmarking things that depend on correct branch prediction is hard: a microbenchmark can learn a pattern, or if you make it totally random, it predicts worse than real data distributions.)

With only 8 vectors of data, that second-pass loop can be fully unrolled with no loads. The first pass can leave the data in registers. (But it would have to be fully unrolled, too, perhaps checking a pair of ymm regs at a time for a match, with shift/or and a 64-bit tzcnt. vpmovmskb r32, ymm is single-uop on Gracemont.) And it would mean separate load + max instructions in the first pass, not memory-source. Gracemont doesn't have a uop cache, but apparently its decoders manage ok for throughput. Perhaps not wonderfully with back-to-back 2-uop instructions.

(This is basically the same strategy your current cleanup is using, find the max then search for its position, but across the whole 8-vector array. Allowing reduction to 128-bit for most of the horizontal-max work between the first and second pass is nice.)
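A sketch of that two-pass branchy version for one row, using the VCL names from the question (untested, and without the pair-of-ymm movemask combining):

#include <cstdint>
#include "vectorclass.h"   // Vec32uc etc., as in the question

static inline int first_max_index_branchy(const uint8_t *row)
{
    Vec32uc v[8], vMax;
    vMax.load(row);
    v[0] = vMax;
    for (int j = 1; j < 8; j++) {        // pass 1: just loads + vpmaxub, no blends
        v[j].load(row + j*32);
        vMax = max(vMax, v[j]);
    }

    Vec32uc bcastMax{horizontal_max(vMax)};
    for (int j = 0; j < 8; j++) {        // pass 2: data is still in registers
        uint32_t mask = to_bits(v[j] == bcastMax);   // vpcmpeqb + vpmovmskb
        if (mask != 0)                               // hopefully predicts well on real data
            return j*32 + __builtin_ctz(mask);       // tzcnt: first matching byte
    }
    return -1;                           // not reached: the max always exists
}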
Commented version of your original, looking at how it compiled to asm:
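(Reconstructed below from the code in the question; the asm notes in the comments are from looking at the GCC output, approximate rather than verbatim.)

int loop_vc_nested(const array<uint8_t, H*W> &img, const array<Vec32uc, 8> &idx) {
  int sum = 0;
  Vec32uc vMax, iMax, vCurr, iCurr;

  for (int i = 0; i < H*W; i += W) {
    iMax.load(&idx[0]);                  // index vectors get hoisted into registers
    vMax.load(&img[i]);                  // vmovdqu

    for (int j = 1; j < 8; j++) {        // fully unrolled by GCC
      iCurr.load(&idx[j]);
      vCurr.load(&img[i+j*32]);          // vmovdqu load

      // Unsigned byte > has no single AVX2 instruction: VCL does it as
      // vpmaxub + vpcmpeqb (giving vMax >= vCurr), then vpxor with
      // ymm1 = set1(-1) to invert it into vCurr > vMax.
      // Using vCurr == max(vMax, vCurr) instead would skip the vpxor,
      // but changes which element wins a tie (see below).
      iMax = select(vCurr > vMax, iCurr, iMax);   // vpblendvb: the expensive part on Gracemont

      vMax = max(vMax, vCurr);           // vpmaxub (the compiler can reuse the max from the compare)
    }

    Vec32uc vMaxAll{horizontal_max(vMax)};                 // reduce + broadcast the row max
    sum += iMax[horizontal_find_first(vMax == vMaxAll)];   // vpcmpeqb + vpmovmskb + tzcnt, then extract
  }

  return sum;
}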
which compiles to code that loads the first 4 vectors early, then does some processing, then loads more as it goes. ymm1 = set1(-1), and XOR with it does a NOT of the compare result.
As mentioned in the comments I added, saving an instruction around the blend (to get the opposite condition) could be done with curr == max(vmax, curr), but that's true on a tie when your condition isn't. Looping backward could fix that, but might be harder for the prefetchers. (In asm at least, you could load all 8 vectors in forward order, or one from each cache line, but process them backwards. That makes out-of-order exec work even harder to hide load latency, assuming prefetch keeps streaming in order.)
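For example, a sketch of that backward variant with the question's VCL code (untested; the cleanup is the same as the original):

for (int i = 0; i < H*W; i += W) {
  iMax.load(&idx[7]);                    // start from the last 32-byte chunk of the row
  vMax.load(&img[i + 7*32]);

  for (int j = 6; j >= 0; j--) {         // walk backwards over the other 7 chunks
    iCurr.load(&idx[j]);
    vCurr.load(&img[i + j*32]);
    // vCurr >= vMax, written as vCurr == max(vMax, vCurr):
    // vpmaxub + vpcmpeqb, no vpxor to invert, and an earlier chunk wins ties.
    iMax = select(vCurr == max(vMax, vCurr), iCurr, iMax);
    vMax = max(vMax, vCurr);
  }

  Vec32uc vMaxAll{horizontal_max(vMax)};
  sum += iMax[horizontal_find_first(vMax == vMaxAll)];
}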