An avx512 vector can hold 64 int8 values. I would like to do something like the following:
- load 16 contiguous values from memory location a, say they are 1
- load 16 contiguous values from memory location b, say they are 2
- load 16 contiguous values from memory location c, say they are 3
- load 16 contiguous values from memory location d, say they are 4
- produce an avx512 vector with the following pattern: 123412341234...1234.
Note: the 16 values from each load are not actually expected to all be the same; the digits 1/2/3/4 above just label which load each byte comes from.
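In scalar terms, the result I'm after is just this (a plain-C reference, for illustration only):

```c
#include <stdint.h>

// Reference: out = a0 b0 c0 d0 a1 b1 c1 d1 ... a15 b15 c15 d15
void interleave4x16_ref(uint8_t out[64], const uint8_t a[16], const uint8_t b[16],
                        const uint8_t c[16], const uint8_t d[16])
{
    for (int i = 0; i < 16; i++) {
        out[4*i + 0] = a[i];
        out[4*i + 1] = b[i];
        out[4*i + 2] = c[i];
        out[4*i + 3] = d[i];
    }
}
```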
I know how to do this functionally through loads then shuffles. However, I would like to know the most effective way to do this in terms of the number of registers used and the expected throughput.
Perhaps there are some weird instructions optimized for this purpose.
Thanks!
Since you mention throughput as a major concern, minimizing back-end uops for the shuffle port would be a good idea, and/or minimizing total front-end uops. (see this re: perf analysis). The overall bottleneck will depend on surrounding code.
I think the best strategy is to get all the data into the right 128-bit chunks (lanes) of one vector efficiently, then fix that up with a `vpshufb` (`_mm512_shuffle_epi8`).

Normal 128-bit lane-insert loads (`vinserti128 ymm, ymm, mem, imm`) need 2 uops per instruction: load and merge, but the ALU part can run on any port on Skylake-X, p015, not just the shuffle unit on port 5. (Or with the vector ALU unit on port 1 shut down because of 512-bit uops in flight, just p05.) See https://uops.info/ and https://agner.org/optimize/.

Unfortunately, `vinserti128` does not micro-fuse, so both uops have to go through the front-end separately (see footnote 1).
However, `vbroadcasti32x4 ymm{K}, [mem]` does micro-fuse (RETIRE_SLOTS: 1.0), so we can do a 1-fused-domain-uop insert via a merge-masked broadcast-load. The merge-masking does require an ALU uop, apparently able to run on p015*. (It's pretty dumb that memory-source `vinserti128` can't just decode this way to 1 uop, but this does require a mask register prepared ahead of time.)

(*: The uops.info detailed results strangely show none of the uops actually running on port 0, but a ZMM version does. If testing shows that the ymm version (with 512-bit uops in flight) actually only runs on p5, then I guess do a broadcast-load into ZMM registers with a `0x00f0` merge-mask.)

I'd suggest something like the sketch below, if you can hoist the loads of 2 shuffle-control vectors and the setup of the mask register.
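Here is a rough C-intrinsics sketch of that idea (my own rendering, not verified compiler output; the helper name `interleave4x16` and the exact constant layout are illustrative). It assumes Skylake-X-level AVX-512F/BW/VL and that the combining step is a register-source `vpermt2d` followed by an in-lane `vpshufb`, as discussed around the shuffle constants below:

```c
#include <immintrin.h>

// Sketch: result byte order a0 b0 c0 d0 a1 b1 c1 d1 ... a15 b15 c15 d15
__m512i interleave4x16(const void *a, const void *b,
                       const void *c, const void *d)
{
    // Hoist these out of the loop in real code: one mask + two shuffle controls.
    const __mmask8 hi_lane = 0xF0;   // dwords 4..7 = high 128-bit lane of a YMM
    // vpermt2d control, element order (low..high): 0,4,16,20, 1,5,17,21, ...
    // lane j gathers dword j of a (0..3), b (4..7), c (16..19), d (20..23).
    const __m512i dword_interleave = _mm512_set_epi32(
        23, 19, 7, 3,  22, 18, 6, 2,  21, 17, 5, 1,  20, 16, 4, 0);
    // vpshufb control: 4x4 byte transpose within each 128-bit lane.
    const __m512i lane_transpose = _mm512_broadcast_i32x4(_mm_set_epi8(
        15, 11, 7, 3,  14, 10, 6, 2,  13, 9, 5, 1,  12, 8, 4, 0));

    // [a | b] in one YMM and [c | d] in another, using merge-masked broadcast
    // loads as 1-fused-uop 128-bit inserts.
    __m256i ab = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)a));
    ab = _mm256_mask_broadcast_i32x4(ab, hi_lane, _mm_loadu_si128((const __m128i *)b));
    __m256i cd = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)c));
    cd = _mm256_mask_broadcast_i32x4(cd, hi_lane, _mm_loadu_si128((const __m128i *)d));

    // Gather the matching dwords of a, b, c, d into each 128-bit lane, then
    // transpose bytes within each lane.
    __m512i v = _mm512_permutex2var_epi32(_mm512_castsi256_si512(ab),
                                          dword_interleave,
                                          _mm512_castsi256_si512(cd));
    return _mm512_shuffle_epi8(v, lane_transpose);
}
```

With the mask and the two control vectors hoisted, the loop body is 4 loads (2 of them merge-masked broadcasts), one `vpermt2d`, and one `vpshufb`: the 6 fused-domain uops counted in the perf analysis below.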
`[a]` and `[c]` can be any addressing mode, but an indexed addressing mode like `[rdi + rcx]` may defeat micro-fusion of the broadcast and make it un-laminate. (Or maybe not, if it counts as a 2-operand instruction like `add eax, [rdi + rcx]` and thus can stay micro-fused in the back-end on Haswell/Skylake.)

If you want to avoid vzeroupper after the loop, you could use xmm/ymm/zmm16 and 17 or something, in which case you'd want `vmovdqu32 xmm20, [a]`, which takes more code-size than a VEX-encoded `vmovdqu`.
Shuffle constants:

If we were shuffling one ZMM with `vpermd` then `vpshufb` (after 3x insert, see below), I think it would be the same constant expanded 2 different ways (widen bytes to dwords, or repeat 4 times), doing the same shuffle to 16 dwords in a ZMM and then to 16 bytes in each lane. So you'd save space in .rodata.
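For illustration (my own sketch; the array and function names are made up), that shared 16-byte constant and its two expansions could look like this:

```c
#include <immintrin.h>
#include <stdint.h>

// One 16-byte "4x4 transpose" pattern serves both shuffles of the
// 3x-insert + vpermd + vpshufb variant, expanded two different ways.
static const uint8_t k4x4transpose[16] = {
    0, 4, 8, 12,  1, 5, 9, 13,  2, 6, 10, 14,  3, 7, 11, 15
};

static inline void load_vpermd_vpshufb_controls(__m512i *permd_ctrl, __m512i *pshufb_ctrl)
{
    __m128i k = _mm_loadu_si128((const __m128i *)k4x4transpose);
    *permd_ctrl  = _mm512_cvtepu8_epi32(k);    // vpmovzxbd: widen bytes to dwords
    *pshufb_ctrl = _mm512_broadcast_i32x4(k);  // repeat the 16-byte pattern 4 times
}
```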
(You can load in any order: if you have reason to expect that 2 of the sources will be ready first (store forwarding, a cache hit being more likely, or the load address being ready first), you could use them as the source for the `vmovdqu` loads. Or pair them so the merge uop can execute and make room in the RS aka scheduler sooner. I paired them this way to make the shuffle control constants more human-friendly.)

If this isn't in a loop (so you can't hoist the constant setup), it's not worth spending 2 uops to set up `k1`; just use `vinserti128 ymm0, ymm0, [b], 1` and the same for `ymm1, [d]` (see the sketch at the end of this paragraph). (2 uops each, not micro-fused, p23 + p015.) Also, the `vpshufb` control vector can be a 64-byte memory source operand. A different strategy using `vpunpcklbw`/`hbw` and inserts (@EOF's comment) might be worth considering if you want to avoid loading any constants, but that would be more shuffles. Or possibly `vpmovzxbd` loads + shift/merge?
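As a sketch of that no-mask-register fallback (my own illustration; it takes the same two control vectors as the loop version so it stays self-contained):

```c
#include <immintrin.h>

// Non-loop variant (sketch): plain 128-bit inserts instead of the merge-masked
// broadcast loads, so no mask register setup is needed (2 uops each, not micro-fused).
__m512i interleave4x16_once(const void *a, const void *b,
                            const void *c, const void *d,
                            __m512i dword_interleave, __m512i lane_transpose)
{
    __m256i ab = _mm256_inserti128_si256(
            _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)a)),
            _mm_loadu_si128((const __m128i *)b), 1);
    __m256i cd = _mm256_inserti128_si256(
            _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)c)),
            _mm_loadu_si128((const __m128i *)d), 1);

    // Same combine as the loop version: vpermt2d then in-lane vpshufb.
    __m512i v = _mm512_permutex2var_epi32(_mm512_castsi256_si512(ab),
                                          dword_interleave,
                                          _mm512_castsi256_si512(cd));
    return _mm512_shuffle_epi8(v, lane_transpose);
}
```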
Perf analysis

- total front-end cost: 6 uops (1.5 clock cycles on SKX). Down from 8 uops / 2 cycles with `vinserti128`.
- total back-end cost: minimum of 2 cycles per result: 4 loads for p23, plus 2 merge uops and 2 shuffles for p0 / p5 (the shuffles can only run on p5). (Port 1 can't run vector-ALU uops while 512-bit uops are in flight, but it can still run `imul`, `lea`, and simple-integer stuff.)

(Any cache misses will result in the merge uops having to replay when the data does arrive.)
Running just this back-to-back will bottleneck on back-end throughput for ports 2/3 (loads) and 0, 5 (vector ALU). There's some room to squeeze some more uops through the front-end, like storing this somewhere and/or some loop overhead that runs on other ports. Or for less-than-perfect front-end throughput. Vector ALU work will contribute to the p0 / p5 bottleneck.
With intrinsics, clang's shuffle optimizer might turn the masked broadcast into `vinserti128`, but hopefully not. And GCC probably wouldn't spot that deoptimization. You didn't say what language you were using, and mentioned registers, so I'll just use asm in the answers. Easy enough to translate to C intrinsics, maybe C# SIMD stuff, or whatever other language you're actually using. (Hand-written asm is usually not necessary or worth it in production code, especially if you want portability to other compilers.)
It would also be possible to do one `vmovdqu`, one `vinserti128 ymm`, and 2x `vinserti32x4 zmm`. (Or equivalent 1-uop merge-masking broadcast loads.) But that would have worse ILP for merging, and we'd still need a `vpermd` + `vpshufb` because `vpermb` requires AVX-512VBMI (Ice Lake, not Skylake-X).
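A sketch of that 3x-insert variant without VBMI (my own code; it rebuilds both controls from the shared 16-byte pattern discussed under "Shuffle constants"):

```c
#include <immintrin.h>

// 3x-insert variant (no AVX-512VBMI): build [a | b | c | d] in one ZMM, then
// vpermd + vpshufb.  Note the serial dependency chain through the inserts.
__m512i interleave4x16_3insert(const void *a, const void *b,
                               const void *c, const void *d)
{
    const __m128i pat = _mm_set_epi8(15, 11, 7, 3,  14, 10, 6, 2,
                                     13,  9, 5, 1,  12,  8, 4, 0);
    const __m512i permd_ctrl  = _mm512_cvtepu8_epi32(pat);   // dword indices
    const __m512i pshufb_ctrl = _mm512_broadcast_i32x4(pat); // per-lane byte indices

    __m512i v = _mm512_castsi256_si512(_mm256_inserti128_si256(
            _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)a)),
            _mm_loadu_si128((const __m128i *)b), 1));
    v = _mm512_inserti32x4(v, _mm_loadu_si128((const __m128i *)c), 2);
    v = _mm512_inserti32x4(v, _mm_loadu_si128((const __m128i *)d), 3);

    v = _mm512_permutexvar_epi32(permd_ctrl, v);  // vpermd: matching dwords into each lane
    return _mm512_shuffle_epi8(v, pshufb_ctrl);   // vpshufb: 4x4 byte transpose per lane
}
```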
However, if you do also have AVX512VBMI, `vpermb` is only 1 uop on Ice Lake, so 3x insert + `vpermb` would be ideal for throughput. Doing the inserts with merge-broadcasts would need 2 separate merge masks, `0xf0` (use with ymm 32x4 and zmm 64x2) and `0xf000` (use with zmm 32x4, loading `[d]` last), or some variation on that.
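A possible intrinsics sketch of that VBMI variant (the function name and the index table are mine; the three merge-masked broadcasts use ZMM destinations with exactly the two mask values 0xf0 and 0xf000 described above):

```c
#include <immintrin.h>
#include <stdint.h>

// Ice Lake (AVX-512VBMI + DQ) variant, sketch: build [a | b | c | d] in one ZMM
// with merge-masked broadcast loads, then a single vpermb does the whole interleave.
__m512i interleave4x16_vbmi(const void *a, const void *b,
                            const void *c, const void *d)
{
    // Byte k of the result comes from index (k%4)*16 + k/4 of [a|b|c|d].
    static const int8_t idx_bytes[64] = {
         0,16,32,48,  1,17,33,49,  2,18,34,50,  3,19,35,51,
         4,20,36,52,  5,21,37,53,  6,22,38,54,  7,23,39,55,
         8,24,40,56,  9,25,41,57, 10,26,42,58, 11,27,43,59,
        12,28,44,60, 13,29,45,61, 14,30,46,62, 15,31,47,63
    };
    const __m512i idx = _mm512_loadu_si512(idx_bytes);

    __m512i v = _mm512_castsi128_si512(_mm_loadu_si128((const __m128i *)a));
    v = _mm512_mask_broadcast_i32x4(v, 0x00F0,
            _mm_loadu_si128((const __m128i *)b));   // dwords 4..7   -> lane 1 = b
    v = _mm512_mask_broadcast_i64x2(v, 0xF0,
            _mm_loadu_si128((const __m128i *)c));   // qwords 4..7   -> lanes 2,3 = c
    v = _mm512_mask_broadcast_i32x4(v, 0xF000,
            _mm_loadu_si128((const __m128i *)d));   // dwords 12..15 -> lane 3 = d

    return _mm512_permutexvar_epi8(idx, v);         // vpermb: 1 uop on Ice Lake
}
```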
Using `vpermt2b` with the parallel-insert setup would be worse: Ice Lake `vpermt2b` costs 3 uops (p05 + 2p5).
The two shuffle constants can be compressed in memory to 16 bytes each: load the `vpermt2d` vector with `vpmovzxbd` to expand bytes to dwords, and load the `vpshufb` control with `VBROADCASTI64X2 zmm1, m128` to repeat the in-lane shuffle vector 4 times. It's probably worth fitting both constants into the same cache line, even though that costs a load+shuffle outside the loop.
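A sketch of what that could look like with intrinsics (the constant values match the earlier illustrative recipe; the array and function names are mine):

```c
#include <immintrin.h>
#include <stdint.h>

// Compressed constants: 16 bytes each in .rodata instead of 64,
// expanded at load time (hoist out of the loop).
static const uint8_t perm2d_bytes[16] = {       // vpermt2d dword indices
    0, 4, 16, 20,  1, 5, 17, 21,  2, 6, 18, 22,  3, 7, 19, 23
};
static const uint8_t pshufb_bytes[16] = {       // per-lane 4x4 byte transpose
    0, 4, 8, 12,  1, 5, 9, 13,  2, 6, 10, 14,  3, 7, 11, 15
};

static inline void load_compressed_controls(__m512i *permctrl, __m512i *shufctrl)
{
    *permctrl = _mm512_cvtepu8_epi32(                   // vpmovzxbd: bytes -> dwords
                    _mm_loadu_si128((const __m128i *)perm2d_bytes));
    *shufctrl = _mm512_broadcast_i64x2(                 // vbroadcasti64x2: repeat 4 times
                    _mm_loadu_si128((const __m128i *)pshufb_bytes));
}
```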
If you implement this with C intrinsics, just use `_mm512_set_epi8/32`; compilers will usually defeat your attempt to be clever by doing constant-propagation. Clang and gcc are sometimes smart enough to compress constants for you, but usually only broadcast-loading, not vpmovzx.
Footnote 1: Agner Fog's instruction tables indicate that `VINSERTI32x4 z,z,m,i` can micro-fuse (1 front-end uop), but uops.info's mechanical testing results disagree: RETIRE_SLOTS: 2.0 matches UOPS_EXECUTED.THREAD: 2.0. Probably a typo in Agner's table; it's normal that memory-source instructions with an immediate don't micro-fuse.

(Also possible that it micro-fuses in the decoders and uop cache but not in the back-end; Agner's testing for micro-fusion is I think based on the uop cache, not the issue/rename bottleneck or perf counters. RETIRE_SLOTS counts fused-domain uops in the out-of-order back-end, after possible un-lamination before/during issue/rename.)
But anyway, VINSERTI32x4 definitely doesn't help for the issue/rename bottleneck which is more often significant in tight loops. And I doubt that it actually micro-fuses even in the decoders/uop-cache. Agner's tables unfortunately do have typos.
Alternate strategy: `vpermt2d` from memory (no advantages)

Before I came up with using a broadcast-load as a 1-uop insert, this had fewer front-end uops at the cost of more shuffles, and of doing wider loads from memory for 2 of the 4 sources. I don't think this has any advantages.

`vpermt2d ymm, ymm, [mem]` can micro-fuse into 1 load+shuffle uop for the front-end, on Skylake. (uops.info result: note RETIRE_SLOTS: 1.0 vs. UOPS_EXECUTED.THREAD: 2.0.)

That would require doing 256-bit loads from 2 of the four 128-bit memory operands. That would be slower if it crosses a cache-line boundary when a 128-bit load wouldn't have. (And it could fault if crossing into an unmapped page.) It would also require more shuffle control vectors. It could save front-end uops vs. `vinserti128`, but not vs. merge-masked `vbroadcasti32x4`.
It might be possible to use the same shuffle control for combining pairs and for the final ZMM `vpermt2d` or `q`. Maybe with `vpermt2q` for combining pairs and `vpermt2d` last? I haven't really thought through whether you could choose a ZMM shuffle vector such that the low YMM works for combining a pair of vectors with a different element size. Probably not.

Unfortunately `vpblendd ymm, ymm, [mem], imm8` doesn't micro-fuse.

If you happen to know how any of `[a..d]` are aligned relative to a cache-line boundary, you could avoid cache-line splits when doing a 256-bit load that includes the data you want as the low or high 128 bits, choosing your `vpermt2d` shuffle control appropriately.

Alternate strategy that mixes up the order of data, unless you have AVX512VBMI
Would work with AVX512VBMI `vpermb` (Ice Lake) instead of AVX512BW `vpshufb`.

5 fused-domain uops, 1 vector const, 3 masks.

Avoid the `vpermt2d` by using different masked broadcasts to distribute the 4 dwords of each 16-byte source chunk into separate lanes, such that every byte ends up somewhere, and each 16-byte lane of the result has data from all 4 vectors. (With `vpermb`, distributing across lanes is unnecessary; as described above you can just do whole-lane masking with masks like `0xf0`.)

Every lane has 4 bytes of data from each of a, b, c, and d, with no duplication, because every mask has a different set bit in each nibble.
With a 64-byte shuffle mask, you could do a shuffle in each lane that produces DCBA... in each lane, but with data from non-corresponding source positions.
This is probably not useful (without `vpermb`), but I started writing up this idea before realizing it was impossible with masked broadcasts to get the first 4 bytes of `[a]` into the same lane as the first 4 bytes of `[b]`, and so on.

The mask setup could actually be optimized to smaller code and fewer front-end uops, at the cost of higher latency before k2 and k3 are actually ready for use. A SIMD instruction that needs 16 mask bits ignores the higher bits in the mask register, so we can get all the mask data into one k register and right-shift it a couple of times to produce masks with what we want in the low 16 bits.
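For illustration only (the concrete mask values and the helper name are my own choice; any set of masks with one set bit per nibble, disjoint across the three masked broadcasts plus the unmasked first one, works):

```c
#include <immintrin.h>

// Sketch: derive k2 and k3 from one value by shifting, instead of three
// separate mov+kmov pairs.  The low 16 bits are what the broadcasts consume.
static inline void make_distribute_masks(__mmask16 *k1, __mmask16 *k2, __mmask16 *k3)
{
    __mmask32 kall = 0x21842184;                    // low 16 bits: 0x2184
    *k1 = (__mmask16)kall;                          // 0x2184
    *k2 = (__mmask16)_kshiftri_mask32(kall, 4);     // 0x4218
    *k3 = (__mmask16)_kshiftri_mask32(kall, 8);     // 0x8421
}
```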
But again, if you have `vpermb` then you only need 2 masks, `0xf0` and `0xf000`, using the `0xf0` mask with both `vbroadcasti32x4 ymm{k1}, [b]` and `vbroadcasti64x2 zmm{k1}, [c]`.