Is there an intrinsic or another efficient way to repack the high/low 32-bit halves of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is OK.

So far I'm using the following code, but my profiler says it's slow on Ryzen 1800X:
```cpp
// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

// ...
// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(
    _mm256_permutevar8x32_epi32(x, gHigh32Permute)); // This seems to take 3 cycles
```
That shuffle + cast with `_mm256_permutevar8x32_ps` is optimal for one vector on Intel and Zen 4 or later: a single one-uop instruction is the best you can get. (It's 2 uops on AMD Zen 2 and Zen 3, 1 uop on Zen 4: https://uops.info/.)

Use `vpermps` instead of `vpermd` to avoid any risk of an int/FP bypass delay in case your input vector was created by a `pd` instruction rather than a load or something. Using the result of an FP shuffle as an input to an integer instruction is usually fine on Intel (I'm less sure about feeding the result of an FP instruction to an integer shuffle).

If tuning for Intel, you can change the surrounding code so you shuffle into the bottom 64 bits of each 128-bit lane, avoiding the lane-crossing shuffle. (Then you can just use `vshufps ymm`, or, if tuning for KNL, `vpermilps`, since 2-input `vshufps` is slower there.)

With AVX-512, there's `_mm256_cvtepi64_epi32` (`vpmovqd`), which packs elements across lanes, with truncation.

Lane-crossing shuffles are slow on Zen 1. Agner Fog doesn't have numbers for `vpermd`, but he lists `vpermps` (which probably uses the same hardware internally) at 3 uops, 5c latency, one per 4c throughput. https://uops.info/ confirms those numbers for Zen 1. Zen 2 and Zen 3 have 256-bit-wide vector execution units for the most part, but their lane-crossing shuffles with elements smaller than 128 bits sometimes take multiple uops. Zen 4 improves things, e.g. `vpermps` at one per 0.5 cycles of throughput, with 4c latency.

`vextractf128 xmm, ymm, 1`
is very efficient on Zen 1 (1c latency, 0.33c throughput), which is not surprising since Zen 1 tracks 256-bit registers as two 128-bit halves. `shufps` is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128-bit registers into the result you want.

This also saves you a register for the `vpermps` shuffle mask you don't need anymore. (You'd need one `vpermps` to get the elements you want grouped into the high and low lanes for `vextractf128`, or, if latency is important, two control vectors for 2x `vpermps` on CPUs where it's single-uop.) So for CPUs with multi-uop `vpermps`, especially Zen 1, I'd suggest extract + 2x `vshufps`.

On Intel, using three shuffles instead of two reaches two thirds of the optimal throughput, with one cycle of extra latency for the first result.
On Zen 2 and Zen 3, where `vpermps` is 2 uops vs. 1 for `vextractf128`, extract + 2x `vshufps` is better than 2x `vpermps`.

Also, the E-cores on Alder Lake have 2-uop `vpermps` but 1-uop `vextractf128` and `vshufps xmm`.