How can I replace a missing VPERMIL2PS instruction, using equivalent instructions in AVX2?
VPERMIL2PS ymm1, ymm2, ymm3, ymm4/m256, imz2
Permute single-precision floatingpoint values in ymm2 and ymm3 using controls from ymm4/mem, the results are stored in ymm1 with selective zero-match controls.
VPERMIL2PS (VEX.256 encoded version)
DEST[31:0] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[3:0])
DEST[63:32] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[35:32])
DEST[95:64] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[67:64])
DEST[127:96] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[99:96])
DEST[159:128] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[131:128])
DEST[191:160] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[163:160])
DEST[223:192] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[195:192])
DEST[255:224] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[227:224])
Intel C/C++ Compiler Intrinsic Equivalent
VPERMIL2PS __m128 _mm_permute2_ps (__m128 a, __m128 b, __m128i ctrl, int imm)
VPERMIL2PS __m256 _mm256_permute2_ps (__m256 a, __m256 b, __m256i ctrl, int imm)
VPERMIL2PS ymm1, ymm2, ymm3,ymm4/m256, imz2 Description - Permute single-precision floatingpoint values in ymm2 and ymm3 using controls from ymm4/mem, the results are stored in ymm1 with selective zero-match controls. imz2: Part of the is4 immediate byte providing control functions that apply to two-source permute instructions.
The closest instruction is VPERMILPS .. and this instruction still works
VPERMILPS (256-bit immediate version)
DEST[31:0] Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] Select4(SRC1[127:0], imm8[7:6]);
DEST[159:128] Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] Select4(SRC1[255:128], imm8[5:4]);
DEST[255:224] Select4(SRC1[255:128], imm8[7:6]);
VPERMILPS ymm1, ymm2, ymm3/m256 Description - RVM V/V AVX Permute single-precision floating-point values in ymm2 using controls from ymm3/mem and store result in ymm1.
It’s hard for me to say how it will be right, because for reliability, you need to emulate the instruction VPERMIL2PS, therefore I appeal to local specialists!
Recent Intel(R) AVX Architectural Changes January 29, 2009 Removed: VPERMIL2PS and VPERMIL2PD
All PERMIL2 instructions are gone – both the 128-bit and 256-bit flavors. Like the FMA below, they used the VEX.W bit to select which source was from memory – we’re not moving in the direction of using VEX.W for that purpose any more.
Intel compiler does not understand this VPERMIL2PS instruction.
AVX-512 instructions require the latest processors, this is not a general solution .. The visual studio assembles this instruction successfully, but the instruction cannot be executed on the processor, throwing an exception.
Disassembled code
align 20h;
Yperm_msk ymmword 000000000100000006000000070000000C0000000D0000000A0000000B000000h
vmovups ymm0, [rbp+920h+var_8C0]
vmovdqu ymm1, Yperm_msk
vpermil2ps ymm0, ymm0, [rbp+920h+var_880], ymm1, 920h+var_920
vmovups [rbp+920h+var_1A0], ymm0
Full description of the instruction
Operation
select2sp(src1, src2, sel) // This macro is used by another macro “sel_and_condzerosp“ below
{
if (sel[2:0]=0) then TMP src1[31:0]
if (sel[2:0]=1) then TMP src1[63:32]
if (sel[2:0]=2) then TMP src1[95:64]
if (sel[2:0]=3) then TMP src1[127:96]
if (sel[2:0]=4) then TMP src2[31:0]
if (sel[2:0]=5) then TMP src2[63:32]
if (sel[2:0]=6) then TMP src2[95:64]
if (sel[2:0]=7) then TMP src2[127:96]
return TMP
}
sel_and_condzerosp(src1, src2, sel) // This macro is used by VPERMIL2PS
{
TMP[31:0] select2sp(src1[127:0], src2[127:0], sel[2:0])
IF (imm8[1:0] = 2) AND (sel[3]=1) THEN TMP[31:0] 0
IF (imm8[1:0] = 3) AND (sel[3]=0) THEN TMP[31:0] 0
return TMP
}
VPERMIL2PS (VEX.256 encoded version)
DEST[31:0] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[3:0])
DEST[63:32] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[35:32])
DEST[95:64] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[67:64])
DEST[127:96] sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[99:96])
DEST[159:128] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[131:128])
DEST[191:160] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[163:160])
DEST[223:192] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[195:192])
DEST[255:224] sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[227:224])
The way the Bochs emulates this instruction
class bxInstruction_c;
void BX_CPP_AttrRegparmN(1) BX_CPU_C::VPERMIL2PS_VdqHdqWdqIbR(bxInstruction_c *i)
{
BxPackedYmmRegister op1 = BX_READ_YMM_REG(i->src1());
BxPackedYmmRegister op2 = BX_READ_YMM_REG(i->src2());
BxPackedYmmRegister op3 = BX_READ_YMM_REG(i->src3()), result;
unsigned len = i->getVL();
result.clear();
for (unsigned n=0; n < len; n++) {
xmm_permil2ps(&result.ymm128(n), &op1.ymm128(n), &op2.ymm128(n), &op3.ymm128(n), i->Ib() & 3);
}
BX_WRITE_YMM_REGZ_VLEN(i->dst(), result, len);
BX_NEXT_INSTR(i);
}
BX_CPP_INLINE void xmm_permil2ps(BxPackedXmmRegister *r, const BxPackedXmmRegister *op1, const BxPackedXmmRegister *op2, const BxPackedXmmRegister *op3, unsigned m2z)
{
for(unsigned n=0; n < 4; n++) {
Bit32u ctrl = op3->xmm32u(n);
if ((m2z ^ ((ctrl >> 3) & 0x1)) == 0x3)
r->xmm32u(n) = 0;
else
r->xmm32u(n) = (ctrl & 0x4) ? op1->xmm32u(ctrl & 0x3) : op2->xmm32u(ctrl & 0x3);
}
}
They're not "gone", they never existed in any real CPUs in the first place. 2009 is before the first CPU with AVX1 was released, while AVX was still in planning stages. IDK what you were looking at that even mentioned them.
Current versions of the ISA ref manual, or HTML extracts of it don't mention it. Neither does Intel's intrinsics guide. Maybe a 10-year-old version of a "future extensions" manual from before Sandybridge was released?
No you don't, it never existed in the first place so there's no code that uses it. (Or very little; possibly some written in anticipation based on early pre-release AVX documentation). You only need to implement exactly the functionality that you need for any given problem.
You tagged this (AMD) XOP but you only cited Intel documents; XOP did have some 2-input shuffles I think but I didn't go check the docs. Of course only ever for 128-bit vectors.
AVX1 does have some 2-input shuffles but none with variable control. There's
vshufps
/pd
with immediate control, andvunpckl/hps
and...pd
that do two separate in-lane versions of the corresponding 128-bit shuffle.Worst case, you can build any fixed 2-input in-lane shuffle out of 2x
vshufps
+vblendps
. Best-case is onevshufps
, or in the middle isvshufps
+vblendps
or 2xvshufps
(e.g. collect the elements you want into one vector then put them in the right order). Any of thosevshufps
shuffles can bevunpcklps
orhps
. Keep in mind that immediatevblendps
is cheap but shuffles only have 1/clock throughput on Intel (port 5 only until Ice Lake).You could even use variable-control 2x
vpermilps
and compare or shift +vblendvps
to emulatevpermil2ps
, becausevpermilps
ignores high bits in the index. So this would be the BOCHS implementation of(ctrl & 0x4) ? op2[ctrl & 0x3] : op2[ctrl & 0x3];
where you shuffle both inputs onctrl
withvpermilps
(which implicitly only looks at the low 2 bits), and you blend onctrl & 4
by shifting that bit to the top with an integer shift.(Optionally also emulate the conditional zeroing with
vandps
by usingvpslld
to put the 3rd index bit at the top for blend, andvpsrad
or a compare-against-zero result to create an AND mask forvpand
. Or on Skylake,vblendvps
is 2 uops for any port so you could just use that to blend in zeros instead of shift/and or cmp/and).But don't just naively drop this in if you care about performance for a compile-time constant shuffle control. Instead build the equivalent shuffle out of the available 2-input operations. That's why I'm not bothering to write out a full implementation in C.
AVX2 only added a few new 2-input shuffles that might be useful here: 256-bit
vpalignr
which is like 2 in-lanepalignr
instructions. It also added integervpunpckl/h b/w/d/q
but we already havevunpckl/hps
from AVX1.A true variable-control 2-input shuffle didn't appear until AVX512F
vpermt2ps
andvpermi2ps
/pd
.But it doesn't support conditional zeroing based on high bits of index elements like
pshufb
or the proposedvpermil2ps
; instead use a mask register for zero masking. e.g.Or probably better to use
vpfclassps k1, ymm0, some_constant
to getk1
set for non-negative values, avoiding the need for aknot
. On Skylake-X it's a single uop.Or use
vptestnmd
with aset1(1UL<<31)
mask to set a mask register =!signbit
of a vector.It's also not "in lane" so you'd potentially need to tweak the indices, adding 8 for indices > 4 I think.
vpermi/t2ps
indexes into the concatenation of the two vectors, so cross-lane within one source happens before selecting the other input.