I'm working on some AVX-512 code and was looking for a function similar to `_mm256_sign_epi8`, but wasn't able to find an equivalent. Is there an equivalent instruction, or any other way to do this in AVX-512 with similar or lower CPI/latency? Thanks.
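AVX2 function example:
`z = _mm256_sign_epi8(x, y);`
The sign of each element of `x` is updated based on the sign of the corresponding element of `y` (negated where `y` is negative, zeroed where `y` is zero, unchanged where `y` is positive).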
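To pin down the behaviour being asked about, here is a minimal scalar sketch of what `_mm256_sign_epi8` does to each byte. The helper names `sign_i8` and `sign_i8_array` are hypothetical, made up for illustration; they are not library functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one byte lane of _mm256_sign_epi8(x, y).
 * Hypothetical helper, for illustration only. */
static int8_t sign_i8(int8_t x, int8_t y)
{
    if (y == 0)
        return 0;                                  /* y == 0: lane is zeroed */
    if (y < 0)
        /* y < 0: negate with wraparound, so negating -128 gives -128 again.
         * The narrowing conversion back to int8_t is implementation-defined
         * in older C standards; GCC and Clang wrap modulo 256. */
        return (int8_t)(uint8_t)(0u - (uint8_t)x);
    return x;                                      /* y > 0: lane passes through */
}

/* Apply the same rule across n bytes. */
static void sign_i8_array(int8_t *z, const int8_t *x, const int8_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = sign_i8(x[i], y[i]);
}
```

A model like this is also handy as a reference when checking any vectorized replacement lane by lane.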
That's correct, there's no AVX-512 version of any of the `vpsignb/w/d` instructions (https://felixcloutier.com/x86/psignb:psignw:psignd). If you're working with 256-bit vectors using AVX-512 (which is often pretty efficient), you can of course just use `_mm256_sign_epi8`; the compiler will arrange for the inputs and output to be in `ymm0-15` for the VEX-coded version, not `ymm16-31`.
For 512-bit vectors, you probably need 2 compares into masks and two masked operations to apply both the conditional-negation and the conditional-zeroing. I don't think the same functionality is available from a single instruction with a different name, so you can't get the same 1 uop with 1 cycle latency! IDK why they dropped it.
Unless you only need a simplified version that doesn't do the `b==0` part, just `b<0 ? 0-a : a`. As Daniel Lemire points out, that can be done in two instructions (https://lemire.me/blog/2024/01/11/implementing-the-missing-sign-instruction-in-avx-512/).
But there's some room for improvement in Daniel's full version, still using basically the same strategy of 2 compares and 2 masked ops.
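As a rough illustration of that 2-compare, 2-masked-op strategy, here is a minimal sketch that uses one zero-masked move followed by one merge-masked subtract from zero. The names `sign_epi8_512` and `sign_epi8_512_mem` are hypothetical, and the `target` attribute is an assumption (GCC/Clang-specific) so the snippet builds without `-mavx512bw`; it still needs a CPU with AVX-512BW to run. Treat it as a sketch, not a tuned implementation.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name, deliberately without an _mm512_ prefix. */
__attribute__((target("avx512f,avx512bw")))
static __m512i sign_epi8_512(__m512i a, __m512i b)
{
    __m512i   zero      = _mm512_setzero_si512();
    __mmask64 b_nonzero = _mm512_test_epi8_mask(b, b);  /* vptestmb: bit set where b != 0 */
    __mmask64 b_neg     = _mm512_movepi8_mask(b);       /* vpmovb2m: bit set where b < 0  */

    /* Zero-masking: lanes where b == 0 become 0, the rest keep a. */
    __m512i a_masked = _mm512_maskz_mov_epi8(b_nonzero, a);

    /* Merge-masking: lanes where b < 0 become 0 - a, the rest keep a_masked. */
    return _mm512_mask_sub_epi8(a_masked, b_neg, zero, a_masked);
}

/* Hypothetical convenience wrapper over 64-byte buffers (unaligned is fine). */
__attribute__((target("avx512f,avx512bw")))
static void sign_epi8_512_mem(int8_t dst[64], const int8_t a[64], const int8_t b[64])
{
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    _mm512_storeu_si512(dst, sign_epi8_512(va, vb));
}
```

Subtracting from the already zero-masked value is safe because `0-0 == 0`, and lanes where `b` is zero are never selected by the negative mask.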
We can use one merge-masked and one zero-masked operation, rather than two merge-masking ops. We do still need the zero constant in a register to subtract from, but at least we can avoid another asm instruction to copy or regenerate it, which would be needed if merge-masking into it destroyed that register value. GCC actually optimizes to `vpblendmb` so the zeroed vector is still around to subtract from, but clang does an insane `vpmovm2b zmm1, k0` to make a mask for `vpandq`. Clang also compared against zeros in static storage with `vpcmpltb ..., [rip + .LCPI1_0]` even though this function does need to zero a register. So Daniel Lemire's version compiles unexpectedly poorly with Clang, as well as being slightly sub-optimal if compiled literally as one would expect, or even with GCC's optimization of it. (Godbolt for that vs. my version.)
The test-for-zero can be `vptestmb`, saving code-size vs. `vpcmpb z,z, imm8` which needs an immediate (to select the comparison predicate, since AVX-512 integer compares aren't limited to just `eq` or signed-`gt` with different opcodes for different predicates).
If `a` is coming from memory, the compiler can optimize the first access to `a` into a zero-masked load. Daniel's version reads the original `a` again later, vs. this version only using the zero-masked `a`. That could be changed in Daniel's version orthogonal to other changes. (`0-0 == 0`, and `b` won't be negative in the elements where it was `0` anyway.)
We can look at asm for that case with a wrapper function that uses `_mm512_load_si512`, or just look at a non-inlined version that takes a reference instead of a by-value arg, `__m512i &a`.
Or perhaps it would be best to do the zero-masking last, so it could perhaps fold into the next use of the return value. Like `_mm512_add_epi8(x, sign_epi8(y,z))` - a compiler could optimize a final zero-masking into merge-masking for `vpaddb`.
With a memory-source `a`, this would have to load first; the merge destination has to be the asm destination register of the instruction. (So no savings on uops for the front-end or back-end vector ALU execution ports. Still four ALU uops, and either a normal load or a zero-masked load. Unlike with a register source where this is 4 vector execution port uops vs. 5.)
Don't name your own functions `_mm_whatever` - if a function by that name exists later, the conflict can cause problems. See "C program compiled with gcc -msse2 contains AVX1 instructions" for an example. (It's not inconceivable that some later AVX-512 or AVX10 extension will contain an EVEX `vpsignb` instruction, in which case we'd expect an intrinsic with a name like `_mm512_sign_epi8`.)
If we don't care about the b==0 zeroing special case
Daniel Lemire points out that some use-cases don't need the full power of `vpsignb`, just the conditional negation. That's cheaper, just 2 instructions (not counting zeroing a register): test and a merge-masked subtract-from-zero.
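A minimal sketch of that conditional-negation-only variant might look like the following. The function names are hypothetical, the `target` attribute is a GCC/Clang-specific assumption to avoid needing `-mavx512bw`, and taking the sign bits with `_mm512_movepi8_mask` is one choice for the test; a signed compare against zero such as `_mm512_cmplt_epi8_mask(b, zero)` should give the same mask.

```c
#include <immintrin.h>
#include <stdint.h>

/* b < 0 ? 0 - a : a, with no special case for b == 0.
 * Hypothetical name, for illustration only. */
__attribute__((target("avx512f,avx512bw")))
static __m512i negate_where_negative_epi8(__m512i a, __m512i b)
{
    __m512i   zero  = _mm512_setzero_si512();
    __mmask64 b_neg = _mm512_movepi8_mask(b);        /* one bit per byte: sign bit of b */

    /* Merge-masked subtract from zero: selected lanes get 0 - a, others keep a. */
    return _mm512_mask_sub_epi8(a, b_neg, zero, a);
}

/* Hypothetical convenience wrapper over 64-byte buffers (unaligned is fine). */
__attribute__((target("avx512f,avx512bw")))
static void negate_where_negative_mem(int8_t dst[64], const int8_t a[64], const int8_t b[64])
{
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    _mm512_storeu_si512(dst, negate_where_negative_epi8(va, vb));
}
```

Note that lanes where `b` is zero pass `a` through unchanged here, which is exactly where this differs from `vpsignb`.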