I'm working on some AVX-512 code and was looking for an equivalent of `_mm256_sign_epi8`, but couldn't find one. Is there an equivalent instruction, or some alternative way to do this in AVX-512 with similar or lower CPI/latency? Thanks.
AVX2 function example
z = _mm256_sign_epi8(x,y)
The sign of each element of `x` is updated based on the sign of the corresponding element of `y`.
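For reference, here is a rough scalar model of what one 8-bit lane of that intrinsic does, as I understand the documented `psignb` behaviour. The helper name `sign_epi8_lane` is made up for illustration and is not an intrinsic.

```c
#include <stdint.h>

/* Scalar model of one byte lane of _mm256_sign_epi8(x, y):
     y <  0  ->  -x  (wrapping, so negating -128 gives -128 again)
     y == 0  ->   0
     y >  0  ->   x
   sign_epi8_lane is a hypothetical name used only for this sketch. */
int8_t sign_epi8_lane(int8_t x, int8_t y)
{
    if (y < 0)
        return (int8_t)(uint8_t)(0u - (uint8_t)x);  /* two's-complement negate */
    if (y == 0)
        return 0;
    return x;
}
```

The zeroing case for `y == 0` is the part that makes this more than a plain conditional negate.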
That's correct, there's no AVX-512 version of any of the `vpsignb/w/d` instructions (https://felixcloutier.com/x86/psignb:psignw:psignd). If you're working with 256-bit vectors using AVX-512 (which is often pretty efficient), you can of course just use `_mm256_sign_epi8`; the compiler will arrange for the inputs and output to be in ymm0-15 for the VEX-coded version, not ymm16-31.

For 512-bit vectors, you probably need 2 compares into masks and two masked operations to apply both the conditional negation and the conditional zeroing. I don't think the same functionality is available from a single instruction with a different name, so you can't get the same 1 uop with 1 cycle latency! IDK why they dropped it.
Unless you only need a simplified version that doesn't do the `b==0` part, just `b<0 ? 0-a : a`. As Daniel Lemire points out, that can be done in two instructions. (https://lemire.me/blog/2024/01/11/implementing-the-missing-sign-instruction-in-avx-512/)

But there's some room for improvement in Daniel's full version, still using basically the same strategy of 2 compares and 2 masked ops.
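Here's a minimal sketch of that 2-compare, 2-masked-op strategy with full `vpsignb` semantics. It's my reconstruction for illustration, not code copied from the blog post. The function is called `sign_epi8` to match the call used later in this answer; the array wrapper `sign_epi8_64bytes` and the use of the GCC/Clang `target` attribute (so it builds without `-mavx512bw`) are assumptions added for the example.

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch only: one zero-masked and one merge-masked operation.
   Requires AVX-512BW at run time. */
__attribute__((target("avx512f,avx512bw")))
static inline __m512i sign_epi8(__m512i a, __m512i b)
{
    __m512i zero = _mm512_setzero_si512();
    __mmask64 b_nonzero = _mm512_test_epi8_mask(b, b);      /* b != 0 */
    __mmask64 b_neg     = _mm512_cmplt_epi8_mask(b, zero);  /* b <  0 */
    /* Zero-masking: keep a where b != 0, write 0 where b == 0. */
    __m512i a_or_zero = _mm512_maskz_mov_epi8(b_nonzero, a);
    /* Merge-masking: 0 - a where b < 0, else pass a_or_zero through. */
    return _mm512_mask_sub_epi8(a_or_zero, b_neg, zero, a_or_zero);
}

/* Hypothetical wrapper over exactly 64 bytes, so code built without
   AVX-512 enabled can call it after a runtime CPU feature check. */
__attribute__((target("avx512f,avx512bw")))
void sign_epi8_64bytes(const int8_t *a, const int8_t *b, int8_t *out)
{
    __m512i r = sign_epi8(_mm512_loadu_si512(a), _mm512_loadu_si512(b));
    _mm512_storeu_si512(out, r);
}
```

Only call the wrapper on a CPU that reports AVX-512BW, e.g. after checking `__builtin_cpu_supports("avx512bw")` with GCC or Clang. A compiler is free to pick different instructions than the intrinsics suggest, so check the asm if the exact uop count matters.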
We can use one merge-masked and one zero-masked operation, rather than two merge-masking ops. We do still need the zero constant in a register to subtract from, but at least we can avoid another asm instruction to copy or regenerate it after merge-masking into it destroys that register value. GCC actually optimizes to `vpblendmb` so the zeroed vector is still around to subtract from, but clang does an insane `vpmovm2b zmm1, k0` to make a mask for `vpandq`. Clang also compared against zeros in static storage with `vpcmpltb ..., [rip + .LCPI1_0]` even though this function does need to zero a register. So Daniel Lemire's version compiles unexpectedly poorly with Clang, as well as being slightly sub-optimal if compiled literally as one would expect, or even with GCC's optimization of it. (Godbolt for that vs. my version.)

The test-for-zero can be
`vptestmb`, saving code size vs. `vpcmpb z,z, imm8`, which needs an immediate to select the comparison predicate (AVX-512 integer compares aren't limited to just `eq` or signed-`gt` with different opcodes for different predicates).

If
`a` is coming from memory, the compiler can optimize the first access to `a` into a zero-masked load. Daniel's version reads the original `a` again later, vs. this version only using the zero-masked `a`. That could be changed in Daniel's version orthogonal to other changes. (`0-0 == 0`, and `b` won't be negative in the elements where it was `0` anyway.)

We can look at asm for that case with a wrapper function that uses
`_mm512_load_si512`, or just look at a non-inlined version that takes a reference instead of value arg, `__m512i &a`:

Or perhaps it would be best to do the zero-masking last, so it could perhaps fold into the next use of the return value. Like `_mm512_add_epi8(x, sign_epi8(y,z))`: a compiler could optimize a final zero-masking into merge-masking for `vpaddb`.

With a memory-source `a`, this would have to load first; the merge destination has to be the asm destination register of the instruction. (So no savings on uops for the front-end or back-end vector ALU execution ports. Still four ALU uops, and either a normal load or a zero-masked load. Unlike with a register source where this is 4 vector execution port uops vs. 5.)

Don't name your own functions
`_mm_whatever`; if a function by that name exists later, the conflict can cause problems. See "C program compiled with gcc -msse2 contains AVX1 instructions" for an example. (It's not inconceivable that some later AVX-512 or AVX10 extension will contain an EVEX `vpsignb` instruction, in which case we'd expect an intrinsic with this name.)

If we don't care about the `b==0` zeroing special case
Daniel Lemire points out that some use-cases don't need the full power of `vpsignb`, just the conditional negation. That's cheaper, just 2 instructions (not counting zeroing a register): test and a merge-masked subtract-from-zero.
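As a sketch of that cheaper variant: get a mask of the lanes where `b` is negative, then do a merge-masked `0 - a` into `a`. The names `negate_where_negative_epi8` and `negate_where_negative_64bytes` are hypothetical, and `_mm512_movepi8_mask` is just one way to get the `b<0` mask (a compare against zero would also work); I haven't checked this against the exact code in the blog post.

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch only: b < 0 ? 0 - a : a, with no special case for b == 0.
   Requires AVX-512BW at run time. */
__attribute__((target("avx512f,avx512bw")))
static inline __m512i negate_where_negative_epi8(__m512i a, __m512i b)
{
    __mmask64 b_neg = _mm512_movepi8_mask(b);  /* sign bit of each byte */
    /* Merge-masking: lanes with b >= 0 keep a unchanged. */
    return _mm512_mask_sub_epi8(a, b_neg, _mm512_setzero_si512(), a);
}

/* Hypothetical wrapper over exactly 64 bytes for callers that are
   built without AVX-512 and check CPU features at run time. */
__attribute__((target("avx512f,avx512bw")))
void negate_where_negative_64bytes(const int8_t *a, const int8_t *b,
                                   int8_t *out)
{
    __m512i r = negate_where_negative_epi8(_mm512_loadu_si512(a),
                                           _mm512_loadu_si512(b));
    _mm512_storeu_si512(out, r);
}
```

Note that lanes where `b` is zero return `a` here, not 0, so this is only a drop-in replacement when the caller never relies on the zeroing behaviour.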