Equivalent function for _mm256_sign_epi8 in AVX512

I was working on some AVX-512 code and was looking for a function similar to _mm256_sign_epi8, but couldn't find an equivalent. It would be really useful to have one. Is there an equivalent instruction, or another way to do this in AVX-512 with similar or lower cost (throughput/latency)? Thanks

AVX2 function example

z = _mm256_sign_epi8(x,y)

Based on the sign of each element of y, the corresponding element of x is updated: negated where y is negative, left unchanged where y is positive, and zeroed where y is zero.
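
In scalar terms, the per-byte operation is roughly the following (a reference sketch of the documented vpsignb behaviour, not code from the question):

// Scalar reference for one byte lane of _mm256_sign_epi8.
// Note: negating -128 wraps back to -128, matching the instruction (no saturation).
signed char sign_epi8_scalar(signed char x, signed char y)
{
    if (y < 0)  return (signed char)-x;   // y negative: negate x
    if (y == 0) return 0;                 // y zero: zero the result
    return x;                             // y positive: pass x through
}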

There are 2 answers below.

That's correct, there's no AVX-512 version of any of the vpsignb/w/d instructions (https://felixcloutier.com/x86/psignb:psignw:psignd). If you're working with 256-bit vectors using AVX-512 (which is often pretty efficient), you can of course just use _mm256_sign_epi8; the compiler will arrange for the inputs and output to be in ymm0-15 for the VEX-coded version, not ymm16-31.
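
For example, a minimal sketch of 256-bit code built with AVX-512 enabled (e.g. -march=x86-64-v4); the function name and the clamp logic are mine, just to illustrate mixing the AVX2 intrinsic with AVX-512VL masked ops:

#include <immintrin.h>

// 256-bit vectors with AVX-512 features enabled: vpsignb only has a VEX
// encoding, so the compiler keeps these operands in ymm0-15, while the
// AVX-512VL masked ops below still use EVEX encodings.
__m256i sign_then_clamp(__m256i x, __m256i y, __m256i limit)
{
    __m256i s = _mm256_sign_epi8(x, y);                   // VEX-coded vpsignb
    __mmask32 over = _mm256_cmpgt_epi8_mask(s, limit);    // AVX-512VL+BW compare into mask
    return _mm256_mask_mov_epi8(s, over, limit);          // clamp elements where s > limit
}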

For 512-bit vectors, you probably need two compares into masks and two masked operations to apply both the conditional negation and the conditional zeroing. I don't think the same functionality is available from a single instruction under a different name, so you can't get the same 1 uop with 1-cycle latency. IDK why they dropped it.

Unless you only need a simplified version that doesn't do the b==0 part, just b<0 ? 0-a : a. As Daniel Lemire points out, that can be done in two instructions (https://lemire.me/blog/2024/01/11/implementing-the-missing-sign-instruction-in-avx-512/).
But there's some room for improvement in Daniel's full version, which still uses basically the same strategy of two compares and two masked ops.


We can use one merge-masked and one zero-masked operation, rather than two merge-masking ops. We still need the zero constant in a register to subtract from, but at least we avoid needing an extra asm instruction to copy or regenerate it, which would be necessary if merge-masking into that register destroyed its value. GCC actually optimizes Daniel's version to vpblendmb so the zeroed vector is still around to subtract from, but clang does an insane vpmovm2b zmm1, k0 to make a vector mask for vpandq. Clang also compared against zeros in static storage with vpcmpltb ..., [rip + .LCPI1_0] even though this function does need to zero a register anyway. So Daniel Lemire's version compiles unexpectedly poorly with Clang, as well as being slightly sub-optimal if compiled literally as written, or even with GCC's optimization of it. (Godbolt for that vs. my version.)

The test-for-zero can be vptestmb, saving code size vs. vpcmpb k, zmm, zmm, imm8, which needs an immediate to select the comparison predicate (AVX-512 integer compares aren't limited to just eq or signed-gt with different opcodes for different predicates).

#include <immintrin.h>

// don't define your own functions in reserved namespace like _...
__m512i m512_sign_epi8(__m512i a, __m512i b)
{
  __mmask64 b_nonzero = _mm512_test_epi8_mask(b, b);
  __mmask64 b_neg = _mm512_movepi8_mask(b);  // extract sign bits: b < 0

  __m512i a_zeroed = _mm512_maskz_mov_epi8(b_nonzero, a);  // (b!=0) ? a : 0
  return _mm512_mask_sub_epi8(a_zeroed, b_neg, _mm512_setzero_si512(), a_zeroed);  // b_neg ? 0-a_zeroed : a_zeroed
}

If a is coming from memory, the compiler can optimize the first access to a into a zero-masked load. Daniel's version reads the original a again later, vs. this version only using the zero-masked a. That could be changed in Daniel's version independently of the other changes. (0-0 == 0, and b won't be negative in the elements where it was 0 anyway, so using the zeroed copy of a gives the same results.)

We can look at asm for that case with a wrapper function that uses _mm512_load_si512, or just look at a non-inlined version that takes a reference instead of a value arg, __m512i &a:

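Something like this (reconstructed from the asm signature below, not copied from the answer; it's C++ because of the reference parameter, with the same body as m512_sign_epi8 above):

// Same body as above, but `a` is passed by reference so the compiler treats
// it as a memory source and can fold the first use into a zero-masked load.
__m512i m512_sign_epi8(__m512i &a, __m512i b)
{
  __mmask64 b_nonzero = _mm512_test_epi8_mask(b, b);
  __mmask64 b_neg = _mm512_movepi8_mask(b);

  __m512i a_zeroed = _mm512_maskz_mov_epi8(b_nonzero, a);  // (b!=0) ? a : 0
  return _mm512_mask_sub_epi8(a_zeroed, b_neg, _mm512_setzero_si512(), a_zeroed);
}
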
# clang18 -O3 -march=x86-64-v4  # memory-source `a`
m512_sign_epi8(long long __vector(8)&, long long __vector(8)):
        vptestmb        k1, zmm0, zmm0     # b_nonzero
        vpxor   xmm1, xmm1, xmm1           # can get hoisted out of loops
        vpmovb2m        k2, zmm0           # b_neg
        vmovdqu8        zmm0 {k1} {z}, zmmword ptr [rdi]  # zero-masked load.
        vpsubb  zmm0 {k2}, zmm1, zmm0
        ret

Or it might be best to do the zero-masking last, so it can fold into the next use of the return value. For example, with _mm512_add_epi8(x, sign_epi8(y,z)), a compiler could optimize a final zero-masking into merge-masking for the vpaddb.

// worse if a comes from memory
// better if the final maskz can fold into the next use of the result
// Only Clang17 and later do this, GCC and earlier clang miss that optimization
__m512i m512_sign_epi8_foldable(__m512i a, __m512i b)
{
  __mmask64 b_neg = _mm512_movepi8_mask(b);  // extract sign bits: b < 0
  __mmask64 b_nonzero = _mm512_test_epi8_mask(b, b);

  __m512i a_neg = _mm512_mask_sub_epi8(a, b_neg, _mm512_setzero_si512(), a);  // b_neg ? 0-a : a
  return _mm512_maskz_mov_epi8(b_nonzero, a_neg);  // (b!=0) ? a_neg : 0
}

__m512i fold_sign(__m512i x, __m512i y, __m512i z)
{
    return _mm512_add_epi8(x, m512_sign_epi8_foldable(y,z));
}
# clang 18 -O3 -march=x86-64-v4
fold_sign(long long __vector(8), long long __vector(8), long long __vector(8))
        vpmovb2m        k1, zmm2
        vpxor   xmm3, xmm3, xmm3
        vptestmb        k2, zmm2, zmm2
        vpsubb  zmm1 {k1}, zmm3, zmm1
 # zero-masking of the vpsubb result optimized away,
 #  folded into merge-masking for the add
        vpaddb  zmm0 {k2}, zmm0, zmm1
        ret

With a memory-source a, this version would have to do the load first; the merge destination has to be the asm destination register of the instruction. (So no savings in uops for the front-end or the back-end vector ALU execution ports: still four ALU uops, plus either a normal load or a zero-masked load. Unlike with a register source, where this version is 4 vector execution-port uops vs. 5.)


Don't name your own functions _mm_whatever - if a function by that name exists later, the conflict can cause problems. See C program compiled with gcc -msse2 contains AVX1 instructions for an example. (It's not inconceivable that some later AVX-512 or AVX10 extension will contain an EVEX vpsignb instruction, in which case we'd expect an intrinsic with this name.)


If we don't care about the b==0 zeroing special case

Daniel Lemire points out that some use-cases don't need the full power of vpsignb, just the conditional negation. That's cheaper: just two instructions (not counting zeroing a register), a sign-bit extraction into a mask (vpmovb2m) and a merge-masked subtract-from-zero.

// this is efficient; nothing to improve on here
__m512i lemire_mm512_sign_epi8_no_zeroing(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  return _mm512_mask_sub_epi8(a, blt0, zero, a);  // b<0 ? 0-a : a
}

There is no direct equivalent of _mm256_sign_epi8 in AVX-512.

Quoting https://lemire.me/blog/2024/01/11/implementing-the-missing-sign-instruction-in-avx-512/ , one possible replacement is:

#include <immintrin.h>

__m512i _mm512_sign_epi8(__m512i a, __m512i b) {
  // Set a 512-bit integer vector of all zeros.
  __m512i zero = _mm512_setzero_si512();
  // Build a 64-bit mask where each bit indicates whether the corresponding element of b is < 0.
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  // Build a 64-bit mask where each bit indicates whether the corresponding element of b is <= 0.
  __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);
  // Copy elements from `a` where the mask blt0 is true, otherwise use zero.
  __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);
  // Return `0 - a_blt0` where the mask ble0 is true, otherwise use a.
  return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);
}
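
For completeness, a minimal test sketch of my own (not from the blog post) that exercises the three cases; it assumes it sits in the same file as the function above:

#include <stdio.h>

int main(void)
{
    __m512i a = _mm512_set1_epi8(5);
    signed char out[64];

    _mm512_storeu_si512(out, _mm512_sign_epi8(a, _mm512_set1_epi8(-3)));
    printf("b < 0:  %d\n", out[0]);   // expect -5
    _mm512_storeu_si512(out, _mm512_sign_epi8(a, _mm512_set1_epi8(0)));
    printf("b == 0: %d\n", out[0]);   // expect 0
    _mm512_storeu_si512(out, _mm512_sign_epi8(a, _mm512_set1_epi8(7)));
    printf("b > 0:  %d\n", out[0]);   // expect 5
}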