Add saturate 32-bit signed ints intrinsics?


Can someone recommend a fast way to add 32-bit signed integers with saturation, using Intel intrinsics (AVX, SSE4, ...)?

I looked at the intrinsics guide and found _mm256_adds_epi16, but that only handles 16-bit elements. I don't see anything similar for 32 bits. The other add intrinsics seem to wrap around.

3 Answers


This link answers this very question:

https://software.intel.com/en-us/forums/topic/285219

Here's an example implementation:

#include <immintrin.h>
#include <stdint.h>

__m128i __inline __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_min = _mm_set1_epi32( INT32_MIN );
    const __m128i int_max = _mm_set1_epi32( INT32_MAX );

    __m128i res      = _mm_add_epi32( a, b );
    __m128i sign_and = _mm_and_si128( a, b );
    __m128i sign_or  = _mm_or_si128( a, b );

    // Negative overflow: both inputs negative, result non-negative
    __m128i min_sat_mask = _mm_andnot_si128( res, sign_and );
    // Positive overflow: both inputs non-negative, result negative
    __m128i max_sat_mask = _mm_andnot_si128( sign_or, res );

    // blendv_ps selects by the sign bit of each 32-bit mask lane
    __m128 res_temp = _mm_blendv_ps( _mm_castsi128_ps( res ),
                                     _mm_castsi128_ps( int_min ),
                                     _mm_castsi128_ps( min_sat_mask ) );

    return _mm_castps_si128( _mm_blendv_ps( res_temp,
                                            _mm_castsi128_ps( int_max ),
                                            _mm_castsi128_ps( max_sat_mask ) ) );
}

void addSaturate(int32_t* bufferA, int32_t* bufferB, size_t numSamples)
{
    //
    // Load, add with saturation, and store back to bufferA.
    // Unaligned loads/stores, so the buffers need not be 16-byte aligned.
    //
    __m128i* pSrc1 = (__m128i*)bufferA;
    __m128i* pSrc2 = (__m128i*)bufferB;

    for(size_t i = 0; i < numSamples/4; ++i)
    {
        __m128i res = __mm_adds_epi32(_mm_loadu_si128(pSrc1),
                                      _mm_loadu_si128(pSrc2));
        _mm_storeu_si128(pSrc1, res);

        pSrc1++;
        pSrc2++;
    }
}
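A scalar reference implementation is useful for checking the vector code lane by lane. A minimal sketch (my addition, not part of the original answer), using a 64-bit intermediate so the addition itself cannot overflow:

```c
#include <stdint.h>

/* Scalar reference: saturating 32-bit signed add.
   The 64-bit intermediate cannot overflow, so there is no UB. */
static int32_t adds_i32_scalar(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

Comparing addSaturate's output against this function over random inputs makes a quick sanity test.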
---

Here is a version which works on SSE2, with improvements for SSE4.1 (_mm_blendv_ps), AVX-512VL (_mm_ternarylogic_epi32), and AVX-512DQ (_mm_movepi32_mask, on Peter Cordes' suggestion).

#include <immintrin.h>
#include <stdint.h>

__m128i __mm_adds_epi32( __m128i a, __m128i b) {
  const __m128i int_max = _mm_set1_epi32(INT32_MAX);

  /* normal result (possibly wraps around) */
  const __m128i res = _mm_add_epi32(a, b);

  /* If result saturates, it has the same sign as both a and b */
  const __m128i sign_bit = _mm_srli_epi32(a, 31); /* shift sign to lowest bit */

  #if defined(__AVX512VL__)
    const __m128i overflow = _mm_ternarylogic_epi32(a, b, res, 0x42);
  #else
    const __m128i sign_xor = _mm_xor_si128(a, b);
    const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
  #endif

  #if defined(__AVX512DQ__) && defined(__AVX512VL__)
    return _mm_mask_add_epi32(res, _mm_movepi32_mask(overflow), int_max, sign_bit);
  #else
    const __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    #if defined(__SSE4_1__)
      return
        _mm_castps_si128(
          _mm_blendv_ps(
            _mm_castsi128_ps(res),
            _mm_castsi128_ps(saturated),
            _mm_castsi128_ps(overflow)
          )
        );
    #else
      const __m128i overflow_mask = _mm_srai_epi32(overflow, 31);
      return
        _mm_or_si128(
          _mm_and_si128(overflow_mask, saturated),
          _mm_andnot_si128(overflow_mask, res)
        );
    #endif
  #endif
}

I did this for SIMDe's implementation of the NEON vqaddq_s32 (and the MSA __msa_adds_s_b); if you need other versions you should be able to adapt them from simde/arm/neon/qadd.h. For 128-bit vectors, in addition to what SSE supports (8/16-bit, both signed and unsigned), there are:

  • vqaddq_s32 (think _mm_adds_epi32)
  • vqaddq_s64 (think _mm_adds_epi64)
  • vqaddq_u32 (think _mm_adds_epu32)

vqaddq_u64 (think _mm_adds_epu64) is also present, but it currently relies on vector extensions. I could (and probably should) port the generated code to intrinsics, but to be honest I'm not sure how to improve on it, so I haven't bothered.
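When none of these ISA extensions are available, GCC and Clang offer __builtin_add_overflow, which detects signed overflow without undefined behavior. A portable scalar fallback along those lines (a sketch of mine, not from the answer):

```c
#include <stdint.h>

/* Scalar fallback (GCC/Clang): __builtin_add_overflow reports signed
   overflow without UB. On overflow the saturated result takes the sign
   of the inputs: INT32_MIN if negative, INT32_MAX otherwise. */
static int32_t adds_i32_fallback(int32_t a, int32_t b)
{
    int32_t sum;
    if (__builtin_add_overflow(a, b, &sum))
        sum = (a < 0) ? INT32_MIN : INT32_MAX;
    return sum;
}
```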

---

A signed overflow will happen if (and only if):

  • the signs of both inputs are the same, and
  • the sign of the sum (when added with wrap-around) is different from the inputs

Using C operators: overflow = ~(a^b) & (a^(a+b)).
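In scalar C, this test can be written directly (a sketch of the formula above; unsigned arithmetic sidesteps undefined behavior on wrap-around):

```c
#include <stdint.h>

/* Returns 1 iff a + b overflows as signed 32-bit addition.
   Mirrors ~(a^b) & (a^(a+b)): the sign bit of that expression
   is the overflow flag. */
static int add_overflows_i32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;  /* wrap-around add */
    return (int)((~(ua ^ ub) & (ua ^ sum)) >> 31);
}
```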

Also, if an overflow happens, the saturated result has the same sign as both inputs. Using the int_min = int_max+1 trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:

__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res      = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    // saturation happened if the inputs have the same sign,
    // but the sign of the result is different:
    __m128i sign_xor  = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));

    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                          _mm_castsi128_ps(saturated),
                                          _mm_castsi128_ps( overflow ) ) );
}

If blendvps is as fast as (or faster than) a shift and an addition on your target (also considering port usage), you can of course just blend int_min and int_max directly, using the sign bits of a. Also, if you only have SSE2 or SSE3, you can replace the final blend with an arithmetic right shift (of overflow) by 31 bits and manual blending (using and/andnot/or).

And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).

Addendum: If you know the sign of either a or b at compile time, you can directly set saturated accordingly and save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for positive a and _mm_andnot_si128(res, b) for negative a (with res = a+b).

Test case / demo: https://godbolt.org/z/v1bsc85nG