SSE4.1 unsigned integer comparison with overflow


Is there any way to perform a comparison like C >= (A + B) with SSE2/4.1 instructions, given that 16-bit unsigned addition (_mm_add_epi16()) can overflow?

The code looks like this:

#define _mm_cmpge_epu16(a, b) _mm_cmpeq_epi16(_mm_max_epu16(a, b), a)

__m128i *a = (__m128i *)&ptr1;
__m128i *b = (__m128i *)&ptr2;
__m128i *c = (__m128i *)&ptr3;
            
__m128i xa = _mm_lddqu_si128(a);
__m128i xb = _mm_lddqu_si128(b);
__m128i xc = _mm_lddqu_si128(c);

__m128i res = _mm_add_epi16(xa, xb);
__m128i xmm3 = _mm_cmpge_epu16(xc, res);

The issue is that when the 16-bit addition overflows (wraps around), the greater-than-or-equal comparison produces false positives. I can't use saturated addition for my purpose. I have looked at the mechanism for detecting unsigned-addition overflow described in "SSE2 integer overflow checking", but how do I use it for a greater-than-or-equal comparison?
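To make the false positive concrete, here is a minimal scalar sketch of what happens in a single 16-bit lane. The helper names wrappedCompare and exactCompare are hypothetical and only model the behaviour described above; they are not part of the original code.

```cpp
#include <cstdint>

// Hypothetical one-lane model of the question's comparison: the sum is
// truncated to 16 bits, as _mm_add_epi16 does, before it is compared with c.
constexpr bool wrappedCompare(uint16_t a, uint16_t b, uint16_t c)
{
    return c >= static_cast<uint16_t>(a + b);  // sum wraps modulo 65536
}

// What the question actually asks for: the comparison on the full-width sum.
constexpr bool exactCompare(uint16_t a, uint16_t b, uint16_t c)
{
    return static_cast<uint32_t>(c) >= static_cast<uint32_t>(a) + b;
}

// 65535 + 2 wraps to 1, so the wrapped test wrongly reports 10 >= a + b.
static_assert(wrappedCompare(65535, 2, 10), "false positive after wrap-around");
static_assert(!exactCompare(65535, 2, 10), "the true sum 65537 exceeds 10");
```

Whenever the sum does not wrap, the two functions agree, so only the overflowing lanes need special handling.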


There are 2 best solutions below

Best answer:

You build the missing primitives from what the instruction set makes available. Here’s one possible implementation, untested.

#include <smmintrin.h>  // SSE4.1, needed for _mm_max_epu16

// Compare uint16_t lanes for a >= b
inline __m128i cmpge_epu16( __m128i a, __m128i b )
{
    const __m128i max = _mm_max_epu16( a, b );
    return _mm_cmpeq_epi16( max, a );
}

// Compare uint16_t lanes for c >= a + b, with overflow handling
__m128i cmpgeSum( __m128i a, __m128i b, __m128i c )
{
    // Compute c >= a + b, ignoring overflow issues
    const __m128i sum = _mm_add_epi16( a, b );
    const __m128i ge = cmpge_epu16( c, sum );

    // Detect overflow of a + b
    const __m128i sumSaturated = _mm_adds_epu16( a, b );
    const __m128i sumInRange = _mm_cmpeq_epi16( sum, sumSaturated );

    // Combine the two
    return _mm_and_si128( ge, sumInRange );
}
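Since the implementation above is described as untested, the same logic can be sanity-checked lane by lane in plain scalar code. This is only a sketch: addSaturated, cmpgeSumLane, wideReference and countMismatches are hypothetical names that model one 16-bit lane of the intrinsics, and they do not exercise the SSE code itself.

```cpp
#include <cstdint>

// Scalar stand-in for one lane of _mm_adds_epu16: clamp at 65535 instead of wrapping.
constexpr uint16_t addSaturated(uint16_t a, uint16_t b)
{
    return static_cast<uint16_t>(a + b > 0xFFFF ? 0xFFFF : a + b);
}

// One lane of cmpgeSum: the wrapped comparison, masked by "the sum did not overflow".
constexpr bool cmpgeSumLane(uint16_t a, uint16_t b, uint16_t c)
{
    return (c >= static_cast<uint16_t>(a + b))
        && (static_cast<uint16_t>(a + b) == addSaturated(a, b));
}

// Reference result computed in 32 bits, where a + b cannot wrap.
constexpr bool wideReference(uint16_t a, uint16_t b, uint16_t c)
{
    return static_cast<uint32_t>(c) >= static_cast<uint32_t>(a) + b;
}

// Compare the lane model with the reference over a grid of boundary values.
inline int countMismatches()
{
    const uint16_t edge[] = {0, 1, 2, 0x7FFF, 0x8000, 0x8001, 0xFFFE, 0xFFFF};
    int mismatches = 0;
    for (uint16_t a : edge)
        for (uint16_t b : edge)
            for (uint16_t c : edge)
                if (cmpgeSumLane(a, b, c) != wideReference(a, b, c))
                    ++mismatches;
    return mismatches;
}
```

The mask works because an overflowing sum wraps to at most 65534 while the saturated sum is 65535, so the two are equal exactly when no overflow occurred.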

Another answer:

Here are a few reasonable approaches:

#include <cstdint>
using v8u16 = uint16_t __attribute__((vector_size(16)));

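// c >= a guarantees that c - a does not wrap, so c - a >= b is then exact
v8u16 lthsum1(v8u16 a, v8u16 b, v8u16 c) {
    return (c >= a) & (c - a >= b);
}

// a + b >= a holds only when the sum did not wrap
v8u16 lthsum2(v8u16 a, v8u16 b, v8u16 c) {
    return (a + b >= a) & (a + b <= c);
}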

You can see how this gets compiled on godbolt. Both approaches are broadly equivalent. With gcc I'm not seeing large changes from -msse4.1, but AVX2 and later do improve the code. clang also gets minor improvements from SSE4.1 for the second variant, and with AVX512BW it does pretty well for itself.
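If you want to convince yourself that both variants agree with a full-width comparison, a small check along these lines should do. It is a sketch that assumes the GCC/clang vector extensions used above; viaDifference and viaSum mirror lthsum1 and lthsum2 with an explicit cast on the result, and countLaneMismatches is a hypothetical helper.

```cpp
#include <cstdint>

using v8u16 = uint16_t __attribute__((vector_size(16)));

// Mirrors lthsum1: c >= a rules out wrap-around in c - a.
inline v8u16 viaDifference(v8u16 a, v8u16 b, v8u16 c)
{
    return (v8u16)((c >= a) & (c - a >= b));
}

// Mirrors lthsum2: a + b >= a is true only when the sum did not wrap.
inline v8u16 viaSum(v8u16 a, v8u16 b, v8u16 c)
{
    return (v8u16)((a + b >= a) & (a + b <= c));
}

// Count lanes where either variant differs from a 32-bit scalar reference.
// A true lane is expected to be all ones (0xFFFF), a false lane zero.
inline int countLaneMismatches(v8u16 a, v8u16 b, v8u16 c)
{
    const v8u16 r1 = viaDifference(a, b, c);
    const v8u16 r2 = viaSum(a, b, c);
    int mismatches = 0;
    for (int i = 0; i < 8; ++i) {
        const bool expected =
            static_cast<uint32_t>(c[i]) >= static_cast<uint32_t>(a[i]) + b[i];
        const uint16_t mask = expected ? 0xFFFF : 0;
        if (r1[i] != mask) ++mismatches;
        if (r2[i] != mask) ++mismatches;
    }
    return mismatches;
}
```

Feeding it lanes that sit right at the wrap-around boundary (for example 65535 + 2 or 32768 + 32768) is the interesting case, since those are the inputs the naive comparison gets wrong.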