Add two vectors (uint64_t type) with saturation for each int8_t element


I was recently given the following problem:

There are 8 elements in the vector, each is represented by int8_t.

Implement an algorithm in x86_64 that will add two vectors (uint64_t type).

Adding elements should be done with saturation arithmetic taken into account.


80 + 60 = 127

(−40) + (−100) = −128
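
For reference, the per-element behaviour I need matches the scalar C sketch below. The name `sat_add_i8` is my own, and the function uses branches, so it only serves as a specification and not as a solution that meets the restrictions.

```c
#include <stdint.h>

/* Add two signed bytes and clamp the result to [-128, 127].
   sat_add_i8 is an illustrative name, not part of the assignment. */
int8_t sat_add_i8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;          /* exact sum, range [-256, 254] */
    if (sum > INT8_MAX) sum = INT8_MAX; /* clamp positive overflow */
    if (sum < INT8_MIN) sum = INT8_MIN; /* clamp negative overflow */
    return (int8_t)sum;
}
```

The vector version has to apply this to each of the eight bytes packed into the uint64_t independently.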

The biggest challenge turns out to be the restrictions imposed:

  • No conditional instructions except ret; no jumps, cmove, set, etc.
  • The solution cannot be longer than 48 instructions (there exists a solution shorter than 37 instructions)

I can't think of any solution that fits these restrictions. Could anyone give me some hints? Examples in C are welcome.

I can use only "standard", transfer, arithmetic, logical instructions and standard registers:

  • mov cbw/cwde/cdqe cwd/cdq/cqo movzx movsx
  • add sub imul mul idiv div inc dec neg
  • and or xor not sar sarx shr shrx shl shlx ror rol
  • lea ret



I wrote it in C++ like this:

#include <cstdint>

uint64_t add(uint64_t a, uint64_t b) {
    uint64_t asigns = a & 0x8080808080808080L;
    uint64_t bsigns = b & 0x8080808080808080L;
    uint64_t sum = (a^asigns) + (b^bsigns);
    // fix up 8 bit wrapped sums
    sum ^= asigns ^ bsigns;
    uint64_t sumsigns = sum & 0x8080808080808080L;
    // we saturate high when a and b were positive, but the result is negative
    uint64_t sat = sumsigns & ~(asigns|bsigns);
    sum |= (sat>>7)*127;
    sum &= ~sat;
    // we saturate negative when a and b were negative, but the result is positive
    sat = (asigns&bsigns) & ~sumsigns;
    sum &= ~((sat>>7)*127);
    sum |= sat;
    return sum;
}

Then I went to see what various compilers generate. clang-16 gives 33 instructions:

add(unsigned long, unsigned long):
        movabs  rdx, -9187201950435737472
        mov     rax, rdi
        and     rax, rdx
        mov     rcx, rsi
        and     rcx, rdx
        movabs  r8, 9187201950435737471
        mov     r9, rdi
        and     r9, r8
        and     r8, rsi
        add     r8, r9
        xor     rax, rcx
        xor     rax, r8
        or      rsi, rdi
        not     rsi
        and     rdx, rsi
        and     rdx, r8
        mov     rsi, rdx
        shr     rsi, 7
        mov     r8, rdx
        sub     r8, rsi
        or      r8, rax
        xor     r8, rdx
        not     rax
        and     rcx, rdi
        and     rcx, rax
        mov     rdx, rcx
        shr     rdx, 7
        mov     rax, rcx
        sub     rax, rdx
        not     rax
        and     rax, r8
        or      rax, rcx
        ret

You can try the various other options.


The following code uses a pedestrian approach to byte-wise addition with signed saturation, but is very competitive in terms of instruction count and execution time with Falk Hüffner's excellent algorithm.

To avoid crossing byte-lane boundaries, the classical approach for emulated SIMD arithmetic is to perform the computation separately for the low-order seven bits and the most significant bit of each byte, then merge the partial results. In this case this also helps with detecting signed integer overflow, one definition of which is that the carry-in to the most significant bit differs from the carry-out from that bit.
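
As a rough illustration of that split, here is a minimal C sketch of a byte-wise wrapping add (no saturation yet). The helper names `swar_add8` and `swar_carry_into_msb` are hypothetical and exist only for this example.

```c
#include <stdint.h>

#define MSB_MASK  0x8080808080808080ULL
#define NMSB_MASK 0x7f7f7f7f7f7f7f7fULL

/* Byte-wise wrapping add of eight lanes packed into a uint64_t.
   swar_add8 is a hypothetical helper name. */
uint64_t swar_add8(uint64_t a, uint64_t b)
{
    /* Low seven bits: the sum of two 7-bit values fits in 8 bits, so a
       carry can reach bit 7 of its own lane but never the next lane. */
    uint64_t low = (a & NMSB_MASK) + (b & NMSB_MASK);
    /* Most significant bits: add without carry-out, which is XOR. */
    uint64_t msb = (a ^ b) & MSB_MASK;
    return low ^ msb;
}

/* Carry-in to bit 7 of each lane: it is bit 7 of the 7-bit partial sum. */
uint64_t swar_carry_into_msb(uint64_t a, uint64_t b)
{
    return ((a & NMSB_MASK) + (b & NMSB_MASK)) & MSB_MASK;
}
```

For example, a plain 64-bit add of 0x00ff and 0x0001 gives 0x0100, leaking a carry into the second lane, while `swar_add8` returns 0x0000 for the same inputs.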

Signed integer overflow in addition can only occur when the signs of the addends are the same. If overflow occurs, the byte-size special result (spc in the code below) is either 0x7f or 0x80, and this can therefore be computed from the sign of either addend.
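
A small C sketch of these two ingredients, under the assumption that `res` already holds the byte-wise wrapped sum of `a` and `b`. The function names `overflow_bits` and `special_value` are made up for illustration; the expressions mirror the `ofl` and `spc` comments in the assembly listing further down.

```c
#include <stdint.h>

#define MSB_MASK  0x8080808080808080ULL
#define NMSB_MASK 0x7f7f7f7f7f7f7f7fULL

/* Bit 7 of each lane is set where a + b overflows int8_t: the addends
   agree in sign and the wrapped sum res disagrees with them. */
uint64_t overflow_bits(uint64_t a, uint64_t b, uint64_t res)
{
    return (res ^ a) & ~(a ^ b) & MSB_MASK;
}

/* Saturation value per lane, derived from the sign of a alone:
   0x7f where a >= 0, 0x80 where a < 0. */
uint64_t special_value(uint64_t a)
{
    return ((a & MSB_MASK) >> 7) + NMSB_MASK;
}
```

With a = 0x50 (80) and b = 0x3c (60) the wrapped sum is 0x8c, so `overflow_bits` reports 0x80 for the lowest lane and `special_value` supplies 0x7f there.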

The overflow flag is expanded into a full-byte mask of all-zeros or all-ones, and this is used to select either the regular addition result (res in code below) or the special overflow result in a traditional multiplexing idiom.
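
The mask expansion and the multiplexing step can be sketched in C as follows. The names `expand_mask` and `select_bytes` are hypothetical, and `ofl` is assumed to contain nothing but the bit-7 overflow flags.

```c
#include <stdint.h>

/* Expand bit 7 of each lane (ofl has only bits 7 set) into 0xff or 0x00.
   Per lane: (0x80 << 1) - (0x80 >> 7) = 0x100 - 0x01 = 0xff. */
uint64_t expand_mask(uint64_t ofl)
{
    return (ofl << 1) - (ofl >> 7);
}

/* Classic bit-wise multiplexer: take spc where mask is set, res elsewhere. */
uint64_t select_bytes(uint64_t mask, uint64_t spc, uint64_t res)
{
    return (spc & mask) | (res & ~mask);
}
```

For the topmost lane the shifted-out 0x100 wraps to zero modulo 2^64, and subtracting 0x01 in that lane still yields 0xff, so no special case is needed.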

The question lists various instructions from the BMI2 instruction set extension (introduced in 2013) as permissible, so I will assume that use of the andn instruction from the BMI1 extension is likewise allowed, although it is not explicitly listed in the question.

I developed my implementation epaddsb on a Windows 10 machine, and the code therefore uses the Windows calling convention for x86-64. Changing this for the System V ABI used by Linux is trivial: simply exchange a few register names. For a comparison with Falk Hüffner's algorithm I compiled his C code with a recent Intel oneAPI compiler and captured the generated code in hpaddsb.

epaddsb requires 21 instructions without ret, while hpaddsb requires 20 instructions without ret. The performance of the two variants is identical within measurement noise level of ±2% on my PC based on a Skylake CPU.

PUBLIC  epaddsb

        ALIGN 16

;; epaddsb(a,b): emulated byte-wise 64-bit addition with signed saturation
;; Windows x86-64 calling convention:
;; function arguments: rcx, rdx, {r8, r9}
;; function return value: rax
;; scratch registers: rax, rcx, rdx, r8, r9, {r10, r11}

epaddsb PROC
        mov  rax, 7f7f7f7f7f7f7f7fh ; NMSB_MASK = ~MSB_MASK
        mov  r8, rcx                ; a
        mov  r9, rdx                ; b
        and  rcx, rax               ; a & NMSB_MASK
        and  rdx, rax               ; b & NMSB_MASK
        xor  r9, r8                 ; sum = a ^ b
        add  rdx, rcx               ; res = (a & NMSB_MASK) + (b & NMSB_MASK)
        andn rcx, rax, r8           ; a & ~NMSB_MASK
        xor  r8, rdx                ; res ^ a
        shr  rcx, 7                 ; (a & ~NMSB_MASK) >> 7
        andn r8, r9, r8             ; ofl = (res ^ a) & ~sum
        add  rcx, rax               ; spc = ((a & ~ NMSB_MASK) >> 7) + NMSB_MASK
        andn r9, rax, r9            ; sum & ~NMSB_MASK
        xor  rdx, r9                ; res = res ^ (sum & ~NMSB_MASK)
        andn r8, rax, r8            ; ofl & ~NMSB_MASK
        lea  r9, [r8 + r8]          ; ofl << 1
        shr  r8, 7                  ; ofl >> 7
        sub  r9, r8                 ; mask = (ofl << 1) - (ofl >> 7)
        andn rax, r9, rdx           ; res & ~mask
        and  rcx, r9                ; spc & mask
        or   rax, rcx               ; res = (spc & mask) | (res & ~mask)
        ret
epaddsb ENDP

        ALIGN 16

;; Falk Hüffner's algorithm from
;; Compiled by Intel(R) oneAPI DPC++/C++ compiler version 2023.0.0

hpaddsb PROC
        mov  rax, rdx              ;
        xor  rax, rcx              ;
        mov  r8, 8080808080808080h ;
        andn r9, rax, r8           ;
        mov  r10, 7f7f7f7f7f7f7f7fh;
        and  rcx, r10              ;
        and  r10, rdx              ;
        add  r10, rcx              ;
        xor  rdx, r10              ;
        and  rdx, r9               ;
        lea  rax, [rdx + rdx]      ;
        shr  rdx, 7                ;
        xor  r10, r9               ;
        not  rax                   ;
        add  rax, rdx              ;
        and  rax, r10              ;
        xor  rax, r8               ;
        shr  r10, 7                ;
        andn rcx, r10, rdx         ;
        sub  rax, rcx              ;
        ret
hpaddsb ENDP

        ALIGN 16



I am showing my test scaffolding below. I built it as follows

ml64 /c /Fo paddsb.obj paddsb.asm
icx /W4 /Ox /QxHOST paddsb_stackoverflow.c paddsb.obj

using Microsoft Macro Assembler 14.27.29112.0 and Intel oneAPI DPC++/C++ Compiler 2023.0.0.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NBR_TEST_CASES     (1000000000)
#define TEST_HUEFFNER_ALGO (0)  /* assumed default: 1 tests hpaddsb, 0 tests epaddsb */

/* emulated byte-wise 64-bit addition with signed saturation; in assembly */
extern uint64_t epaddsb (uint64_t a, uint64_t b); /* algorithm: N. Juffa */
extern uint64_t hpaddsb (uint64_t a, uint64_t b); /* algorithm: F. Hüffner */

/* reference function for byte-wise addition with signed saturation */
uint64_t paddsb_ref (uint64_t a, uint64_t b)
{
    int8_t a0 = (int8_t)(uint8_t)(a >>  0);
    int8_t a1 = (int8_t)(uint8_t)(a >>  8);
    int8_t a2 = (int8_t)(uint8_t)(a >> 16);
    int8_t a3 = (int8_t)(uint8_t)(a >> 24);
    int8_t a4 = (int8_t)(uint8_t)(a >> 32);
    int8_t a5 = (int8_t)(uint8_t)(a >> 40);
    int8_t a6 = (int8_t)(uint8_t)(a >> 48);
    int8_t a7 = (int8_t)(uint8_t)(a >> 56);
    int8_t b0 = (int8_t)(uint8_t)(b >>  0);
    int8_t b1 = (int8_t)(uint8_t)(b >>  8);
    int8_t b2 = (int8_t)(uint8_t)(b >> 16);
    int8_t b3 = (int8_t)(uint8_t)(b >> 24);
    int8_t b4 = (int8_t)(uint8_t)(b >> 32);
    int8_t b5 = (int8_t)(uint8_t)(b >> 40);
    int8_t b6 = (int8_t)(uint8_t)(b >> 48);
    int8_t b7 = (int8_t)(uint8_t)(b >> 56);
    b0 = ((a0 + b0) > 127) ? 127 : (((a0 + b0) < (-128)) ? (-128) : (a0 + b0));
    b1 = ((a1 + b1) > 127) ? 127 : (((a1 + b1) < (-128)) ? (-128) : (a1 + b1));
    b2 = ((a2 + b2) > 127) ? 127 : (((a2 + b2) < (-128)) ? (-128) : (a2 + b2));
    b3 = ((a3 + b3) > 127) ? 127 : (((a3 + b3) < (-128)) ? (-128) : (a3 + b3));
    b4 = ((a4 + b4) > 127) ? 127 : (((a4 + b4) < (-128)) ? (-128) : (a4 + b4));
    b5 = ((a5 + b5) > 127) ? 127 : (((a5 + b5) < (-128)) ? (-128) : (a5 + b5));
    b6 = ((a6 + b6) > 127) ? 127 : (((a6 + b6) < (-128)) ? (-128) : (a6 + b6));
    b7 = ((a7 + b7) > 127) ? 127 : (((a7 + b7) < (-128)) ? (-128) : (a7 + b7));
    return (((uint64_t)(uint8_t)b0 <<  0) | ((uint64_t)(uint8_t)b1 <<  8) |
            ((uint64_t)(uint8_t)b2 << 16) | ((uint64_t)(uint8_t)b3 << 24) |
            ((uint64_t)(uint8_t)b4 << 32) | ((uint64_t)(uint8_t)b5 << 40) |
            ((uint64_t)(uint8_t)b6 << 48) | ((uint64_t)(uint8_t)b7 << 56));
}

/* George Marsaglia's 64-bit KISS pseudo-random number generator;
   reference: !original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J */
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

int main (void)
{
    uint64_t res, ref, a, b, count = 0;

    printf ("Testing %s's algo\n", TEST_HUEFFNER_ALGO ? "Hueffner" : "Juffa");
    do {
        a = KISS64;
        b = KISS64;
        ref = paddsb_ref (a, b);
#if TEST_HUEFFNER_ALGO
        res = hpaddsb (a, b);
#else
        res = epaddsb (a, b);
#endif
        if (res != ref) {
            printf ("error @ a=%016llx b=%016llx:  res=%016llx  ref=%016llx\n",
                    a, b, res, ref);
            return EXIT_FAILURE;
        }
        count++;
    } while (count < NBR_TEST_CASES);
    printf ("test passed\n");
    return EXIT_SUCCESS;
}

Here is a version (tested and does not require imul) that takes 22 instructions when compiled with clang-16.

uint64_t add(uint64_t x, uint64_t y) {
    uint64_t eq, xv, yv, satmask, satbits, satadd, t0, t1;
    uint64_t signmask = 0x8080808080808080ULL;

    eq = (x ^ ~y) & signmask;
    xv = x & ~signmask;
    yv = y & ~signmask;
    xv += yv;
    satbits = (xv ^ y) & eq;
    satadd = satbits >> 7;
    satmask = (satbits << 1) - satadd;
    xv ^= eq;
    t0 = (xv & ~satmask) ^ signmask;
    t1 = satadd & ~(xv >> 7);
    return t0 - t1;
}


mov     rdx, rsi
xor     rdx, rdi
not     rdx
movabs  r8, -9187201950435737472
and     rdx, r8
movabs  rcx, 9187201950435737471
and     rdi, rcx
and     rcx, rsi
add     rcx, rdi
xor     rsi, rcx
and     rsi, rdx
lea     rax, [rsi + rsi]
shr     rsi, 7
xor     rcx, rdx
not     rax
add     rax, rsi
and     rax, rcx
xor     rax, r8
shr     rcx, 7
not     rcx
and     rcx, rsi
sub     rax, rcx

Use the paddsb instruction to add vectors of bytes with signed saturation. An implementation could look like this (assuming the amd64 SysV ABI):

    movq    %rdi, %mm0  # move the first operand to an MMX register
    movq    %rsi, %mm1  # move the second operand to an MMX register
    paddsb  %mm1, %mm0  # packed add bytes with signed saturation
    movq    %mm0, %rax  # move the result back to a scalar register
    emms                # end MMX mode
    ret                 # return to caller
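
For comparison, the same idea can be written in C with compiler intrinsics. This sketch swaps MMX for the SSE2 form of paddsb (`_mm_adds_epi8`), which operates on XMM registers and needs no emms; it assumes an x86-64 target, and the function name `paddsb_sse2` is only illustrative.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Byte-wise saturating add of eight int8_t lanes using the SSE2 form of
   paddsb. paddsb_sse2 is an illustrative name. */
uint64_t paddsb_sse2(uint64_t a, uint64_t b)
{
    __m128i va = _mm_cvtsi64_si128((long long)a);   /* low 64 bits = a */
    __m128i vb = _mm_cvtsi64_si128((long long)b);   /* low 64 bits = b */
    __m128i vs = _mm_adds_epi8(va, vb);             /* paddsb xmm, xmm */
    return (uint64_t)_mm_cvtsi128_si64(vs);         /* low 64 bits of sum */
}
```

The upper eight lanes of the XMM registers are zero on input and are simply discarded on output.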

Without MMX, the following approach can be used. The idea is to perform the following algorithm on all bytes in parallel with SWAR techniques:

int8_t addsb(int8_t a, int8_t b) {
    int8_t q = a + b;

    /* can the addition overflow (are a and b of the same sign)? */
    if (((a ^ b) & 0x80) == 0) {
        /* is the result of different sign? */
        if (((a ^ q) & 0x80) != 0) {
            /* if yes, overflow occurred */
            return (a & 0x80 ? 0x80 : 0x7f);
        }
    }

    return (q);
}

The following code is untested but should work:

paddsb: mov     $0x7f7f7f7f7f7f7f7f, %rdx       # low-7-bit masks
        mov     %rdi, %rax
        and     %rdx, %rax                      # a with sign bits cleared
        mov     %rsi, %rcx
        and     %rdx, %rcx                      # b with sign bits cleared
        add     %rcx, %rax                      # 7-bit sums, carries stay inside each byte
        not     %rdx                            # sign bit masks (0x80 in each byte)
        mov     %rdi, %rcx
        xor     %rsi, %rcx                      # a ^ b
        mov     %rcx, %r8
        and     %rdx, %r8                       # sign bits of a ^ b
        xor     %r8, %rax                       # q = a + b, wrapped per byte
        not     %rcx                            # ~(a ^ b)
        xor     %rax, %rdi                      # a ^ q
        and     %rcx, %rdi                      # bit 7 set where overflow
        shr     $7, %rdi                        # bit 0 set where overflow
        mov     $0x0101010101010101, %rcx       # LSB bit masks
        and     %rcx, %rdi                      # 0x01 where overflow, 0x00 where not
        imul    $0xff, %rdi, %rdi               # 0xff where overflow, 0x00 where not
        shr     $7, %rsi
        and     %rcx, %rsi                      # 0x01 where b negative, 0x00 where not
        mov     $0x7f7f7f7f7f7f7f7f, %rdx
        add     %rsi, %rdx                      # 0x80 where b negative, 0x7f where not
        and     %rdi, %rdx                      # masked to only where overflown
        not     %rdi                            # 0x00 where overflow, 0xff where not
        and     %rdi, %rax                      # q masked to only where not overflown
        or      %rdx, %rax                      # signed sum of a and b
        ret

Note that some extra processing is needed to avoid carry out from one byte to the next.