If I have a 64-bit integer that I'm interpreting as an array of packed 8-bit integers with 8 elements. I need to subtract the constant 1 from each packed integer while handling overflow without the result of one element affecting the result of another element.
I have this code at the moment and it works but I need a solution that does the subtraction of each packed 8-bit integer in parallel and doesn't make memory accesses.  On x86 I could use SIMD instructions like psubb that subtracts packed 8-bit integers in parallel but the platform I'm coding for doesn't support SIMD instructions.  (RISC-V in this case).
So I'm trying to do SWAR (SIMD within a register) to manually cancel out carry propagation between bytes of a uint64_t, doing something equivalent to this:
uint64_t sub(uint64_t arg) {
    uint8_t* packed = (uint8_t*) &arg;
    for (size_t i = 0; i < sizeof(uint64_t); ++i) {
        packed[i] -= 1;
    }
    return arg;
}
I think you could do this with bitwise operators but I'm not sure. I'm looking for a solution that doesn't use SIMD instructions. I'm looking for a solution in C or C++ that's quite portable or just the theory behind it so I can implement my own solution.
 
                        
If you have a CPU with efficient SIMD instructions, SSE/MMX
paddb(_mm_add_epi8) is also viable. Peter Cordes' answer also describes GNU C (gcc/clang) vector syntax, and safety for strict-aliasing UB. I strongly encourage reviewing that answer as well.Doing it yourself with
uint64_tis fully portable, but still requires care to avoid alignment problems and strict-aliasing UB when accessing auint8_tarray with auint64_t*. You left that part out of the question by starting with your data in auint64_talready, but for GNU C amay_aliastypedef solves the problem (see Peter's answer for that ormemcpy).Otherwise you could allocate / declare your data as
uint64_tand access it viauint8_t*when you want individual bytes.unsigned char*is allowed to alias anything so that sidesteps the problem for the specific case of 8-bit elements. (Ifuint8_texists at all, it's probably safe to assume it's anunsigned char.)Note that this is a change from a prior incorrect algorithm (see revision history).
This is possible without looping for arbitrary subtraction, and gets more efficient for a known constant like
1in each byte. The main trick is to prevent carry-out from each byte by setting the high bit, then correct the subtraction result.We are going to slightly optimize the subtraction technique given here. They define:
with
Hdefined as0x8080808080808080U(i.e. the MSBs of each packed integer). For a decrement,yis0x0101010101010101U.We know that
yhas all of its MSBs clear, so we can skip one of the mask steps (i.e.y & ~His the same asyin our case). The calculation proceeds as follows:xto 1, so that a borrow cannot propagate past the MSB to the next component. Call this the adjusted input.0x01010101010101from the corrected input. This does not cause inter-component borrows thanks to step 1. Call this the adjusted output.The operation can be written as:
Preferably, this is inlined by the compiler (use compiler directives to force this), or the expression is written inline as part of another function.
Testcases:
Performance details
Here's the x86_64 assembly for a single invocation of the function. For better performance it should be inlined with the hope that the constants can live in a register as long as possible. In a tight loop where the constants live in a register, the actual decrement takes five instructions: or+not+and+add+xor after optimization. I don't see alternatives that would beat the compiler's optimization.
With some IACA testing of the following snippet:
we can show that on a Skylake machine, performing the decrement, xor, and compare+jump can be performed at just under 5 cycles per iteration:
(Of course, on x86-64 you'd just load or
movqinto an XMM reg forpaddb, so it might be more interesting to look at how it compiles for an ISA like RISC-V.)