If I have a 64-bit integer that I'm interpreting as an array of packed 8-bit integers with 8 elements. I need to subtract the constant 1
from each packed integer while handling overflow without the result of one element affecting the result of another element.
I have this code at the moment and it works but I need a solution that does the subtraction of each packed 8-bit integer in parallel and doesn't make memory accesses. On x86 I could use SIMD instructions like psubb
that subtracts packed 8-bit integers in parallel but the platform I'm coding for doesn't support SIMD instructions. (RISC-V in this case).
So I'm trying to do SWAR (SIMD within a register) to manually cancel out carry propagation between bytes of a uint64_t
, doing something equivalent to this:
uint64_t sub(uint64_t arg) {
uint8_t* packed = (uint8_t*) &arg;
for (size_t i = 0; i < sizeof(uint64_t); ++i) {
packed[i] -= 1;
}
return arg;
}
I think you could do this with bitwise operators but I'm not sure. I'm looking for a solution that doesn't use SIMD instructions. I'm looking for a solution in C or C++ that's quite portable or just the theory behind it so I can implement my own solution.
If you have a CPU with efficient SIMD instructions, SSE/MMX
paddb
(_mm_add_epi8
) is also viable. Peter Cordes' answer also describes GNU C (gcc/clang) vector syntax, and safety for strict-aliasing UB. I strongly encourage reviewing that answer as well.Doing it yourself with
uint64_t
is fully portable, but still requires care to avoid alignment problems and strict-aliasing UB when accessing auint8_t
array with auint64_t*
. You left that part out of the question by starting with your data in auint64_t
already, but for GNU C amay_alias
typedef solves the problem (see Peter's answer for that ormemcpy
).Otherwise you could allocate / declare your data as
uint64_t
and access it viauint8_t*
when you want individual bytes.unsigned char*
is allowed to alias anything so that sidesteps the problem for the specific case of 8-bit elements. (Ifuint8_t
exists at all, it's probably safe to assume it's anunsigned char
.)Note that this is a change from a prior incorrect algorithm (see revision history).
This is possible without looping for arbitrary subtraction, and gets more efficient for a known constant like
1
in each byte. The main trick is to prevent carry-out from each byte by setting the high bit, then correct the subtraction result.We are going to slightly optimize the subtraction technique given here. They define:
with
H
defined as0x8080808080808080U
(i.e. the MSBs of each packed integer). For a decrement,y
is0x0101010101010101U
.We know that
y
has all of its MSBs clear, so we can skip one of the mask steps (i.e.y & ~H
is the same asy
in our case). The calculation proceeds as follows:x
to 1, so that a borrow cannot propagate past the MSB to the next component. Call this the adjusted input.0x01010101010101
from the corrected input. This does not cause inter-component borrows thanks to step 1. Call this the adjusted output.The operation can be written as:
Preferably, this is inlined by the compiler (use compiler directives to force this), or the expression is written inline as part of another function.
Testcases:
Performance details
Here's the x86_64 assembly for a single invocation of the function. For better performance it should be inlined with the hope that the constants can live in a register as long as possible. In a tight loop where the constants live in a register, the actual decrement takes five instructions: or+not+and+add+xor after optimization. I don't see alternatives that would beat the compiler's optimization.
With some IACA testing of the following snippet:
we can show that on a Skylake machine, performing the decrement, xor, and compare+jump can be performed at just under 5 cycles per iteration:
(Of course, on x86-64 you'd just load or
movq
into an XMM reg forpaddb
, so it might be more interesting to look at how it compiles for an ISA like RISC-V.)