This code sample takes an arbitrarily large number of bytes as inputs, one at the time, and maps them to a table with 32 values [0,31]. To simplify the example I used mod 32. The >> 3 operation is equivalent to division by 8. In case it is not clear, the "q" loop counts to 5 because numbers from 0 to 31 only require 5-bits of resolution.
Can this be improved in terms of speed?
uint8_t sto[10000];
uint8_t srt[32] = {<32 values in total>};
uint8_t t,q
uint64_t loop,cr = 0;
for(loop = 0; loop < 10000; loop++) {
t = srt[loop % 32];
for(q = 0; q < 5; q++) {
if(t & (0x1 << (4 - q)))
sto[cr >> 3] |= (0x1 << (7 - (cr % 8)));
cr++;
}
}
The code can be reduced to simply:
after doing some work to prepare
bits. That loop can be unrolled, can be converted to units wider thanuint8_t, can be converted to SIMD code, and can be parallelized. After a few of those, such as converting to wider units, it might stream at memory bandwidth, so further optimization might not be useful.The work to prepare
bitsis to consolidate the bits fromsrtinto a full bitmask: