My task is to check (>trillions checks), does two int contain any of predefined pairs of nibbles (first pair 0x2 0x7; second 0xd 0x8). For example:
bit offset: 12345678
first int: 0x3d542783 first pair of 0x2 second: 0xd
second int: 0x486378d9 nibbles: 0x7 pair: 0x8
^ ^
So, for this example I mark two offsets with needed pairs (offsets are 2 and 5; but not a 7). Actual offsets and number of found pair are not needed in my task.
So, for given two ints the question is: Does them contains the any of these pairs of nibbles at the same offset.
I checked my program, this part is the hottest place (gprof
proven); and it is called a very-very many times (gcov
proven). Actually it is the 3rd or 4th loop (most nested) of nested loops.
My current code is slow (I rewrite it as function, but it is a code from the inner loop):
static inline int nibble_check (uint32_t A, uint32_t B)
__attribute__((always_inline))
{
int i;
for(i=0;i<8;i++)
if( ( ( (A&0xf) ==0xD) && ( (B&0xf) ==0x8) ) // first pair
|| ( ( (A&0xf) ==0x2) && ( (B&0xf) ==0x7) ) ) // second pair
return 1; // nibbles found
else {
A>>=4;
B>>=4;
}
return 0; // nibbles not found
}
The other task is finding this pairs not only at offsets 0,4,8 bits and so on, but at offsets 0,2,4,8,10,... bits:
#define douburu_nibble_check(A,B) (nibble_check(A,B) || nibble_check(A>>2, B>>2) )
Is it possible to rewrite this function and macro in parallel way?
My compiler is gcc452 and cpu is Intel Core2 Solo in 32bit mode (x86).
There are tricks for testing for a zero byte in a word (see e.g. http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord); a fast method is to use this expression:
which evaluates to some non-zero value if any of the 4 bytes within the 32-bit word are 0, or 0 otherwise.
This method can be adapted to work here: