Most efficient formula for unpacking 16-bit BCD? (e.g. 0x1234 to 0x01020304)


Is there a bit twiddling hack for efficiently unpacking a 16-bit packed BCD number?

Doing it the pedestrian way requires 10 operations (3 shifts, 4 ANDs and 3 ORs or ADDs):

x = (bcd & 0xF000) << 12
  | (bcd & 0x0F00) <<  8
  | (bcd & 0x00F0) <<  4
  | (bcd & 0x000F)

With a multi-way ADD/OR the critical path length would be 3, but these operations tend to be binary, so most CPUs would be looking at a critical path of length 4 (AND, shift, OR, OR).

Can this be done more efficiently?

Note: for some purposes it can be equally useful if some permutation of the nibbles can be unpacked especially efficiently, like if the word to be unpacked comes from a lookup table over whose creation I have full control (so that I can stick each digit wherever I want). The purpose of using packed instead of unpacked BCD in this case would be to halve the memory pressure and to avoid exceeding the size of the L1 cache, taking some load off an over-saturated memory subsystem by increasing the load on the CPU's ALUs.

For example, if I permute the digits like 0x1324 then a simple de-interleave yields 0x01020304:

x = ((bcd << 12) | bcd) & 0x0F0F0F0F

That's just three operations with critical path length 3, quite an improvement over the original version...
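
To make the permuted layout concrete, here is a minimal sketch of both sides: building a table entry in the 0x1324 digit order and unpacking it again. The function names `bcd_pack_permuted` and `bcd_unpack_permuted` are hypothetical, chosen only for this illustration, and the packer assumes every digit is already in the range 0..9.

```c
#include <stdint.h>

/* Hypothetical table-building helper: stores the digits d3 d2 d1 d0
   (each assumed to be 0..9) in the permuted order d3 d1 d2 d0, so the
   digits 1,2,3,4 become 0x1324. */
uint16_t bcd_pack_permuted(unsigned d3, unsigned d2, unsigned d1, unsigned d0)
{
    return (uint16_t)((d3 << 12) | (d1 << 8) | (d2 << 4) | d0);
}

/* De-interleave the permuted word: one shift, one OR, one AND. */
uint32_t bcd_unpack_permuted(uint16_t p)
{
    uint32_t x = p;  /* widen first so the shift happens in 32 bits */
    return ((x << 12) | x) & 0x0F0F0F0Fu;
}
```

The permutation cost is paid once when the table is created; each lookup then only needs the three-operation unpack.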


user448810:

Use the DoubleDabble algorithm.

harold:

Here is an alternative with fewer operations (8 instead of 10) but a longer critical path. It is based on the binary decomposition of the distance each nibble has to move: nibbles that must move by 8 or 12 bits are first moved together by 8, then nibbles that must move by 4 or 12 bits are moved together by 4.

x = bcd
x = ((x & 0xFF00) << 8) | (x & 0xFF)
x = ((x & 0x00F000F0) << 4) | (x & 0x000F000F)

For example:

// start
0000ABCD
// move A and B by 8
00AB00CD
// move A and C by 4
0A0B0C0D
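
The two steps above could be wrapped in a C function along these lines. This is only a sketch: the name `bcd_spread_log` is made up for this example, and the input is assumed to be a 16-bit packed BCD value held in a 32-bit variable.

```c
#include <stdint.h>

/* Spread four packed nibbles 0000ABCD into 0A0B0C0D in two rounds,
   following the binary decomposition of the move distances. */
uint32_t bcd_spread_log(uint32_t x)
{
    /* round 1: move the upper byte (A and B) left by 8 -> 00AB00CD */
    x = ((x & 0x0000FF00u) << 8) | (x & 0x000000FFu);
    /* round 2: move the high nibble of each byte (A and C) left by 4 -> 0A0B0C0D */
    x = ((x & 0x00F000F0u) << 4) | (x & 0x000F000Fu);
    return x;
}
```

Counted as written, this is 2 shifts, 4 ANDs and 2 ORs. Each round depends on the result of the previous one, so the two AND, shift, OR chains run back to back.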

njuffa:

The most efficient solution will be machine specific, as different ISAs have different capabilities when it comes to dealing with immediate constants, or combining shifts with ALU operations. Here is an alternative implementation with good instruction-level parallelism that may be superior on platforms with a very fast integer multiply. Integer multiply is often helpful for bit twiddling algorithms by performing multiple shift-add operations in parallel.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* reference implementation */
uint32_t bcd_spread_1 (uint32_t a)
{
    return (((a & 0xF000) << 12) |
            ((a & 0x0F00) <<  8) |
            ((a & 0x00F0) <<  4) |
            ((a & 0x000F) <<  0));
}

/* alternative implementation */
uint32_t bcd_spread_2 (uint32_t a)
{
    return ((((a & 0xf0f0) * 0x1010) & 0x0f000f00) |
            (((a & 0x0f0f) * 0x0101) & 0x000f000f));
}

/* bitwise majority of three words; helper for bcd_add below */
uint32_t median (uint32_t x, uint32_t y, uint32_t z)
{
    return (x & (y | z)) | (y & z);
}

/* BCD addition. Knuth TAOCP 4 */
/* used by the test loop to step through valid packed BCD values only */
uint32_t bcd_add (uint32_t x, uint32_t y)
{
    uint32_t z, u, t;
    z = y + 0x66666666;
    u = x + z;
    t = median (~x, ~z, u) & 0x88888888;
    return u - t + (t >> 2);
}

int main (void)
{
    uint32_t x, y, bcd = 0;
    /* compare both versions for every valid packed BCD value 0x0000..0x9999 */
    do {
        x = bcd_spread_1 (bcd);
        y = bcd_spread_2 (bcd);
        if (x != y) {
            printf ("!!!! bcd=%04x x=%08x y=%08x\n", bcd, x, y);
            return EXIT_FAILURE;
        }
        bcd = bcd_add (bcd, 1);
    } while (bcd < 0x10000);
    return EXIT_SUCCESS;
}
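
To see why the masked multiplies work, note that 0x1010 = (1 << 12) + (1 << 4) and 0x0101 = (1 << 8) + 1, so each multiply adds two shifted copies of its operand in a single instruction. The sketch below writes that out with explicit shifts and adds. The name `bcd_spread_shift_add` is hypothetical and the function is meant purely as an illustration of the identity, not as a faster variant.

```c
#include <stdint.h>

/* bcd_spread_2 with each multiply expanded into the shift-adds it performs */
uint32_t bcd_spread_shift_add(uint32_t a)
{
    uint32_t hi = a & 0xF0F0u;  /* digits A and C of ABCD */
    uint32_t lo = a & 0x0F0Fu;  /* digits B and D of ABCD */

    /* hi * 0x1010: copies shifted by 12 and by 4; the mask keeps A at
       bits 24..27 and C at bits 8..11 and discards the overlapping sums */
    uint32_t hi_spread = ((hi << 12) + (hi << 4)) & 0x0F000F00u;

    /* lo * 0x0101: copies shifted by 8 and by 0; the mask keeps B at
       bits 16..19 and D at bits 0..3 */
    uint32_t lo_spread = ((lo << 8) + lo) & 0x000F000Fu;

    return hi_spread | lo_spread;
}
```

The partial sums do overlap in the middle bits, but any carry they produce lands below the nibbles that the masks keep, which is why the result matches the reference implementation.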