Hash function to map consecutive integers to non-distinct integers

514 Views Asked by At

I have a sequence of 1760 integers, ranging from 129 to 250, and there is no discernible pattern to these integers. I'm working on a very small embedded system, and can't afford to waste almost 2 KB on a lookup table. I'd like to come up with a function that allows me to lookup a value given an index (in the 0 to 1759 range).

I know that minimal perfect hashing would allow me to map distinct values onto a set of consecutive integers, but I'm looking to map a set of consecutive integers onto non-distinct values.

Is brute force over millions of years the only way to do this? Is there some approach that would allow for a much smaller lookup table (say, around 256 bytes or less)?

1

There are 1 best solutions below

3
On BEST ANSWER

What process generates your 1760 integers? Unfortunately, without knowing a little more about the source of your data it will be difficult (as you say, "millions of years") to find such a function, if it exists. Claude Shannon proved that random noise is at maximum information entropy and therefore impossible to compress. So if there is no discernible pattern to your integers, that indeed qualifies as random noise.


Going back to the lookup table, you can reduce the size of your table by 1/8 by recognizing that your integers are all within the range 129-250 which only requires 7 bits to represent. With some bit manipulation tricks in the table lookup you will only require 1760 * 7/8 = 1540 bytes or a 12.5% savings. It's not a lot but it's a start; here's some sample code to illustrate what I mean.

Sample code

#include <cassert>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <vector>

void compress(const std::vector<uint8_t>& raw, std::vector<uint8_t>& comp) {
    // Length must be a multiple of 8 to handle unrolled loop.
    assert(raw.size() % 8 == 0);

    comp.resize(raw.size() * 7 / 8);
    for (size_t rIdx = 0, cIdx = 0; rIdx < raw.size(); rIdx += 8, cIdx += 7) {
        comp[cIdx + 0] = (raw[rIdx + 0] << 1) | ((raw[rIdx + 1] & 0x7f) >> 6);
        comp[cIdx + 1] = (raw[rIdx + 1] << 2) | ((raw[rIdx + 2] & 0x7f) >> 5);
        comp[cIdx + 2] = (raw[rIdx + 2] << 3) | ((raw[rIdx + 3] & 0x7f) >> 4);
        comp[cIdx + 3] = (raw[rIdx + 3] << 4) | ((raw[rIdx + 4] & 0x7f) >> 3);
        comp[cIdx + 4] = (raw[rIdx + 4] << 5) | ((raw[rIdx + 5] & 0x7f) >> 2);
        comp[cIdx + 5] = (raw[rIdx + 5] << 6) | ((raw[rIdx + 6] & 0x7f) >> 1);
        comp[cIdx + 6] = (raw[rIdx + 6] << 7) | ((raw[rIdx + 7] & 0x7f) >> 0);
    }
}

uint8_t lookup(const std::vector<uint8_t>& comp, size_t rIdx) {
    size_t cIdx = rIdx / 8 * 7;
    switch (rIdx % 8) {
    case 0:
        return                                  (comp[cIdx + 0] >> 1) | 0x80;
    case 1:
        return ((comp[cIdx + 0] & 0x01) << 6) | (comp[cIdx + 1] >> 2) | 0x80;
    case 2:
        return ((comp[cIdx + 1] & 0x03) << 5) | (comp[cIdx + 2] >> 3) | 0x80;
    case 3:
        return ((comp[cIdx + 2] & 0x07) << 4) | (comp[cIdx + 3] >> 4) | 0x80;
    case 4:
        return ((comp[cIdx + 3] & 0x0f) << 3) | (comp[cIdx + 4] >> 5) | 0x80;
    case 5:
        return ((comp[cIdx + 4] & 0x1f) << 2) | (comp[cIdx + 5] >> 6) | 0x80;
    case 6:
        return ((comp[cIdx + 5] & 0x3f) << 1) | (comp[cIdx + 6] >> 7) | 0x80;
    case 7:
        return ((comp[cIdx + 6] & 0x7f) << 0) | 0x80;
    }
}

int main() {
    std::vector<uint8_t> raw { 151, 169, 162, 164, 155, 147, 149, 143, };
    std::vector<uint8_t> comp;

    compress(raw, comp);

    for (size_t i = 0; i < raw.size(); ++i) {
        std::cout << i << ": raw " << static_cast<int>(raw[i])
                  << ", lookup " << static_cast<int>(lookup(comp, i))
                  << std::endl;
    }
    return 0;
}

Output

I'm just printing out the raw and compressed/uncompressed data at each index to verify storage and retrieval.

0: raw 151, lookup 151
1: raw 169, lookup 169
2: raw 162, lookup 162
3: raw 164, lookup 164
4: raw 155, lookup 155
5: raw 147, lookup 147
6: raw 149, lookup 149
7: raw 143, lookup 143

There's some work to be done if your input data length is no longer a multiple of 8 but this should get you started.