Easily Memorable Hash (what 3 words)

314 Views Asked by At

I was hoping to create an easily memorable hash something like 3 random words (what3words), so the idea would be to hash a java object and the result was three random words.

Use case: I have objects with many fields and I need to compress the field into 24 chars (this is the size of a varchar primary key column in a database where these objects are stored, and cannot be changed), the resulting compressed value should also be easily memorable.

Initially, I decided to use 3 different hash functions (namely FNV1a64Hash, CRC32Hash and DJB2) to create 3 pre-hashes and then use those values as indexes into the dictionary however this resulted in a lot of collisions (Random Words tried: 10000000 No of collisions: 9272419). Note my dictionary size is approx 50k words.

Next, I decided to just call hashCode() on the object, then pad the resulting int and finally split it into 3 chunks of 5 digits, unfortunately again a lot of collisions occurred (Random Words tried: 10000000 No of collisions: 9999900). I think this in part could be down to the max size of an int being 2^31 which is only a 10 digit number so the first index was always 00000.

I've also used universal hashing but again I got a fair amount of collisions (Random Words tried: 10000000 No of collisions: 9996436)

I am wondering if I am missing something obvious here or if anyone knows of any well-known algorithms that might be able to help here? Apologies in advance for the noob question, this is the first time I've encountered hashing and there is a whole lot to learn.

I've pasted my code & test code below in case there is something obviously wrong with that.

public static String generate3Words1(Object obj) {
    BigInteger input = BigInteger.valueOf(obj.hashCode());
    int index1 = indexFor(CRC32Hash(input.toByteArray()));
    int index2 = indexFor(FNV1a64Hash(input.toByteArray()));
    int index3 = indexFor(DBJ2(input.toByteArray()));
    return dictionary.get(index1) + "-" + dictionary.get(index2) + "-" + dictionary.get(index3);
}


public String generate3Words2(Object obj) {
       int h = (h = obj.hashCode()) ^ (h >>> 16);
       String i = String.format("%015d", h);      
       String s = dictionary.get(indexFor(Integer.parseInt(i.substring(0, 5)))) + "-" + dictionary.get(indexFor(Integer.parseInt(i.substring(5, 10)))) + "-" + dictionary.get(indexFor(Integer.parseInt(i.substring(10, 15))));
       return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
   }

private static int indexFor(long h) {
    return (int) (h & (ThreeWordHash.dictionary.size() - 1));
}

private static long FNV1a64Hash(byte[] data) {
    long hash = 0xcbf29ce484222325L;
    for (byte datum : data) {
        hash ^= (datum & 0xff);
        hash *= 1099511628211L;
    }
    return hash;
}

private static long CRC32Hash(byte[] data) {
    CRC32.reset();
    CRC32.update(data);
    return CRC32.getValue();
}


private static long DBJ2(byte[] data) {
    long hash = 5381;
    for (byte datum : data) {
        hash = ((hash << 5) + hash) + datum;
    }
    return hash;
}

private static String universalHashing(Object data) {
     int[] hashCodes = new int[NO_OF_WORDS];
     int hashCodeSizeDiff = WORD_SIZE - (WORD_SIZE / 2);
     int hstart = data.hashCode();
     int bmax = 1 << hashCodeSizeDiff;
     for (int i = 0; i < NO_OF_WORDS; i++) {
         hashCodes[i] = (((hstart * (i * 2 + 1)) + RAND.nextInt(bmax)) >> hashCodeSizeDiff) & (ThreeWordHash.dictionary.size() - 1);
     }
     String s = ThreeWordHash.dictionary.get(hashCodes[0]) + " " + ThreeWordHash.dictionary.get(hashCodes[1]) + " " + ThreeWordHash.dictionary.get(hashCodes[2]);
     return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
 }

Test code:

@Test
void generate3Words() {
    List<String> words = new ArrayList<>(TestDictionary.WORDS);
    words.addAll(TestDictionary.WORDS);

    Random random = new Random(1);
    HashSet<String> seen1 = new HashSet<>();
    HashSet<String> seen2 = new HashSet<>();
    
    int count = 0;
    int noOfIterations = 10000000;
    
    //NOTE test dict size approx 4k words
    for (int j = 0; j < noOfIterations; j++) {
        String randomWord =  new StringBuilder()
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size()))).toString();
      
        String res = ThreeWordHash.generate3Words(randomWord);
        
        if (seen1.contains(res) && !seen2.contains(randomWord)) {
            count++;
        }

        seen2.add(randomWord);
        seen1.add(res);
    }
    System.out.println("Random Words tried: " + seen2.size() + " No of collisions: " + count);
}
0

There are 0 best solutions below