Generating a fake ISBN from book title? (Or: How to hash a string into a 6-digit numeric ID)

3.5k Views Asked by At

Short version: How can I turn an arbitrary string into a 6-digit number with minimal collisions?

Long version:

I'm working with a small library that has a bunch of books with no ISBNs. These are usually older, out-of-print titles from tiny publishers that never got an ISBN to begin with, and I'd like to generate fake ISBNs for them to help with barcode scanning and loans.

Technically, real ISBNs are controlled by commercial entities, but it is possible to use the format to assign numbers that belong to no real publisher (and so shouldn't cause any collisions).

The format is such that:

978-0-01-######-?

Gives you 6 digits to work with, from 000000 to 999999, with the ? at the end being a checksum.

Would it be possible to turn an arbitrary book title into a 6-digit number in this scheme with minimal chance of collisions?

2

There are 2 best solutions below

0
On

After using code snippets for making a fixed-length hash and calculating the ISBN-13 checksum, I managed to create really ugly C# code that seems to work. It'll take an arbitrary string and convert it into a valid (but fake) ISBN-13:

       public int GetStableHash(string s)
       {
           uint hash = 0;
           // if you care this can be done much faster with unsafe 
           // using fixed char* reinterpreted as a byte*
           foreach (byte b in System.Text.Encoding.Unicode.GetBytes(s))
           {   
               hash += b;
               hash += (hash << 10);
               hash ^= (hash >> 6);    
           }
           // final avalanche
           hash += (hash << 3);
           hash ^= (hash >> 11);
           hash += (hash << 15);
           // helpfully we only want positive integer < MUST_BE_LESS_THAN
           // so simple truncate cast is ok if not perfect
           return (int)(hash % MUST_BE_LESS_THAN);
       }

       public int CalculateChecksumDigit(ulong n)
       {
           string sTemp = n.ToString();
           int iSum = 0;
           int iDigit = 0;

           // Calculate the checksum digit here.
           for (int i = sTemp.Length; i >= 1; i--)
           {
               iDigit = Convert.ToInt32(sTemp.Substring(i - 1, 1));
               // This appears to be backwards but the 
               // EAN-13 checksum must be calculated
               // this way to be compatible with UPC-A.
               if (i % 2 == 0)
               { // odd  
                   iSum += iDigit * 3;
               }
               else
               { // even
                   iSum += iDigit * 1;
               }
           }
           return (10 - (iSum % 10)) % 10;
       }


       private void generateISBN()
       {
           string titlehash = GetStableHash(BookTitle.Text).ToString("D6");
           string fakeisbn = "978001" + titlehash;
           string check = CalculateChecksumDigit(Convert.ToUInt64(fakeisbn)).ToString();

            SixDigitID.Text = fakeisbn + check;
       }
7
On

The 6 digits allow for about 10M possible values, which should be enough for most internal uses. I would have used a sequence instead in this case, because a 6 digit checksum has relatively high chances of collisions.

So you can insert all strings to a hash, and use the index numbers as the ISBN, either after sorting or without it.
This should make collisions almost impossible, but it requires keeping a number of "allocated" ISBNs to avoid collisions in the future, and keeping the list of titles that are already in store, but it's information that you would most probably want to keep anyway.

Another option is to break the ISBN standard and use hexadecimal/uuencoded barcodes, that may increase the possible range to a point where it may work with a cryptographic hash truncated to fit.

I would suggest that since you are handling old book titles, which may have several editions capitalized and punctuated differently, I would strip punctuation, duplicated whitespaces and convert everything to lowercase before the comparison to minimize the chance of a technical duplicate even though the string is different (Unless you want different editions to have different ISBNs, in that case, you can ignore this paragraph).