Building a crash-tolerant serial number generator via log files

I've got a system that is basically a serial number generator.

There are two main design requirements of the serial number generator (in terms of fault tolerance):

  1. All serial numbers issued must be unique (no duplicates)
  2. If the serial number generator crashes, it needs a way to restart approximately from where it left off (skipping a few serial numbers when recovering from a crash is fine, if necessary, to ensure requirement #1 is upheld)

It can be assumed that serial numbers are issued sequentially (1, 2, 3, 4, etc.)

My first attempt at solving this is for the serial number generator to log every serial number it issues by appending to a single log file. That way, if it crashes, it simply picks up from the last serial number issued and goes along its merry way.
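To make that concrete, here is a minimal sketch of the append-only approach in Java (the file name, the sync-on-write mode, and the recovery-by-maximum scan are my assumptions, not details from the question):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

class AppendLogGenerator {
    private long lastIssued; // set by recover() before first use

    // Append the new serial number and force it to disk before handing it out.
    // The file is reopened per call for simplicity; a real generator would keep it open.
    long issueNext() throws IOException {
        long next = lastIssued + 1;
        try (RandomAccessFile log = new RandomAccessFile("serial.log", "rwd")) {
            log.seek(log.length());
            log.write((next + "\n").getBytes(StandardCharsets.UTF_8));
        }
        lastIssued = next;
        return next;
    }

    // On restart, take the maximum over all complete lines. Scanning for the
    // maximum (rather than trusting the final line) means a torn last write,
    // which may parse as a smaller number, cannot roll the counter backwards.
    void recover() throws IOException {
        long max = 0;
        Path log = Path.of("serial.log");
        if (Files.exists(log)) {
            for (String line : Files.readAllLines(log)) {
                String s = line.trim();
                if (s.matches("\\d+")) max = Math.max(max, Long.parseLong(s));
            }
        }
        lastIssued = max; // the next issueNext() call returns max + 1
    }
}

Because the log write completes before the number is returned to a caller, a crash mid-write only loses a number that was never actually issued.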

Here's the challenge:

So, what are the pros/cons of the following logging approaches:

  1. Have a single log file that is appended to, capped at a certain size
  2. Have two log files, neither of which is appended to. The first log file always contains the most recently issued serial number, and the second contains the 2nd most recently issued one. The files are written in a see-saw fashion: all even serial numbers go to log file 'A', and all odd serial numbers go to log file 'B'. The end result is that you always have the last two serial numbers issued (and no more), the disk space used is tiny, and if the generator happens to crash while logging the most recent serial number, only the 'most recent' file can be corrupted. Since the 2nd most recent serial number survives in the other log file, we can detect the crash and restart the generator from that '2nd most recent' serial number, +2

In a crash scenario where the serial number generator cannot tell which of the 'most recent' or '2nd most recent' log files is the corrupted one, it should be safe to always restart a crashed generator from the non-corrupted serial number +2.
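Here is a rough sketch of option 2. The question doesn't say how corruption would be recognized, so this assumes each file stores the number together with a CRC32 of it; a file that is missing, torn, or fails its checksum is treated as corrupted:

import java.io.IOException;
import java.nio.file.*;
import java.util.zip.CRC32;

class SeeSawLog {
    private static final String[] FILES = {"seqA", "seqB"};

    private static long crc(long value) {
        CRC32 crc = new CRC32();
        crc.update(Long.toString(value).getBytes());
        return crc.getValue();
    }

    // Even serials overwrite seqA, odd serials overwrite seqB. The overwrite is
    // deliberately not atomic: a crash here corrupts at most this one file,
    // which is exactly the failure the scheme is designed to survive.
    static void record(long serial) throws IOException {
        Files.writeString(Path.of(FILES[(int) (serial % 2)]), serial + " " + crc(serial));
    }

    // Return the serial stored in one file, or -1 if it is missing or corrupted.
    private static long readValid(String file) {
        try {
            String[] parts = Files.readString(Path.of(file)).trim().split(" ");
            long value = Long.parseLong(parts[0]);
            return Long.parseLong(parts[1]) == crc(value) ? value : -1;
        } catch (Exception e) {
            return -1;
        }
    }

    // Restart from the newest uncorrupted serial number, +2, as described above.
    static long recover() {
        long best = Math.max(readValid(FILES[0]), readValid(FILES[1]));
        if (best < 0) throw new IllegalStateException("both log files corrupted");
        return best + 2;
    }
}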

Option 1 is a little simpler to implement, but option 2 uses less disk space, and seems more intelligent.

Am I missing anything in terms of designing a system that can reliably recover from a crash via sufficient log files?

There are 2 answers below.

BEST ANSWER

You need to decide on a definition of "close". By that I mean how many numbers you are willing to lose in the event of a crash.

Let's say it's 1000.

Then you keep the largest reserved sequence number in a file.

When it's time to update, you write the new number to a new file, then rename it over the old file. On modern file systems this rename is an atomic operation: it either works or it doesn't, like a commit in a database. Writing to a new file first also guarantees you have room for the new sequence information, and it should fail without harming the current sequence information if something really untoward happens.

If there's a failure, you have to STOP, and abort the sequence generator.

The key here is that the number on the file system is always larger than any issued number. You must guarantee that it never ends up below an already-issued number, or the generator will reuse numbers on restart.

So, here's the procedure.

import java.io.IOException;
import java.nio.file.*;

class SequenceGenerator {
    long currentSeq; // last serial number handed out
    long maxSeq;     // number persisted on disk; always greater than any issued number

    long getNextSequence() throws IOException {
        currentSeq = currentSeq + 1;
        if (currentSeq >= maxSeq) {
            // Reserve the next block of 1000 numbers before issuing past the old bound.
            maxSeq = maxSeq + 1000;
            Files.writeString(Path.of("newSeq"), Long.toString(maxSeq));
            // Atomic rename: "curSeq" holds either the old bound or the new one.
            Files.move(Path.of("newSeq"), Path.of("curSeq"),
                    StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
        }
        return currentSeq;
    }

    void restartSequence() throws IOException {
        maxSeq = Long.parseLong(Files.readString(Path.of("curSeq")).trim());
        currentSeq = maxSeq - 1; // This will immediately create a disk update on first use.
    }
}

There may be an off-by-one error in here; untested.

Addenda:

If you're that worried, you could keep four pieces of data in memory and write out two copies. Or, better, six pieces and three copies.

The data you keep in memory is three copies of the counter, and three checksums of those counters (MD5 of the value perhaps).

Then, when you write them, you use the same technique as above: write, then rename.

But you write the values and the hashes.

The reason you do this is simple.

If a stored value does not match its hash/checksum, you know that pair is in error.

You keep three copies on the premise that a single corruption is possible, and not just on disk but in memory as well; do not disregard potential memory errors (if you want to go paranoid, go all the way, I say). But corruption affecting more than one copy at once is astronomically unlikely.

When you do detect a failed pair, you have three samples to choose from, and each sample is a "vote". Pick the value on which two samples agree as the official value, recover with it, and move on.
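A minimal sketch of that detection-and-voting step, using MD5 as suggested (persisting the six pieces would use the same write-then-rename trick as above):

import java.security.MessageDigest;
import java.util.HexFormat;

class VotingCounter {
    // Three copies of the counter plus a checksum for each copy.
    long[] values = new long[3];
    String[] sums = new String[3];

    static String md5(long value) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(Long.toString(value).getBytes());
        return HexFormat.of().formatHex(digest);
    }

    void set(long value) throws Exception {
        for (int i = 0; i < 3; i++) {
            values[i] = value;
            sums[i] = md5(value);
        }
    }

    // Each copy whose checksum still matches casts a "vote"; any value backed
    // by at least two valid copies wins. One corrupted pair is survivable.
    long recover() throws Exception {
        for (int i = 0; i < 3; i++) {
            if (!md5(values[i]).equals(sums[i])) continue; // corrupted pair
            int votes = 0;
            for (int j = 0; j < 3; j++) {
                if (values[j] == values[i] && md5(values[j]).equals(sums[j])) votes++;
            }
            if (votes >= 2) return values[i];
        }
        throw new IllegalStateException("more than one copy is corrupted");
    }
}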

ANSWER

Before going ahead with any sort of design, I think you really need to define and address the reasons why such a simple piece of software might crash.

Off the top of my head, a few might be: lack of disk space, shoddy coding with unfreed resources, threading issues, etc.

If your goal is simply to make sure a generated serial number is persisted and unique, then I'd probably suggest using something like SQL Server with a uniqueidentifier column defaulting to NEWSEQUENTIALID(). There are certain advantages here, because SQL Server has already solved much of this problem space. The number of transactions per second you could support really comes down to the hardware and how you use it.
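As a sketch of what that looks like (the table and connection details here are made up, and note that NEWSEQUENTIALID() works as a column default rather than a column type):

import java.sql.*;

class DbSerialStore {
    // Hypothetical connection URL; adjust for your environment.
    static final String URL =
            "jdbc:sqlserver://localhost;databaseName=Serials;encrypt=false";

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE SerialNumbers (" +
                " Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY," +
                " IssuedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME())");
            // Each insert receives a server-generated GUID; uniqueness and
            // crash recovery are the database's problem, not yours.
            stmt.execute("INSERT INTO SerialNumbers DEFAULT VALUES");
        }
    }
}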

This was a long-winded way of saying: first establish why you think it will crash, then look at existing technologies to see if they meet your need before you go on to write something like this.

For example: if you're having problems with threads, consider using a web server to handle all of that for you. If you're having problems with disk space, consider upgrading your hardware. If you have problems ensuring persistence, use an SQL server (brand doesn't matter too much) to store the data. If the generator machine is overloaded, consider a different architecture that allows you to load balance the devices intelligently.


One other thing: I think neither of the approaches you've outlined is a good solution. If you are truly generating thousands of serial numbers per second, you might consider load balancing the generation, at which point you'll have serious issues figuring out how to keep regular log files in sync across multiple generation points... which SQL Server is already good at.