I've got a system that is basically a serial number generator.
There are two main design requirements of the serial number generator (in terms of fault tolerance):
- All serial numbers issued must be unique (no duplicates)
- If the serial number generator crashes, it needs a way to restart approximately from where it left off (skipping a few serial numbers when recovering from a crash is fine, if necessary, to ensure requirement #1 is upheld)
It can be assumed that serial numbers are issued sequentially (1, 2, 3, 4, etc.)
My first attempt to solve this is for the serial number generator to log every serial number that is issues by appending to a single log file. This way, if it crashes, it simply picks up from the last serial number issued, and goes along it's merry way.
Heres the challenge:
So, what are the pros/cons of the following logging approaches:
- Have a single log file which is appended, and capped at a certain size
- Have two log files, neither of which are appended. The first log file always contains the most recently issued serial number, and the second file contains the 2nd most recently issued serial number. The log files are written to in a see-saw fashion (i.e. all even serial numbers wind up getting put in log file 'A', and odd serial numbers get put in log file 'B'). The end result is that you've always got the last two serial numbers issued (and no more), the use of disk space is tiny, and if the serial number generator happened to crash while logging the most recent serial number, then the 'most recent' file may be corrupted. Since we also have the 2nd most recent serial number in the other log file, we can detect the crash, and restart the serial number generator from this '2nd most recent' serial number, +2
In a crash scenario where the serial number generator cannot tell which of the 'most recent' or '2nd most recent' log file is the one corrupted, it should be safe to always restart a crashed generator from the the non-corrupted serial number +2.
Option 1 is a little simpler to implement, but option 2 uses less disk space, and seems more intelligent.
Am I missing anything in terms of designing a system that can reliably recover from a crash via sufficient log files?
You need to decide on a realm of "close". By that I mean how many numbers you are willing to lose in case of a crash.
Let's say its 1000.
Then you keep the largest sequence in a file.
When its time to update, you write the new number to a new file, then rename it over the old file. This is an atomic operation on modern file systems, it either works or it doesn't, so it's like a commit in a database. It guarantees you have room for the new sequence information, and should fail without harming the current sequence information if something really untoward happens.
If there's a failure, you have to STOP, and abort the sequence generator.
The key here is that the number on the file system is larger than any issued number. So, you must guarantee that it never ends up below a current issued number, or it will reuse numbers on restart.
So, here's the procedure.
There may be a one off error in here, untested.
Addenda:
If you're that worried, you could keep four pieces of data in memory, and write out two copies. Or better, six and three.
The data you keep in memory is three copies of the counter, and three checksums of those counters (MD5 of the value perhaps).
Then when you write them, you use the same technique as above, write, then rename.
But you write the values and the hashes.
The reason you do this is simple.
If the values of the sequence do not match their hash/checksum, you know that the pair is in error.
You have three copies based on the premise that while one corruption is possible, and not just in disk, but in memory as well -- do not disregard potential memory errors (if you want to go paranoid, go all the way I say), but the fact of the corruption affecting more than one is astronomically unlikely.
When you do detect a failed pair, you have three samples to choose from, and each sample is a "vote". Pick the two that match as the official value, recover with that value, and move on.