I've been working on a program that reads in OCR output, finds the page numbers, and gives them back to me. Whenever my function finds a number it begins a sequence; it then looks on the next page for a number that is 1 greater than the previous one. It can also insert blanks to extrapolate across a missing number.
On any given book my function will identify anywhere from 1 to 100 potential sequences. Many of the sequences it identifies are junk, totally useless. The others, however, are usually subsets of the main sequence that could be stitched together to form a more comprehensive one. This is my problem: how do I stitch them together? My output as of now looks something like this:
Index: 185 PNUM: 158
Index: 186 PNUM: 159
Index: 187 PNUM: 160
Index: 188 PNUM: 161
Index: 189 PNUM: 162
Index: -1 PNUM: blank
Index: -1 PNUM: blank
-------------------------------------------------
Index: 163 PNUM: 134
Index: 164 PNUM: 135
Index: -1 PNUM: blank
-------------------------------------------------
Index: 191 PNUM: 166
Index: 192 PNUM: 167
Index: 193 PNUM: 168
Index: 194 PNUM: 169
The index is the number of pages from the cover of the book, including all those copyright, dedication, and table-of-contents pages that are traditionally unnumbered. The PNUM is the page number my algorithm detected. Here we can see three different sequences, the top and bottom of which should be stitched together. As you'll notice, the offset between the index and the PNUM for the top sequence is 27, while the offset for the bottom sequence is 25. The most common reasons for a difference between offsets are a missing page or a page that was scanned in twice.
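For reference, here's a simplified sketch of how those offsets can be computed; the list-of-(index, pnum)-pairs representation below is just for illustration, not my actual data structure:

    def sequence_offset(seq):
        # Offset between the physical index and the printed page number,
        # taken from the first non-blank entry (blanks stored as (-1, None)).
        for index, pnum in seq:
            if pnum is not None:
                return index - pnum

    sequences = [
        [(185, 158), (186, 159), (187, 160), (188, 161), (189, 162),
         (-1, None), (-1, None)],
        [(163, 134), (164, 135), (-1, None)],
        [(191, 166), (192, 167), (193, 168), (194, 169)],
    ]

    print([sequence_offset(s) for s in sequences])   # [27, 29, 25]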
It's been suggested to me that I use the Viterbi best-path algorithm to stitch these sequences together, but that seems like overkill to me, since I really only need to stitch my sequences together, not confirm their accuracy. I have no idea where to go with this, and I'd greatly appreciate any help. Thanks!
Viterbi
Yep, Viterbi would work. It's slight overkill, but it will give you lots of flexibility later to make up for problems in the OCR, missing pages, duplicates, and so on.
If you take the Wikipedia pseudocode, your problem can be reformulated as a hidden Markov model: the hidden state at each index is the offset (index minus true page number), the observation is the PNUM detected at that index, the transition probabilities strongly favor keeping the same offset (with a small probability of shifting for a missing or duplicated page), and the emission probabilities favor seeing PNUM = index - offset, with some leftover probability for blanks and OCR junk.
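A rough sketch of what that could look like (Python; the function name, offset range, and probabilities are illustrative assumptions, not tuned values):

    import math

    def viterbi_offsets(pnums, offsets, p_stay=0.95, p_match=0.9):
        """pnums: detected page number per index, list position = pages
        from cover, None where nothing was detected.
        offsets: candidate hidden states, i.e. (index - true page number).
        Returns the most likely offset for every index."""

        def emit_logp(i, offset):
            # High probability when the detected number agrees with this
            # offset, a small floor otherwise (OCR junk, blanks).
            if pnums[i] is not None and i - pnums[i] == offset:
                return math.log(p_match)
            return math.log(1.0 - p_match)

        def trans_logp(prev, cur):
            # Strongly favor keeping the same offset; allow small shifts
            # (a missing page grows the offset, a duplicate scan shrinks it).
            if prev == cur:
                return math.log(p_stay)
            if abs(prev - cur) <= 2:
                return math.log((1.0 - p_stay) / 4)
            return -math.inf

        # Standard Viterbi tables: best log-probability and backpointer per
        # state, with a uniform prior over offsets at index 0.
        best = [{o: emit_logp(0, o) for o in offsets}]
        back = [{}]
        for i in range(1, len(pnums)):
            best.append({})
            back.append({})
            for cur in offsets:
                score, prev = max(
                    ((best[i - 1][p] + trans_logp(p, cur), p) for p in offsets),
                    key=lambda t: t[0])
                best[i][cur] = score + emit_logp(i, cur)
                back[i][cur] = prev

        # Trace the single best path of offsets back from the last index.
        state = max(best[-1], key=best[-1].get)
        path = [state]
        for i in range(len(pnums) - 1, 0, -1):
            state = back[i][state]
            path.append(state)
        return path[::-1]

Running it over the whole book with, say, offsets=range(20, 35) gives you one offset per index; stitching then just means reading the true page number back as index - offset, and the places where the offset changes are exactly your missing or duplicated pages.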
Other possibility: use Levenshtein distance
Put all your page numbers back into one sequential array {PNUM[INDEX=0], PNUM[INDEX=1], ...} and try to match it against 1, 2, 3, ..., MAX(PNUM). While computing the distance, the Levenshtein algorithm will record edit operations (deletions for duplicated scans, insertions for missing pages, substitutions for misread numbers). If you code it to report those changes, you should have something decent as well.
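A quick sketch of that idea (Python; the None-for-blank convention and the short slice of the ideal sequence are assumptions made just for the example):

    def align(detected, ideal):
        """Wagner-Fischer edit distance with a backtrace of the operations."""
        m, n = len(detected), len(ideal)
        # dist[i][j] = cost of aligning detected[:i] with ideal[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if detected[i - 1] == ideal[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # delete: duplicated scan
                                 dist[i][j - 1] + 1,        # insert: missing page
                                 dist[i - 1][j - 1] + sub)  # match or OCR misread
        # Walk back from the corner to recover the edit operations in order.
        ops, i, j = [], m, n
        while i > 0 or j > 0:
            sub = 0 if i > 0 and j > 0 and detected[i - 1] == ideal[j - 1] else 1
            if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + sub:
                ops.append(("match" if sub == 0 else "misread",
                            detected[i - 1], ideal[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
                ops.append(("duplicated scan", detected[i - 1], None))
                i -= 1
            else:
                ops.append(("missing page", None, ideal[j - 1]))
                j -= 1
        return list(reversed(ops))

    detected = [158, 159, None, 161, 162]     # None marks a blank or junk read
    for op in align(detected, list(range(158, 164))):
        print(op)                             # flags the misread 160, missing 163

The "missing page" and "duplicated scan" operations it reports are exactly the events that shift your offsets, so walking the operation list tells you where the stitches go.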