Is this a good case for Viterbi's best path alg?

85 Views Asked by At

I've been working on a program that will read in OCR output, find the page numbers and then give them back to me. Any time my function finds a number it begins a sequence, it then looks on the next page for a number that is 1 greater than the previous. It can also add blanks to extrapolate a missing number.

On any given book my function will identify anywhere from 1-100 potential sequences. Many of the sequences it identifies are junk...totally useless. However, the others are usually subsets of the main sequences that could be stitched together to form a more comprehensive sequence. This is my problem: how do I stitch them together? My output as of now looks something like this:

 Index: 185 PNUM: 158   
 Index: 186 PNUM: 159   
 Index: 187 PNUM: 160   
 Index: 188 PNUM: 161   
 Index: 189 PNUM: 162   
 Index: -1 PNUM: blank   
 Index: -1 PNUM: blank   
 -------------------------------------------------
 Index: 163 PNUM: 134   
 Index: 164 PNUM: 135   
 Index: -1 PNUM: blank   
-------------------------------------------------
 Index: 191 PNUM: 166   
 Index: 192 PNUM: 167   
 Index: 193 PNUM: 168   
 Index: 194 PNUM: 169   

The index is the number of pages from the cover of the book, including all those copyright, dedication, table of contents pages that are traditionally unnumbered. The PNUM is the page number my alg detected. Here we can see three different sequences, the top and bottom of which should be stitched together. As you'll notice the offset between in the index and pnum for the top sequence is 27, while the offset for the bottom sequence is 25. The most common reason for the difference between the offset is either a missing page, or page that was scanned in twice.

It's been suggested to me that I use the Viterbi best path algorithm to stitch these sequences together, but that kind of seems like overkill to me since I really only need to stitch my sequences together, not confirm their accuracy. I really have no idea where to go with this and I greatly appreciate any help. Thanks!

1

There are 1 best solutions below

2
On

Viterbi

Yep, Viterbi would work, slight overkill but will give you lots of flexibility later to make up for problems in OCR, missing pages, duplicates, etc...

If you take the wikipedia pseudocode, your problem can be reformulated as

//this is the actual hidden variable you're trying to guess
states = ('i', 'ii', 'iii', 'iv', ...., '1','2','3' ....)

//what OCR will give you, a 98% accurate view of state
//blank is for when there is no page number
//other is for an OCR result you didn't anticipate, such as 'f413dsaf'
possible_observations = (blank,other, 'i','ii','iii','iv',...,'1','2','3'...)

//the probability distribution of states for the first page
//must sum to 1.0
start_probability = {'i': 0.2, '1':0.5, all the rest: (1-0.7)/numOtherStates}

//the probability that the state '2' is found after '1'
//let's put a 0.05 percent chance of duplicate
//and put a very small probability of getting somewhere random
transition_probability = {
'i' : {'ii':0.8,'1':0.1,'i':0.05,allOthers: 0.05/numOtherStates},
'1' : {'2': 0.9, '1': 0.05, allOthers: 0.05/numOtherStates}
//etc
}

//that's the probability of what you OCR will see given the true state
//for the true page '1', there's 95% percent chance the OCR will see '1', 1% it will see    
//'i', 3% it will see a blank, and 0.01%/otherObservation that it will OCR something else
//you can use some string distance for that one (Levenshtein etc...)
emission_probability = {
'1' : {'1': 0.95, 'i': 0.01, blank: 0.03, otherObservations: (0.01)/numObservations},
'2' : {'2': 0.95, 'z': 0.01, blank: 0.03, otherObservations: (0.01)/numObservations},
}

observations = for i = 1 to maxINDEX {PNUM[INDEX]}

Other possibility: use levenshtein distance

Put all your page numbers sequentially again in an array {PNUM[INDEX=0], PNUM[INDEX=1], ...} and try to match it with 1, 2, 3, ... MAX(PNUM). While computing the distance, the levenshtein algorithm will insert changes (deletes, inserts, page change). If you code it to show those changes, you should have something decent as well.