Basic OCR PostProcessing (Spelling corrector)


I'm setting up a server to do a lot of automated OCR using tesseract, and I want to do some postprocessing of the results.

There are a LOT of resources about this on the theoretical side, but I haven't found much on the practical side.

I imagine there are some basic things you can do, like:

  • Eliminate three identical letters in a row
  • Eliminate 'words' with all of the vowels
  • Eliminate 'words' longer than a certain length
  • Etc.
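Heuristics like these are easy to sketch as token filters. The snippet below is a minimal, hypothetical implementation: the thresholds (three repeats, 20 characters) are assumptions rather than tuned values, and since the vowel rule in the list is ambiguous, this version drops vowel-less tokens, a common variant of that check.

```python
import re

TRIPLE_LETTER = re.compile(r"(.)\1\1")  # three identical characters in a row
MAX_WORD_LEN = 20                       # assumed cutoff, not a tuned value
VOWELS = set("aeiouy")

def looks_garbled(word: str) -> bool:
    """Return True if a token trips one of the basic sanity checks."""
    lowered = word.lower()
    if TRIPLE_LETTER.search(lowered):
        return True                      # e.g. "cccourt"-style OCR noise
    if len(word) > MAX_WORD_LEN:
        return True                      # implausibly long character run
    if word.isalpha() and not set(lowered) & VOWELS:
        return True                      # all-consonant token, e.g. "wxqrt"
    return False

def filter_tokens(text: str) -> list[str]:
    """Drop tokens that fail the sanity checks before indexing."""
    return [w for w in text.split() if not looks_garbled(w)]
```

For a search pipeline this would run between the OCR step and the indexer, so obviously broken tokens never enter the wordmap.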

I haven't given this a ton of thought, but the OCR'ed text gets fed into a search system, so keeping the wordmap small is a good thing, as is eliminating or fixing words that are obviously wrong.

If it matters, the content itself is court documents written in English. So there are proper names from time to time, but the variety of words probably isn't huge, and the fonts are probably pretty stable.

Any pointers or good resources I should know about?

1 Answer

Each OCR engine has its own set of common errors, which will also depend on the fonts in the document, the quality of the scanning, the DPI used, the colour of the background, and the image pre-processing applied, such as despeckling, deskewing, and line removal. You will only learn what these errors are by performing lots of test runs and analysing the results, looking for a common set of mistakes.
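One way to do that analysis, sketched below under the assumption that you have a few ground-truth transcriptions to compare against: align each OCR output with its reference text and tally what each span was misread as. The function name and approach are illustrative, not part of any standard tool.

```python
from collections import Counter
from difflib import SequenceMatcher

def substitution_errors(truth: str, ocr: str) -> Counter:
    """Tally character-level substitutions between ground truth and OCR output.

    Aligns the two strings and counts each (truth span, misread span) pair,
    e.g. the classic 'rn' -> 'm' confusion. Aggregating these counters over
    many test documents reveals the engine's common error set.
    """
    errors = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "replace":
            errors[(truth[a1:a2], ocr[b1:b2])] += 1
    return errors
```

The most frequent pairs in the aggregated counter become candidates for targeted post-correction rules.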

Using the correct scanner settings and image preprocessing algorithms can improve OCR results considerably. Don't underestimate this part.

If the text is mainly English words, then a good dictionary combined with a fuzzy lookup will be very helpful. Other useful techniques are trigram analysis and voting between the results of a second OCR engine.
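The fuzzy-lookup idea can be sketched in a few lines. This is a stand-in, not a recommendation of a specific library: the tiny lexicon and the 0.75 cutoff are assumptions, and a real deployment would load a full English word list (plus proper names harvested from the corpus) and likely use a proper fuzzy index such as a BK-tree or SymSpell.

```python
from difflib import get_close_matches

# Tiny illustrative lexicon; a real system would load a full word list.
LEXICON = ["court", "plaintiff", "defendant", "judgment", "motion", "order"]

def correct(word: str, cutoff: float = 0.75) -> str:
    """Snap a token to its closest lexicon entry, or return it unchanged.

    Uses difflib's ratio-based matching as a simple stand-in for a real
    fuzzy dictionary lookup.
    """
    lowered = word.lower()
    if lowered in LEXICON:
        return word                       # already a known word
    hits = get_close_matches(lowered, LEXICON, n=1, cutoff=cutoff)
    return hits[0] if hits else word      # leave unknown tokens alone
```

Leaving unmatched tokens unchanged matters here: court documents contain proper names that no dictionary will cover, and silently "correcting" them would be worse than indexing them as-is.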