Basic OCR PostProcessing (Spelling corrector)


I'm setting up a server to do a lot of automated OCR using tesseract, and I want to do some postprocessing of the results.

There are a LOT of resources about this on the theoretical side, but I haven't found much on the practical side.

I imagine there are some basic things you can do, like:

  • Eliminate three identical letters in a row
  • Eliminate 'words' with all of the vowels
  • Eliminate 'words' longer than a certain length
  • Etc.
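Heuristics like these are easy to sketch as token filters. The snippet below is a minimal, hypothetical implementation: the thresholds (three repeats, 20 characters) are assumptions rather than tuned values, and since the vowel rule in the list is ambiguous, this version drops vowel-less tokens, a common variant of that check.

```python
import re

TRIPLE_LETTER = re.compile(r"(.)\1\1")  # three identical characters in a row
MAX_WORD_LEN = 20                       # assumed cutoff, not a tuned value
VOWELS = set("aeiouy")

def looks_garbled(word: str) -> bool:
    """Return True if a token trips one of the basic sanity checks."""
    lowered = word.lower()
    if TRIPLE_LETTER.search(lowered):
        return True                      # e.g. "cccourt"-style OCR noise
    if len(word) > MAX_WORD_LEN:
        return True                      # implausibly long character run
    if word.isalpha() and not set(lowered) & VOWELS:
        return True                      # all-consonant token, e.g. "wxqrt"
    return False

def filter_tokens(text: str) -> list[str]:
    """Drop tokens that fail the sanity checks before indexing."""
    return [w for w in text.split() if not looks_garbled(w)]
```

For a search pipeline this would run between the OCR step and the indexer, so obviously broken tokens never enter the wordmap.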

I haven't given this a ton of thought, but the OCR'ed text gets fed into a search system, so keeping the wordmap small is a good thing, as is eliminating or fixing words that are obviously wrong.

If it matters, the content itself is court documents written in English. So there are proper names from time to time, but the variety of words probably isn't huge, and the fonts are probably pretty stable.

Any pointers or good resources I should know about?

1 Answer

Each OCR engine has its own set of common errors, which will also depend on the fonts in the document, the quality of the scanning, the DPI used, the colour of the background, and the image pre-processing applied, such as despeckling, deskewing, and line removal. You will only learn what these errors are by performing lots of test runs and analysing the results, looking for a common set of mistakes.
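One way to do that analysis, sketched below under the assumption that you have a few ground-truth transcriptions to compare against: align each OCR output with its reference text and tally what each span was misread as. The function name and approach are illustrative, not part of any standard tool.

```python
from collections import Counter
from difflib import SequenceMatcher

def substitution_errors(truth: str, ocr: str) -> Counter:
    """Tally character-level substitutions between ground truth and OCR output.

    Aligns the two strings and counts each (truth span, misread span) pair,
    e.g. the classic 'rn' -> 'm' confusion. Aggregating these counters over
    many test documents reveals the engine's common error set.
    """
    errors = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "replace":
            errors[(truth[a1:a2], ocr[b1:b2])] += 1
    return errors
```

The most frequent pairs in the aggregated counter become candidates for targeted post-correction rules.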

Using the correct scanner settings and image preprocessing algorithms can improve OCR results considerably. Don't underestimate this part.

If the text is mainly English words, then a good dictionary combined with a fuzzy lookup will be very helpful. Other useful techniques are trigram analysis and voting between the results of a second OCR engine.
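The fuzzy-lookup idea can be sketched in a few lines. This is a stand-in, not a recommendation of a specific library: the tiny lexicon and the 0.75 cutoff are assumptions, and a real deployment would load a full English word list (plus proper names harvested from the corpus) and likely use a proper fuzzy index such as a BK-tree or SymSpell.

```python
from difflib import get_close_matches

# Tiny illustrative lexicon; a real system would load a full word list.
LEXICON = ["court", "plaintiff", "defendant", "judgment", "motion", "order"]

def correct(word: str, cutoff: float = 0.75) -> str:
    """Snap a token to its closest lexicon entry, or return it unchanged.

    Uses difflib's ratio-based matching as a simple stand-in for a real
    fuzzy dictionary lookup.
    """
    lowered = word.lower()
    if lowered in LEXICON:
        return word                       # already a known word
    hits = get_close_matches(lowered, LEXICON, n=1, cutoff=cutoff)
    return hits[0] if hits else word      # leave unknown tokens alone
```

Leaving unmatched tokens unchanged matters here: court documents contain proper names that no dictionary will cover, and silently "correcting" them would be worse than indexing them as-is.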