Tesseract - Preprocessing that Doesn't Affect Final Image

120 Views Asked by prednizone At 20 April 2022 at 20:09

I'm using the latest version of Tesseract (5.0), and I'm trying to determine whether I can insert some preprocessing steps that will not affect the final image.

For example, I might start out with an image such as this.

There are different levels of shadow/brightness, so I might use adaptive Gaussian thresholding to avoid shadows during binarization.
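To make the thresholding step concrete, here is a minimal pure-Python sketch of adaptive Gaussian thresholding on a grayscale image stored as a list of rows. Each pixel is compared against a Gaussian-weighted mean of its neighborhood minus a constant, so a shadowed region is judged against its own local brightness. The function names, the default sigma (block_size / 6) and the edge handling are assumptions made for illustration. In practice one would more likely call OpenCV's cv2.adaptiveThreshold with cv2.ADAPTIVE_THRESH_GAUSSIAN_C, which is much faster and may weight the window differently.

```python
import math


def gaussian_kernel(block_size, sigma):
    """Return a normalized 1-D Gaussian kernel with block_size taps."""
    half = block_size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, repeating edge pixels at borders."""
    half = len(kernel) // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                # Clamp the index so pixels past the border reuse the edge value.
                xx = min(max(x + k - half, 0), width - 1)
                acc += weight * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    """Swap rows and columns of a list-of-rows image."""
    return [list(column) for column in zip(*image)]


def adaptive_gaussian_threshold(gray, block_size=11, c=2, sigma=None):
    """Binarize gray (rows of 0..255 values) against a local Gaussian mean.

    A pixel becomes 255 if it is brighter than (local weighted mean - c),
    otherwise 0. The default sigma is an illustrative assumption.
    """
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")
    if sigma is None:
        sigma = block_size / 6.0  # assumption: window spans about +/- 3 sigma
    kernel = gaussian_kernel(block_size, sigma)
    # A 2-D Gaussian is separable: blur along rows, then along columns.
    local_mean = transpose(blur_rows(transpose(blur_rows(gray, kernel)), kernel))
    return [
        [255 if pixel > mean - c else 0 for pixel, mean in zip(row, mean_row)]
        for row, mean_row in zip(gray, local_mean)
    ]
```

Note that thresholding changes pixel values only, not the image's width or height, so coordinates measured on the binarized image still refer to the same positions in the original.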

I will now run this through Tesseract, with the hope of creating an OCR'd PDF in the end. However, I want the image that the end user (and I) see to be the full-color, original image, with the text recognized from the transformed image underlaid.

Is there a way to manage this? Or am I completely missing the point here?

prednizone On 24 May 2022 at 14:10 BEST ANSWER

I was provided an answer on another forum, and wanted to share it here.

Instead of using the built-in PDF output in Tesseract, I used the hOCR output. My pipeline went:

1. Preprocess the image (thresholding, etc.).
2. Run Tesseract on the preprocessed image with the following command: tesseract example1.jpg example1 -l eng hocr
3. Use the hocr-pdf tool (part of hocr-tools, from the OCRopus project) to merge the hOCR output with the ORIGINAL IMAGE, with no preprocessing applied.
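The steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested recipe: the helper names (tesseract_hocr_cmd, stage_page, run_pipeline) are made up for illustration, and it assumes that hocr-pdf takes a directory holding a JPEG and an .hocr file with the same base name and writes the PDF to standard output, so check hocr-pdf --help on your install. It also relies on the preprocessed image having the same pixel dimensions as the original, because the hOCR bounding boxes are pixel coordinates.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_hocr_cmd(preprocessed_image, output_base, lang="eng"):
    """Build the command from the answer: tesseract <image> <base> -l <lang> hocr."""
    return ["tesseract", str(preprocessed_image), str(output_base),
            "-l", lang, "hocr"]


def stage_page(original_image, hocr_file, workdir):
    """Copy the ORIGINAL image and the hOCR file into workdir.

    Both copies get the hOCR file's base name, so that a tool pairing
    files by name (assumed behavior of hocr-pdf) will match them up.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    base = Path(hocr_file).stem
    image_dest = workdir / (base + Path(original_image).suffix)
    hocr_dest = workdir / (base + ".hocr")
    shutil.copyfile(original_image, image_dest)
    shutil.copyfile(hocr_file, hocr_dest)
    return image_dest, hocr_dest


def run_pipeline(preprocessed_image, original_image, output_base, workdir, pdf_path):
    """OCR the preprocessed image, then build the PDF over the original image."""
    # Step 2: Tesseract writes <output_base>.hocr next to the output base.
    subprocess.run(tesseract_hocr_cmd(preprocessed_image, output_base), check=True)
    # Step 3: pair the hOCR with the untouched original image.
    stage_page(original_image, str(output_base) + ".hocr", workdir)
    # Assumption: hocr-pdf <dir> emits the PDF on stdout.
    with open(pdf_path, "wb") as pdf:
        subprocess.run(["hocr-pdf", str(workdir)], stdout=pdf, check=True)
```

A call might look like run_pipeline("example1_thresh.jpg", "example1.jpg", "example1", "pdf_work", "example1.pdf"), where example1_thresh.jpg is a hypothetical name for the binarized copy.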
