How to get the hidden text layout that tesseract creates for pdf files?

528 Views Asked by user6028395 At 28 July 2025 at 01:24

I don't have much experience with OCR. Here's what I tried:

tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf

The result is a perfectly structured hidden text layout: the words are in their exact places when searching the PDF. My question is: can I get this layout as a file (hOCR or HTML)? (Config parameters preferred, not the API.)

What I've tried:
tesseract -l eng -psm 1 image_str007_0001.jpg output hocr

and

hocr2pdf -i image_str007_0001.jpg -o output.pdf < output.hocr

In the file output.pdf the words are badly misplaced when searching through the text. Is command 2 not the correct way to create the tesseract hOCR layout file, or does the hocr2pdf app not create the PDF correctly?
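
One way to narrow down which step is at fault is to look at the word boxes in output.hocr directly, before any PDF conversion. Below is a minimal Python sketch, assuming the file follows the usual hOCR convention of `ocrx_word` spans whose `title` attribute carries `bbox x0 y0 x1 y1` in image pixels (origin at the top-left corner). The names `HocrWords` and `parse_hocr_words` are made up for this illustration and are not part of tesseract or hocr2pdf.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 10 10 40 30; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []    # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested <span> elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) tuples in document order."""
    parser = HocrWords()
    parser.feed(hocr_text)
    return parser.words
```

Calling `parse_hocr_words(open("output.hocr", encoding="utf-8").read())` and comparing a few boxes against the image in a viewer should show whether the coordinates are already wrong in the hOCR file. If they look right there, the hOCR step is probably fine and the misplacement is more likely introduced during the PDF conversion.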
