How can I get the hidden text layout that Tesseract creates for PDF files?


I don't have much experience with OCR. Here's what I tried:

  1. tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf

    The result is a perfectly structured hidden text layer: the words are in their exact places when searching the PDF. My question is: can I get this layout as a file (hOCR or HTML)? (Config parameters preferred, not the API.)

    What I've tried:

  2. tesseract -l eng -psm 1 image_str007_0001.jpg output hocr

and

  3. hocr2pdf -i image_str007_001 -o output.pdf < output.hocr

    In the resulting output.pdf, the words are badly misplaced when searching through the text. Is command 2 the wrong way to create the Tesseract hOCR layout file, or is hocr2pdf not creating the PDF correctly?
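
One way to narrow down which step is at fault is to look at the word boxes stored in the hOCR file itself. The sketch below is only an illustration, not part of Tesseract or hocr2pdf: it assumes the hOCR output marks words as `ocrx_word` spans whose `title` attribute carries `bbox x0 y0 x1 y1` in image pixels. The names `WordBoxParser`, `parse_word_boxes`, and the `SAMPLE` fragment are hypothetical and made up for the example. If the boxes match the word positions in the image, the hOCR from command 2 is probably fine and the conversion to PDF is the more likely culprit.

```python
import re
from html.parser import HTMLParser

# Matches "bbox x0 y0 x1 y1" as it appears in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Hypothetical hOCR fragment, written by hand for this example.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 2480 3508; ppageno 0'>
 <span class='ocr_line' title='bbox 300 200 900 260'>
  <span class='ocrx_word' title='bbox 300 200 520 260; x_wconf 96'>Hidden</span>
  <span class='ocrx_word' title='bbox 540 200 700 260; x_wconf 93'>text</span>
 </span>
</div>
"""


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []    # text pieces seen inside that span
        self._depth = 0    # spans nested inside the open word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            # A span nested inside a word (e.g. per-character info).
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def parse_word_boxes(hocr_text):
    """Return a list of (word, bbox) tuples found in hOCR markup."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, (x0, y0, x1, y1) in parse_word_boxes(SAMPLE):
        print(f"{word!r}: left={x0} top={y0} right={x1} bottom={y1}")
```

To check a real file, the contents of `output.hocr` could be read and passed to `parse_word_boxes` in place of `SAMPLE`, and the printed boxes compared against the word positions in the source image.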
