What are the strategies to convert an HOCR output to a string (for regex purposes)?

1.4k Views Asked by Maxime Georges At 09 August 2019 at 15:40

I am working with Pytesseract and would like to convert its hOCR output to a string. Pytesseract already implements such a function, of course, but I would like to know more about the possible strategies for doing it myself. Thanks.

from pytesseract import image_to_pdf_or_hocr
hocr_output = image_to_pdf_or_hocr(image, extension='hocr')


There is 1 best solution below

David Rubio On 17 November 2019 at 23:37

Since the hOCR that Tesseract produces is XHTML, and therefore XML, we can use an XML parser.

But first we need to decode the bytes that Tesseract returns into a str:

from pytesseract import image_to_pdf_or_hocr

hocr_output = image_to_pdf_or_hocr(image, extension='hocr')
hocr = hocr_output.decode('utf-8')

Now we can use xml.etree to parse it:

import xml.etree.ElementTree as ET

root = ET.fromstring(hocr)

xml.etree provides a text iterator, itertext(), whose results we can join into a single string:

text = ''.join(root.itertext())
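
Joining everything from itertext() also picks up the page title and the indentation whitespace between elements, which can get in the way of a regex. As an alternative strategy, here is a minimal sketch that collects only the word elements and rebuilds one string per line. It assumes the hOCR marks lines with the class ocr_line and words with the class ocrx_word; Tesseract may use other line-level classes for some text, so treat the class names as something to check against your own output. SAMPLE_HOCR is a hand-written stand-in for real Tesseract output, and has_class, hocr_to_lines and hocr_to_text are hypothetical helper names, not part of Pytesseract.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written hOCR fragment standing in for real Tesseract output (assumption).
SAMPLE_HOCR = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title>sample</title></head>
 <body>
  <div class='ocr_page' id='page_1'>
   <span class='ocr_line' id='line_1_1'>
    <span class='ocrx_word' id='word_1_1'>Invoice</span>
    <span class='ocrx_word' id='word_1_2'>No.</span>
    <span class='ocrx_word' id='word_1_3'>12345</span>
   </span>
   <span class='ocr_line' id='line_1_2'>
    <span class='ocrx_word' id='word_1_4'>Total:</span>
    <span class='ocrx_word' id='word_1_5'>99.50</span>
   </span>
  </div>
 </body>
</html>"""


def has_class(elem, name):
    """Return True if the element's class attribute contains the given name."""
    return name in elem.get("class", "").split()


def hocr_to_lines(hocr):
    """Return one string per ocr_line element, with words joined by single spaces."""
    root = ET.fromstring(hocr)
    lines = []
    # Matching on the class attribute sidesteps the XHTML namespace in tag names.
    for line in root.iter():
        if not has_class(line, "ocr_line"):
            continue
        words = [
            "".join(word.itertext()).strip()
            for word in line.iter()
            if has_class(word, "ocrx_word")
        ]
        lines.append(" ".join(w for w in words if w))
    return lines


def hocr_to_text(hocr):
    """Join the recovered lines with newlines so line-anchored regexes still work."""
    return "\n".join(hocr_to_lines(hocr))


text = hocr_to_text(SAMPLE_HOCR)
match = re.search(r"Invoice No\. (\d+)", text)
print(text)
print(match.group(1))
```

Keeping the line breaks is a design choice: it lets patterns use ^ and $ with re.MULTILINE, while a single space between words keeps patterns that span several words simple.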
