Parsing hOCR to JSON with Python

I am using tesseract-ocr and getting its output in hOCR format. I need to store this hOCR output in a database (PostgreSQL in my case).

Since I may need most of the information in the hOCR (around 80% of it) individually, which approach is right: storing it as an XML datatype, or parsing it to JSON and storing that? And if JSON, how do I parse hOCR to JSON with Python? Other related suggestions are also appreciated.

There is 1 solution below

hOCR is embedded in HTML/XHTML, and the hOCR that Tesseract writes is normally well-formed XHTML, so you should be able to use the xml.etree.ElementTree module from the stdlib to parse it into a Python-navigable tree. Then walk that tree to build an object or nested dict, and finally use the stdlib's json module to convert that dict to JSON. If a file turns out not to be well-formed XML, an HTML parser would be needed instead.
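
A minimal sketch of that approach, assuming the hOCR is well-formed XHTML and uses the usual class names (ocr_page, ocr_line, ocrx_word) with properties packed into the title attribute. The helper names (local_name, parse_title, element_to_dict, hocr_to_json), the output layout, and the SAMPLE document are made up for illustration; adapt them to whatever your files actually contain.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://www.w3.org/1999/xhtml}span' -> 'span'."""
    return tag.rsplit("}", 1)[-1]


def parse_title(title):
    """Turn 'bbox 10 20 90 50; x_wconf 96' into a dict of properties."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "bbox":
            props[key] = [int(v) for v in values]
        elif key == "x_wconf" and values:
            props[key] = float(values[0])
        else:
            # Anything else is kept as a plain string (assumption: good enough here).
            props[key] = " ".join(values)
    return props


def element_to_dict(elem):
    """Convert one hOCR element and its nested hOCR elements to a dict."""
    node = {
        "tag": local_name(elem.tag),
        "class": elem.get("class"),
        "id": elem.get("id"),
        "properties": parse_title(elem.get("title", "")),
    }
    # Only recurse into children that carry an hOCR class ('ocr_*' / 'ocrx_*');
    # inline markup such as <strong> or <em> is folded into the text instead.
    children = [c for c in elem if (c.get("class") or "").startswith("ocr")]
    if children:
        node["children"] = [element_to_dict(c) for c in children]
    else:
        node["text"] = "".join(elem.itertext()).strip()
    return node


def hocr_to_json(hocr_text):
    """Parse an hOCR string and return a JSON string with one entry per page."""
    root = ET.fromstring(hocr_text)
    pages = [element_to_dict(e) for e in root.iter() if e.get("class") == "ocr_page"]
    return json.dumps({"pages": pages}, indent=2)


# Hypothetical, heavily trimmed hOCR document used only to exercise the sketch.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class='ocr_page' id='page_1' title='bbox 0 0 200 100; ppageno 0'>
   <span class='ocr_line' id='line_1_1' title='bbox 10 20 190 50'>
    <span class='ocrx_word' id='word_1_1' title='bbox 10 20 90 50; x_wconf 96'>Hello</span>
    <span class='ocrx_word' id='word_1_2' title='bbox 100 20 190 50; x_wconf 91'>world</span>
   </span>
  </div>
 </body>
</html>"""

if __name__ == "__main__":
    print(hocr_to_json(SAMPLE))
```

For a file on disk, ET.parse(path).getroot() can replace the ET.fromstring call. The resulting JSON string could then go into a PostgreSQL json or jsonb column, which would let you query individual words or bounding boxes later.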