I am using tesseract-ocr and get the output in hOCR format. I need to store this hOCR output in a database (PostgreSQL in my case).
Since I may need most of the information in the hOCR (about 80% of it) individually, which would be the right approach? Should it be stored as the XML datatype, or parsed to JSON and stored that way? And in the case of JSON, how can I parse the hOCR to JSON with Python? Other related suggestions are also appreciated.
hOCR appears to be a dialect of XML, so you should be able to use the `xml.etree` module from the stdlib to parse the hOCR code into a Python-navigable tree. Then navigate that tree to compose an object or nested dict, and finally use the stdlib's `json` module to convert that dict to JSON.
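Here is a minimal sketch of that approach, not a definitive implementation. It assumes the hOCR is well-formed XML, that the elements of interest carry a `class` starting with `ocr`, and that their `title` attribute holds `;`-separated properties such as `bbox` with integer coordinates. The sample markup and the helper names (`parse_title`, `element_to_dict`, `hocr_to_dict`) are made up for illustration; real Tesseract output has more nesting levels and properties.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, hand-written hOCR fragment used only for illustration.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" id="page_1" title="bbox 0 0 200 100">
   <span class="ocr_line" id="line_1_1" title="bbox 10 10 190 40">
    <span class="ocrx_word" id="word_1_1" title="bbox 10 10 90 40; x_wconf 96">Hello</span>
    <span class="ocrx_word" id="word_1_2" title="bbox 100 10 190 40; x_wconf 91">world</span>
   </span>
  </div>
 </body>
</html>"""


def parse_title(title):
    """Split a title attribute like "bbox 0 0 10 10; x_wconf 96" into a dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        key, values = tokens[0], tokens[1:]
        if key == "bbox":
            # Assumption: bounding boxes are four integer pixel coordinates.
            props[key] = [int(v) for v in values]
        else:
            # Other properties are kept as plain strings.
            props[key] = " ".join(values)
    return props


def element_to_dict(elem):
    """Recursively convert one hOCR element into a nested dict."""
    node = {
        "class": elem.get("class"),
        "id": elem.get("id"),
        "properties": parse_title(elem.get("title", "")),
    }
    # Only direct children whose class starts with "ocr" are treated as hOCR nodes.
    children = [
        element_to_dict(child)
        for child in elem
        if child.get("class", "").startswith("ocr")
    ]
    if children:
        node["children"] = children
    else:
        # Leaf node: collect its text, including text inside inline markup.
        text = "".join(elem.itertext()).strip()
        if text:
            node["text"] = text
    return node


def hocr_to_dict(hocr):
    """Parse an hOCR string and return a dict with one entry per page."""
    root = ET.fromstring(hocr)
    pages = [e for e in root.iter() if e.get("class") == "ocr_page"]
    return {"pages": [element_to_dict(page) for page in pages]}


document = hocr_to_dict(SAMPLE_HOCR)
hocr_json = json.dumps(document, indent=2)
print(hocr_json)
```

The resulting string could then go into a PostgreSQL `json` or `jsonb` column, which should make it easier to query individual words or bounding boxes than storing the raw markup.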