Borderless PDF extraction to JSON is not working properly with the Python Camelot library


Can anyone help? We are extracting a borderless table from a PDF to JSON using Python Camelot, but the output does not match the PDF exactly: some content is missing after extraction.


There is 1 best solution below


I tried the following code:

import camelot

pdf_path = '/YOUR/FILEPATH.pdf'
tables = camelot.read_pdf(pdf_path, flavor='stream')


There are two problems:

  • The header font is not read properly, so strange characters such as (cid:71) appear in the output.
  • With flavor='lattice', the table isn't detected at all. With flavor='stream', the table is detected, but its cells aren't split correctly.
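
The first problem can be spotted programmatically. Here is a minimal sketch, assuming the extracted cell texts are available as plain strings (for example, taken from tables[0].df). The helpers find_cid_tokens and cid_ratio are hypothetical names for illustration, not part of Camelot, and the sample cells are made up.

```python
import re

# Matches placeholder tokens such as "(cid:71)", which show up when a glyph
# could not be mapped back to a Unicode character during text extraction.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cid_tokens(cells):
    """Return (cell_index, glyph_id) pairs for every (cid:N) token found."""
    hits = []
    for index, text in enumerate(cells):
        for match in CID_PATTERN.finditer(text):
            hits.append((index, int(match.group(1))))
    return hits


def cid_ratio(cells):
    """Return the fraction of cells containing at least one (cid:N) token."""
    if not cells:
        return 0.0
    affected = sum(1 for text in cells if CID_PATTERN.search(text))
    return affected / len(cells)


# Made-up sample: one unreadable header cell and two readable cells.
sample_cells = ["(cid:71)(cid:72)", "Total", "42"]
print(find_cid_tokens(sample_cells))
print(cid_ratio(sample_cells))
```

A high ratio suggests the font mapping is the issue rather than the table layout, in which case changing the flavor is unlikely to help.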

At the moment, I don't think Camelot can extract this table properly. The developers are working on fixing the second problem (see this and this).
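
To see how well the stream flavor did before trusting the JSON, one option is to look at each table's parsing report while exporting. This is only a sketch: flag_suspect_tables and extract_and_report are hypothetical helpers, and the thresholds are arbitrary assumptions rather than values recommended by Camelot. The report scores are heuristics, so a good score does not guarantee that the cells were split correctly.

```python
def flag_suspect_tables(reports, min_accuracy=90.0, max_whitespace=40.0):
    """Return the parsing reports that look unreliable.

    Each report is a dict like Camelot's Table.parsing_report, with
    "accuracy" and "whitespace" keys. The thresholds are assumptions.
    """
    return [
        report
        for report in reports
        if report["accuracy"] < min_accuracy or report["whitespace"] > max_whitespace
    ]


def extract_and_report(pdf_path, json_path):
    """Extract tables with the stream flavor, export JSON, return suspect reports."""
    # Imported lazily so flag_suspect_tables can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, flavor="stream")
    reports = [table.parsing_report for table in tables]
    # Camelot writes one file per table, deriving the names from json_path.
    tables.export(json_path, f="json")
    return flag_suspect_tables(reports)
```

For example, extract_and_report('/YOUR/FILEPATH.pdf', 'output.json') would return the reports of any tables that fall outside the assumed thresholds, which are the ones worth comparing against the PDF by hand.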