Camelot PDF extraction copies text inconsistently across spanned cells


I am extracting data from PDFs using camelot and have run into the following issue on page 3 of this datasheet. The problematic table is shown below:

The table with issue

The issue is that the content of spanned cells is copied inconsistently. As the following picture shows, the spanned cells themselves are detected correctly.

The grid of the Table

Even though the cells are detected correctly, in the 3rd column the content is copied to only one of the two spanned cells, and in the 4th column it is copied to only two of the three spanned cells. The extracted data is shown below. In each of these two columns, exactly one cell is left empty.

The data I extracted from the table

Here is the code I used, if you want to try it out:

table_areas = ['86, 697, 529, 95']  # To ignore page borders
tables = camelot.read_pdf(single_source, pages='all',
                          flavor='lattice',
                          copy_text=['v'],
                          line_scale=110,
                          table_regions=table_areas,
                          flag_size=False,
                          process_background=False)
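
To make the expected result concrete, here is a minimal sketch of what I would expect copying text vertically across spanned cells to produce. It does not use camelot and is not how camelot implements `copy_text`. The function name `fill_vertical_spans`, the span list, and the placeholder values are all hypothetical, for illustration only.

```python
def fill_vertical_spans(column, spans):
    """Return a copy of `column` in which every row of a span carries the span's text.

    column: list of cell strings, one per grid row ('' where nothing was extracted)
    spans:  list of (first_row, last_row) inclusive index pairs, one per merged cell
    """
    filled = list(column)
    for first, last in spans:
        rows = range(first, last + 1)
        # The text of a merged cell is extracted into only one of its grid rows.
        texts = [filled[r] for r in rows if filled[r].strip()]
        if not texts:
            continue  # nothing was extracted for this span, so leave it empty
        for r in rows:
            filled[r] = texts[0]
    return filled


# Hypothetical column: one merged cell covering rows 0-2 with its text in row 1,
# followed by an ordinary cell in row 3.
column = ["", "A", "", "B"]
print(fill_vertical_spans(column, [(0, 2)]))
```

In terms of this sketch, the problem I see is that one row of each span stays empty while the others are filled.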

Code (Colab):

!pip install "camelot-py[cv]" -q
!pip install PyPDF2==2.12.1
!apt-get install ghostscript
import camelot
import pandas as pd
from tabulate import tabulate
import re
import fitz
single_source = '/content/FDB9406_F085-D.PDF'
print("Extracting ", single_source, "...")

table_areas = ['86, 697, 529, 95']
tables = camelot.read_pdf(single_source, pages='all', flavor='lattice',
                          copy_text=['v'], line_scale=110,
                          table_regions=table_areas, flag_size=False,
                          process_background=False)


print("Extracting ", single_source, "is finished!")

To visualize the tables:

for table in tables:  # `tables` is the TableList returned by camelot.read_pdf above
  print(table.parsing_report, table.shape, table._bbox)
  print(tabulate(table.df, headers='keys', tablefmt='psql'))
  camelot.plot(table, kind='grid').show()
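
To pinpoint which cells were left empty after extraction, a small helper along these lines may be useful. This is only a sketch: `empty_cells` is a hypothetical name, and it works on plain nested lists, which for a camelot table could be obtained with `table.df.values.tolist()`.

```python
def empty_cells(rows):
    """Return (row, column) index pairs of cells that are empty or whitespace only."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, cell in enumerate(row)
        if not str(cell).strip()
    ]


# Placeholder data standing in for an extracted table.
sample = [["a", "b"], ["c", ""]]
print(empty_cells(sample))
```

Comparing the reported positions with the grid plot shows which row of each span did not receive the copied text.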