Problem extracting a specific table from a PDF-page with multiple tables. (Python)


This is my first time posting here on Stack Overflow because I really have nowhere else to turn.

My problem is extracting a specific table from a PDF-file containing multiple tables, and converting that specific table to a dataframe.

image of PDF-page in question:

In the image, you can see the tables highlighted in red and a specific table highlighted in green. I want only the green table, not the other ones.

I have been trying to do this using Camelot. Basically:


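import camelot

# PDF_file is the path to the PDF; page is the page number as a string, e.g. '1'
tables_dataframe = camelot.read_pdf(PDF_file, pages=page, flavor='stream')
table = tables_dataframe[0]  # first table Camelot detected on the page
df_table = table.df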


This, however, is not optimal: it takes all the content of that PDF page (see image) and merges it into a single dataframe that is too messy to use.

Now one could go about cleaning up the dataframe specifically... but I feel it is much more efficient to just generate the desired dataframe from the start.

All help is appreciated. How can I write general code that targets only the tables I want when there are multiple tables on the same page?


There is 1 best solution below

MagentaPink On

To target a specific table on a page of a PDF file, you can use the table_regions parameter of the read_pdf() function. Here is an example from the Camelot docs:

tables = camelot.read_pdf('table_regions.pdf', table_regions=['170,370,560,270'])

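You will need to play around with the values to get the right table. Each region is a string of the form 'x1,y1,x2,y2', where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page), so adjust the numbers until the region encloses only the table you want.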
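If you measure the table's bounding box in a viewer that counts y from the top of the page, the values have to be flipped before Camelot can use them. Below is a minimal sketch of that, not a definitive implementation: to_region() and read_region() are hypothetical helper names, the page height has to be supplied by you (for example 842 points for a portrait A4 page), and it assumes the stream flavor from the question works for your file.

```python
def to_region(left, top, right, bottom, page_height):
    """Build an 'x1,y1,x2,y2' region string for Camelot.

    The inputs are measured from the top-left corner of the page (as in
    most image viewers). Camelot expects PDF coordinates, whose origin is
    the bottom-left corner, so the y values are flipped here.
    """
    x1, y1 = left, page_height - top       # top-left corner of the region
    x2, y2 = right, page_height - bottom   # bottom-right corner of the region
    return f"{x1:g},{y1:g},{x2:g},{y2:g}"


def read_region(pdf_path, page, region):
    """Return the first table Camelot finds inside `region` as a DataFrame."""
    import camelot  # imported here so to_region() is usable without Camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=str(page),         # Camelot takes the page selection as a string
        flavor="stream",
        table_regions=[region],  # only look for tables inside this region
    )
    if tables.n == 0:
        raise ValueError(f"no table found in region {region!r}")
    return tables[0].df
```

With made-up measurements, a call would look like df = read_region(PDF_file, 1, to_region(170, 472, 560, 572, page_height=842)), which passes the same region string as the docs example above. Keeping the coordinate conversion separate from the Camelot call makes it easy to try different boxes until only the green table is captured.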