This is my first time posting here on Stack Overflow because I really have nowhere else to turn.
My problem is extracting a specific table from a PDF file containing multiple tables and converting that table to a dataframe.
Image of the PDF page in question:
In the image, you can see the tables highlighted in red and a specific table highlighted in green. I want only the green table, not the other ones.
I have been trying to do this using Camelot. Basically:
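import camelot

# Parse the page with the stream flavor; this returns every table Camelot detects
tables_dataframe = camelot.read_pdf(PDF_file, pages=page, flavor='stream')
table = tables_dataframe[0]  # first detected table
df_table = table.df  # its contents as a pandas DataFrame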
However, this is not optimal: it takes all the content of that PDF page (see image) and produces one mashed-together dataframe that is too messy.
One could clean up the dataframe afterwards, but I feel it would be much more efficient to generate the desired dataframe from the start.
All help is appreciated. How can I write code that targets only the tables I want when there are multiple tables on the same page?
To target a specific table on a page of a PDF file, you can use the table_regions parameter of the read_pdf() function. The Camelot docs include an example of this.
You will need to play around with the values to get the right table.
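For reference, here is a minimal sketch of how passing table_regions might look. The helper names region_string and read_table_in_region are made up for this illustration, and the coordinates in the usage comment are placeholder values, not the ones for your PDF. Camelot expects each region as a string "x1,y1,x2,y2", where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page, so y1 is larger than y2).

```python
def region_string(x1, y1, x2, y2):
    """Format a bounding box the way Camelot's table_regions expects it.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    in PDF coordinates (origin at the bottom-left of the page).
    This helper is hypothetical, written only for this example.
    """
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return f"{x1},{y1},{x2},{y2}"


def read_table_in_region(pdf_path, page, region, flavor="stream"):
    """Return the first table Camelot finds inside `region` as a DataFrame.

    Hypothetical wrapper; it assumes at least one table is detected.
    """
    # Imported here so region_string() can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=str(page),          # Camelot takes page numbers as a string
        flavor=flavor,
        table_regions=[region],   # only look for tables inside this box
    )
    return tables[0].df


# Example usage (placeholder file name and coordinates):
# region = region_string(170, 370, 560, 270)
# df_table = read_table_in_region("my_file.pdf", 1, region)
```

If you already know the exact boundaries of the table rather than just the rough area it sits in, read_pdf() also accepts table_areas in the same "x1,y1,x2,y2" format. Plotting the page with camelot.plot(table, kind='text') can help when reading off the coordinates.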