How can I use tabula to extract a table in more detail with a Python script?

This is my code:

import tabula

# Specify the path to your PDF file
pdf_path = "path.pdf"

# Use tabula.read_pdf with the default auto method
tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)

# Print each table
for i, table in enumerate(tables):
    print(f"Table {i + 1}:\n{table}\n")

This is the result: table extracted using tabula (python script)

But in the PDF, the table looks like this: table i want to extract in pdf file

How can I extract the table so that it matches the sample table in the PDF?

There are 2 best solutions below

I have found that setting the lattice parameter to True makes the table look better, like this: table printed out in terminal after using the lattice parameter

tables = tabula.read_pdf(pdf_path, pages='all',
                         multiple_tables=True, lattice=True)

But there are still redundant columns, for example the Unnamed: 0 column at the beginning and the Unnamed: 1 column at the end. So, how can I make it better?
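One possible way to get rid of those redundant columns is to clean each DataFrame after extraction. The sketch below is only an illustration: the helper name drop_empty_unnamed_columns and the sample values are made up, and it assumes the unwanted columns are named "Unnamed: N" and contain no data at all.

```python
import pandas as pd


def drop_empty_unnamed_columns(table: pd.DataFrame) -> pd.DataFrame:
    """Drop columns named 'Unnamed: N' that contain only missing values.

    Hypothetical helper, not part of tabula. Columns that have an
    'Unnamed' header but still hold data are kept.
    """
    empty_unnamed = [
        col for col in table.columns
        if str(col).startswith("Unnamed:") and table[col].isna().all()
    ]
    return table.drop(columns=empty_unnamed)


# Stand-in for one extracted table (made-up values, not from the PDF)
sample = pd.DataFrame({
    "Unnamed: 0": [None, None],
    "Item": ["A", "B"],
    "Qty": [1, 2],
    "Unnamed: 1": [None, None],
})

cleaned = drop_empty_unnamed_columns(sample)
print(list(cleaned.columns))
```

With the list returned by tabula.read_pdf, the same helper could be applied to every table, for example with tables = [drop_empty_unnamed_columns(t) for t in tables].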

You can use the area parameter to specify the region of the page where the table is located:

area (iterable of float, iterable of iterable of float, optional) –

Portion of the page to analyze (top, left, bottom, right). Default is entire page.

Updated code:

tables = tabula.read_pdf(
    pdf_path,
    pages='all',
    multiple_tables=True,
    lattice=True,
    area=[40, 5, 175, 730]  # adjust these values to match the table position
)
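If it is easier to measure the table position in inches (for example with a PDF viewer's ruler), the values can be converted before passing them to area. This is a rough sketch that assumes tabula reads area in PDF points (1 point = 1/72 inch), which is my understanding of the default; the helper area_from_inches and the measurements are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF points: 1 point = 1/72 inch


def area_from_inches(top, left, bottom, right):
    """Convert (top, left, bottom, right) from inches to points.

    Hypothetical helper, not part of tabula. The order matches the
    order the area parameter expects.
    """
    return [round(v * POINTS_PER_INCH, 2) for v in (top, left, bottom, right)]


# Made-up measurements: table starts 0.5 in from the top and 0.1 in from the left
area = area_from_inches(0.5, 0.1, 2.5, 10.0)
print(area)
```

The resulting list could then be passed as area=area in the read_pdf call above. It will likely still need some trial and error until the extracted columns line up with the table.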