Easiest way to ignore or drop one header row from the first page when parsing a table spanning several pages


I am parsing a PDF with tabula-py, and I need to ignore the first two tables, then parse the rest of the tables as one and export the result to a CSV file. In the first relevant table (index 2) the first row is a header row, and I want to leave it out of the CSV.

See my code below, including my attempt at dropping the relevant row from the Pandas frame.

What is the easiest/most elegant way of achieving this?

import tabula

tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)
f = open('output.csv', 'w')
# tables[2].drop(index=0) # tried this, but makes no difference
for df in tables[2:]:
    df.to_csv(f, index=False, sep=';')
f.close()

There is 1 best solution below


Given the following toy dataframes:

import pandas as pd

tables = [
    pd.DataFrame([[1, 3], [2, 4]]),
    pd.DataFrame([["a", "b"], [1, 3], [2, 4]]),
]
for table in tables:
    print(table)
# Output
   0  1
0  1  3
1  2  4

   0  1
0  a  b  <<< Unwanted row in tables[1]
1  1  3
2  2  4

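Note that drop returns a new dataframe and leaves the original untouched, which is why calling it without storing the result makes no difference. You can drop the first row of the second dataframe either by reassigning the resulting dataframe (the preferable way):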

tables[1] = tables[1].drop(index=0)

Or inplace:

tables[1].drop(index=0, inplace=True)

And so, in both cases:

print(tables[1])
# Output
   0  1
1  1  3
2  2  4
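
Applied to the situation in the question, this could look roughly like the sketch below. It is only an illustration under a few assumptions: tables_to_csv is a made-up helper name, the toy list stands in for whatever tabula.read_pdf returns, all kept tables are assumed to share the same column labels, and header=False is my choice so that no column names are written at all.

```python
import io

import pandas as pd


def tables_to_csv(tables, out, skip=2, sep=";"):
    """Hypothetical helper: write tables[skip:] as one CSV,
    leaving out the first row of the first kept table."""
    kept = list(tables[skip:])
    # drop() returns a new dataframe, so the result has to be stored.
    kept[0] = kept[0].drop(index=kept[0].index[0])
    # Stack the remaining tables; assumes they share column labels.
    combined = pd.concat(kept, ignore_index=True)
    combined.to_csv(out, index=False, header=False, sep=sep)
    return combined


# Toy stand-in for the list of dataframes parsed from the PDF.
tables = [
    pd.DataFrame([["x", "y"]]),                  # ignored
    pd.DataFrame([["p", "q"]]),                  # ignored
    pd.DataFrame([["a", "b"], [1, 3], [2, 4]]),  # first row unwanted
    pd.DataFrame([[5, 7], [6, 8]]),
]
buffer = io.StringIO()
combined = tables_to_csv(tables, buffer)
print(buffer.getvalue())
```

With the real data you would pass the list from the question and an open file (or a path) instead of the StringIO buffer. Concatenating first also means to_csv is called once, rather than once per table.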