I am trying to read a table from a PDF file using tabula's read_pdf()
method, but it does not read the complete table: some rows are missing.
This is the code I am using:
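import numpy as np
import pandas as pd
import tabula

# f is the input PDF and pas is its password; both are defined elsewhere.
tables = tabula.read_pdf(
    f,
    stream=True,
    pages="all",
    silent=False,
    multiple_tables=True,
    password=pas,
    pandas_options={'header': None},
)
# Combine the extracted tables and drop exact duplicate rows.
df = pd.concat(tables).drop_duplicates()
# Treat empty or whitespace-only cells as missing.
df = df.replace(r'^\s*$', np.nan, regex=True)
# Keep only rows with at least three non-null cells.
df = df[df.isnull().sum(axis=1) < df.shape[1] - 2].reset_index(drop=True)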
This is the output of the above code:
| XNS Date | Narration | Withdrawl | Credits |
|------------|-----------|-----------|---------|
| 01/04/2018 | IMPS 1234 | 2200 | |
| 02/04/2018 | NEFT 4567 | | 4500 |
| 03/04/2018 | RTGS 2234 | | 5500 |
And this is the actual data in the PDF file:
| XNS Date | Narration | Withdrawl | Credits |
|------------|-----------|-----------|---------|
| 30/03/2018 | NEFT 445 | | 1200 |
| 31/03/2018 | RTGS 556 | | 2000 |
| 01/04/2018 | IMPS 1234 | 2200 | |
| 02/04/2018 | NEFT 4567 | | 4500 |
| 03/04/2018 | RTGS 2234 | | 5500 |
| 04/04/2018 | POS | 1500 | |
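
To narrow down where the rows go missing, it may help to count rows after each post-processing step, so it becomes visible whether tabula never returned them or whether drop_duplicates() or the null filter removed them. The following is only a diagnostic sketch, not a confirmed diagnosis: the helper name `trace_row_loss` is made up for illustration, and the demo frame is a hand-written stand-in (including a hypothetical mis-parsed row where the date and narration landed in one cell), not real tabula output.

```python
import numpy as np
import pandas as pd


def trace_row_loss(tables):
    """Repeat the post-processing steps and report the row count after each.

    `tables` is a list of DataFrames, such as the list tabula.read_pdf()
    returns with multiple_tables=True. Returns (counts, dropped), where
    `dropped` holds the rows removed by the null filter.
    """
    raw = pd.concat(tables, ignore_index=True)
    deduped = raw.drop_duplicates()
    blanked = deduped.replace(r'^\s*$', np.nan, regex=True)
    # Same condition as in the question: keep rows with fewer than
    # (number of columns - 2) missing cells.
    keep = blanked.isnull().sum(axis=1) < blanked.shape[1] - 2
    counts = {
        "raw": len(raw),
        "deduped": len(deduped),
        "filtered": int(keep.sum()),
    }
    return counts, blanked[~keep]


# Stand-in for tabula output (assumption: hand-written, not extracted data).
demo_tables = [
    pd.DataFrame([
        ["30/03/2018", "NEFT 445", "", "1200"],
        ["01/04/2018", "IMPS 1234", "2200", ""],
        # Hypothetical mis-parse: date and narration merged into one cell.
        ["31/03/2018 RTGS 556", "", "", "2000"],
    ])
]

counts, dropped = trace_row_loss(demo_tables)
print(counts)
print(dropped)
```

If "raw" is already smaller than the number of rows in the PDF, the rows are lost during extraction; if the count only falls at "deduped" or "filtered", the post-processing is removing them, and `dropped` shows which rows the filter discarded.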