Tabula-py not reading the full data of a file


I am trying to read a table from a PDF file using tabula's read_pdf() method, but it does not read the complete table: some rows are missing. This is the code I am using:

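import numpy as np
import pandas as pd
import tabula

# f (the PDF) and pas (its password) are defined elsewhere.
tables = tabula.read_pdf(f,
                         stream=True,
                         pages="all",
                         silent=False,
                         multiple_tables=True,
                         password=pas,
                         pandas_options={'header': None}
                        )
# Combine the extracted tables and drop exact duplicate rows.
df = pd.concat(tables).drop_duplicates()
# Treat empty or whitespace-only cells as missing values.
df = df.replace(r'^\s*$', np.nan, regex=True)
# Keep only rows with fewer than (number of columns - 2) missing values.
df = df[df.isnull().sum(axis=1) < df.shape[1] - 2].reset_index(drop=True)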

This is the output of the code above:

| XNS Date   | Narration | Withdrawl | Credits |
|------------|-----------|-----------|---------|
| 01/04/2018 | IMPS 1234 | 2200      |         |
| 02/04/2018 | NEFT 4567 |           | 4500    |
| 03/04/2018 | RTGS 2234 |           | 5500    |

And this is the actual data in the PDF file:

| XNS Date   | Narration | Withdrawl | Credits |
|------------|-----------|-----------|---------|
| 30/03/2018 | NEFT 445  |           | 1200    |
| 31/03/2018 | RTGS 556  |           | 2000    |
| 01/04/2018 | IMPS 1234 | 2200      |         |
| 02/04/2018 | NEFT 4567 |           | 4500    |
| 03/04/2018 | RTGS 2234 |           | 5500    |
| 04/04/2018 | POS       | 1500      |         |
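
One way to narrow down where the rows go missing is to apply the same clean-up steps one at a time and record the row count after each. The sketch below is only a diagnostic, not a fix. The helper name `trace_row_counts` and the sample frames are hypothetical: the frames stand in for the list returned by `tabula.read_pdf(...)`, and the last sample row assumes, purely for illustration, that the date and narration were merged into one cell during extraction.

```python
import numpy as np
import pandas as pd


def trace_row_counts(tables):
    """Repeat the clean-up steps and record the row count after each one."""
    counts = {}
    df = pd.concat(tables, ignore_index=True)
    counts["extracted"] = len(df)

    df = df.drop_duplicates()
    counts["after_drop_duplicates"] = len(df)

    # Treat empty or whitespace-only cells as missing values.
    df = df.replace(r"^\s*$", np.nan, regex=True)
    # Keep rows with fewer than (number of columns - 2) missing values.
    df = df[df.isnull().sum(axis=1) < df.shape[1] - 2]
    counts["after_sparse_filter"] = len(df)
    return df.reset_index(drop=True), counts


# Hypothetical stand-in for the output of tabula.read_pdf(...).
sample_tables = [
    pd.DataFrame([["30/03/2018", "NEFT 445", np.nan, "1200"],
                  ["01/04/2018", "IMPS 1234", "2200", np.nan]]),
    pd.DataFrame([["02/04/2018", "NEFT 4567", np.nan, "4500"],
                  # Assumed mis-parse: date and narration in a single cell.
                  ["04/04/2018 POS", np.nan, "1500", np.nan]]),
]

cleaned, counts = trace_row_counts(sample_tables)
print(counts)
```

If the `extracted` count is already lower than the number of rows in the PDF, the rows are being lost during extraction rather than in the pandas steps. In that case it may be worth experimenting with other `read_pdf` options such as `lattice=True`, `guess=False`, or an explicit `area`. If the count only drops after the filter, as it does for the mis-parsed row in this sample, the null-count threshold is the more likely culprit.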
