Problem Statement:
I'm using the Tabula app's user interface to select the table area in a PDF file and export that selection as a Tabula template in JSON format.
The table that the Tabula app extracts from the selected area is correct.
However, when I use the read_pdf_with_template method, it returns a list. When I convert this list to a DataFrame, several columns are merged into one.
Code Snippet:
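import numpy as np
import pandas as pd
import tabula

# f is the path to the source PDF; the template was exported from the Tabula app
tables = tabula.read_pdf_with_template(f, "tabula_saved.json")
df = pd.concat(tables).drop_duplicates()
# Treat empty or whitespace-only cells as missing
df = df.replace(r'^\s*$', np.nan, regex=True)
# Keep only rows that have more than two non-null cells
df = df[df.isnull().sum(axis=1) < df.shape[1] - 2].reset_index(drop=True)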
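To narrow down where the columns get merged, it may help to look at the shape of each table in the returned list before concatenating. The following is only a minimal sketch: describe_tables is a hypothetical helper (not part of tabula-py), and it assumes every item in the list exposes a `.shape` attribute, as pandas DataFrames do.

```python
def describe_tables(tables):
    """Return (index, n_rows, n_cols) for every table in the list.

    Hypothetical helper: relies only on each item exposing `.shape`,
    as pandas DataFrames do.
    """
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]


# Intended use with the list from the snippet above (not executed here):
# for i, n_rows, n_cols in describe_tables(tables):
#     print(f"table {i}: {n_rows} rows x {n_cols} columns")
```

If an individual table already reports fewer columns than expected, the merge happens during extraction rather than in pd.concat.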
DataFrame:
The table as shown in the Tabula app interface:
| Txn Date | Value Date | Brn Code | Particulars | Ref No | Debit | Credit | Balance |
|-------------|------------|----------|-----------------------------|--------|-----------|-------------|-------------|
| 01/02/2018 | 01/02/2018 | 7777 | 31-JAN-18M2M Cash Dep Chrgs | | 202.00 | | 40,233.11 |
| 01/02/2018 | 01/02/2018 | 4115 | NEFT : 00003- TV 18 HOME | | | 5,52,743.00 | 5,92,976.11 |
| 01/02/2018 | 01/02/2018 | 4115 | NEFT : AXISP1-TECH | | | 25,252.00 | 6,18,228.11 |
| 01/02/2018 | 01/02/2018 | 1221 | To ECS : ECS-TP UIA | 911387 | 66,733.00 | | 5,51,495.11 |
The DataFrame after calling read_pdf_with_template, which returns a list, and converting that list to a DataFrame:
| 0 | 1 | 2 |
|-------------|-------------------------------------------------------|-------------|
| 01/02/2018 | 01/02/2018 7777 31-JAN-18M2M Cash Dep Chrgs 202.00 | 40,233.11 |
| 01/02/2018 | 01/02/2018 4115 NEFT : 00003- TV 18 HOME 5,52,743.00 | 5,92,976.11 |
| 01/02/2018 | 01/02/2018 4115 NEFT : AXISP1-TECH 25,252.00 | 6,18,228.11 |
| 01/02/2018 | 01/02/2018 1221 To ECS : ECS-TP UIA 911387 66,733.00 | 5,51,495.11 |
Note: Please ignore the column headers in this question.