Convert List to DataFrame | tabula-py | read_pdf_with_template()


Problem Statement:

I'm using the Tabula app's user interface to select the dimensions of a table in a PDF file, and I export that selection as a tabula-template in JSON format.

The table that the Tabula app extracts after I select the table dimensions is correct.

However, the read_pdf_with_template() method returns a list object, and when I convert that list to a DataFrame, different columns end up merged into one.


Code Snippet:

  1. After calling the read_pdf_with_template() method:
  • file is the PDF file.
  • tabula_saved.json is the JSON dimension template for the PDF file, created with the Tabula app interface.
import tabula

tables = tabula.read_pdf_with_template(file, "tabula_saved.json")
tables

Output:

[   0  \ 
 0   01/02/2018   
 1   01/02/2018   
 2   01/02/2018   
 3   01/02/2018    
 
   1 \
                      
 0   01/02/2018 7777 31-JAN-18M2M Cash Dep Chrgs 202.00                          
 1   01/02/2018 4115 NEFT : 00003- TV 18 HOME  5,52,743.00                         
 2   01/02/2018 4115 NEFT : AXISP1-TECH 25,252.00                          
 3   01/02/2018 1221 To ECS  : ECS-TP UIA 911387 66,733.00                         
 
      2   
 0     40,233.11  
 1   5,92,976.11  
 2   6,18,228.11  
 3   5,51,495.11
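  2. After trying to convert the list to a DataFrame with the code below:
import numpy as np
import pandas as pd

# Stack all extracted tables and drop duplicate rows.
df = pd.concat(tables).drop_duplicates()
# Treat empty or whitespace-only cells as missing.
df = df.replace(r'^\s*$', np.nan, regex=True)
# Keep only rows that have more than two non-null cells.
df = df[df.isnull().sum(axis=1) < df.shape[1] - 2].reset_index(drop=True)
df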

Output:

| 0           | 1                                                     | 2           |
|-------------|-------------------------------------------------------|-------------|
| 01/02/2018  | 01/02/2018 7777 31-JAN-18M2M Cash Dep Chrgs 202.00    | 40,233.11   |
| 01/02/2018  | 01/02/2018 4115 NEFT : 00003- TV 18 HOME  5,52,743.00 | 5,92,976.11 |
| 01/02/2018  | 01/02/2018 4115 NEFT : AXISP1-TECH 25,252.00          | 6,18,228.11 |
| 01/02/2018  | 01/02/2018 1221 To ECS  : ECS-TP UIA 911387 66,733.00 | 5,51,495.11 |
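One thing worth checking here: the printed list already shows only three columns (0, 1, 2), which suggests the columns are merged at extraction time rather than by pd.concat. The sketch below illustrates that check with a hand-built stand-in for the returned list (two of the sample rows, typed in by hand; it does not call tabula), so the variable names and data are assumptions for illustration only.

```python
import pandas as pd

# Stand-in for the list returned by read_pdf_with_template(): one DataFrame
# holding the three columns shown in the printed output (sample rows only).
tables = [
    pd.DataFrame({
        0: ["01/02/2018", "01/02/2018"],
        1: ["01/02/2018 7777 31-JAN-18M2M Cash Dep Chrgs 202.00",
            "01/02/2018 4115 NEFT : AXISP1-TECH 25,252.00"],
        2: ["40,233.11", "6,18,228.11"],
    })
]

# Shape of each table as extracted, before any pandas processing.
shapes_before = [t.shape for t in tables]

# Concatenating stacks the rows; it does not change the column count here.
combined = pd.concat(tables, ignore_index=True)

print(shapes_before)
print(combined.shape)
```

If the shapes match before and after, the conversion code is not what merges the columns, and the fix has to happen either in the extraction step or by splitting the merged column afterwards.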

The DataFrame extracted in the Tabula app interface, which is correct:

| Txn Date    | Value Date | Brn Code | Particulars                 | Ref No | Debit     | Credit      | Balance           |
|-------------|------------|----------|-----------------------------|--------|-----------|-------------|-------------|
| 01/02/2018  | 01/02/2018 | 7777     | 31-JAN-18M2M Cash Dep Chrgs |        | 202.00    |             | 40,233.11   |
| 01/02/2018  | 01/02/2018 | 4115     | NEFT : 00003- TV 18 HOME    |        |           | 5,52,743.00 | 5,92,976.11 |
| 01/02/2018  | 01/02/2018 | 4115     | NEFT : AXISP1-TECH          |        |           | 25,252.00   | 6,18,228.11 |
| 01/02/2018  | 01/02/2018 | 1221     | To ECS  : ECS-TP UIA        | 911387 | 66,733.00 |             | 5,51,495.11 |

Note: Please ignore the column headers in this question.


There is 1 best solution below


Do you want this?

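# Join the listed columns into one space-separated string per row
# (this assumes every listed column holds strings, with no NaN values).
col_list = ['Value Date', 'Brn Code', 'Particulars', 'Ref No', 'Debit', 'Credit']
df['merged'] = df.apply(lambda x: ' '.join(x[col] for col in col_list), axis=1)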
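If the goal is the other direction, splitting the merged middle column back apart, a regular expression over the sample rows in the question might work as a workaround. This is only a sketch under assumptions: the pattern is inferred from the four rows shown, the column names (txn_date, value_date, brn_code, particulars, amount, balance) are made up for illustration, Ref No stays inside particulars, and the text alone does not say whether the amount is a debit or a credit.

```python
import pandas as pd

# Two of the sample rows from the question, typed in by hand.
df = pd.DataFrame({
    0: ["01/02/2018", "01/02/2018"],
    1: ["01/02/2018 7777 31-JAN-18M2M Cash Dep Chrgs 202.00",
        "01/02/2018 1221 To ECS  : ECS-TP UIA 911387 66,733.00"],
    2: ["40,233.11", "5,51,495.11"],
})

# Assumed layout of the merged text: date, branch code, free text, amount.
pattern = (
    r"^(?P<value_date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<brn_code>\d+)\s+"
    r"(?P<particulars>.*?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})$"
)

# str.extract returns one column per named group; unmatched rows become NaN.
parts = df[1].str.extract(pattern)

# Reassemble: first column, the extracted pieces, then the last column.
result = pd.concat(
    [df[[0]].rename(columns={0: "txn_date"}),
     parts,
     df[[2]].rename(columns={2: "balance"})],
    axis=1,
)
print(result)
```

Rows that do not fit the assumed layout come back as NaN in the extracted columns, so it is worth checking result.isna() before relying on the split.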