Sort tables based on column names in python

48 Views Asked by At

I have a pdf file with mixed tables with different columns. I am reading tables from each page using tabula. Now I need to sort these tables to different tables based on common column name. Presicely, the steps involved would be -

  1. Read the tables using tabula
  2. Compare the tables with my dataframe. If columns don't match, create a new dataframe and add this dataframe to a list of dataframe
  3. Repeat this process with for table in next page.

Please suggest a code for this

df_try = tabula.read_pdf(file,pages=106)[0]
ind = 2
obj = PyPDF2.PdfReader(file)
NumPages = len(obj.pages)
while ind < len(j)-1:
    df_new = tabula.read_pdf(file,pages=j[ind])[0]
    if len(df_new.columns) == len(df_try.columns):
        if df_new.columns == df_try.columns:
             df_try = pd.concat([df_try,df_new], axis=0, ignore_index = True)
    else:
        print("Page not included",j[ind] )
    ind = ind+1

I am not able to create a new dataframe with variable name. Also, I need to compare the columns with all dataframe in the list.

1

There are 1 best solutions below

0
On

You can read all the pages using :

all_dfs = tabula.read_pdf(file,pages"all")

Then create dict to store dfs with similar columns where columns tuple is the key :

import pandas as pd
from collections import defaultdict

# this is example
all_dfs = [
    pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}), #A,B
    pd.DataFrame({'A': [4, 5, 6], 'B': ['d', 'e', 'f']}), #A,B
    pd.DataFrame({'B': ['g', 'h', 'i'], 'C': [7, 8, 9]})  #B,C
]

# creat  dict 
dfs_dict = defaultdict(list)

for df in all_dfs:
    # Create the key 
    cols_key = tuple(df.columns)
    # add the df 
    dfs_dict[cols_key].append(df)

dfs_dict.keys()

output:

dict_keys([('A', 'B'), ('B', 'C')])

Note : we do not need to check if the key already exist defaultdict it set default value as []

Now we can merge similar dfs using pd.concat:

dfs = {}
for cols, dfs_list in dfs_dict.items():
    # merge all dfs
    df_merged = pd.concat(dfs_list, axis=0)
    # add the merged df to the new dict
    dfs[cols] = df_merged

dfs[('A', 'B')]

output :

index A B
0 1 a
1 2 b
2 3 c
0 4 d
1 5 e
2 6 f