Sort tables based on column names in python

Question

Sort tables based on column names in python

40 Views Asked by deboshree roy At 28 July 2025 at 00:48

I have a pdf file with mixed tables with different columns. I am reading tables from each page using tabula. Now I need to sort these tables to different tables based on common column name. Presicely, the steps involved would be -

Read the tables using tabula
Compare the tables with my dataframe. If columns don't match, create a new dataframe and add this dataframe to a list of dataframe
Repeat this process with for table in next page.

Please suggest a code for this

df_try = tabula.read_pdf(file,pages=106)[0]
ind = 2
obj = PyPDF2.PdfReader(file)
NumPages = len(obj.pages)
while ind < len(j)-1:
    df_new = tabula.read_pdf(file,pages=j[ind])[0]
    if len(df_new.columns) == len(df_try.columns):
        if df_new.columns == df_try.columns:
             df_try = pd.concat([df_try,df_new], axis=0, ignore_index = True)
    else:
        print("Page not included",j[ind] )
    ind = ind+1

I am not able to create a new dataframe with variable name. Also, I need to compare the columns with all dataframe in the list.

Original Q&A

There are 1 best solutions below

**حمزة نبيل** · Answer 1

You can read all the pages using :

all_dfs = tabula.read_pdf(file,pages"all")

Then create dict to store dfs with similar columns where columns tuple is the key :

import pandas as pd
from collections import defaultdict

# this is example
all_dfs = [
    pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}), #A,B
    pd.DataFrame({'A': [4, 5, 6], 'B': ['d', 'e', 'f']}), #A,B
    pd.DataFrame({'B': ['g', 'h', 'i'], 'C': [7, 8, 9]})  #B,C
]

# creat  dict 
dfs_dict = defaultdict(list)

for df in all_dfs:
    # Create the key 
    cols_key = tuple(df.columns)
    # add the df 
    dfs_dict[cols_key].append(df)

dfs_dict.keys()

output:

dict_keys([('A', 'B'), ('B', 'C')])

Note : we do not need to check if the key already exist defaultdict it set default value as []

Now we can merge similar dfs using pd.concat:

dfs = {}
for cols, dfs_list in dfs_dict.items():
    # merge all dfs
    df_merged = pd.concat(dfs_list, axis=0)
    # add the merged df to the new dict
    dfs[cols] = df_merged

dfs[('A', 'B')]

output :

index	A	B
0	1	a
1	2	b
2	3	c
0	4	d
1	5	e
2	6	f

Sort tables based on column names in python

There are 1 best solutions below

Related Questions in PYTHON-3.X

Related Questions in DATAFRAME

Related Questions in TABULA

Trending Questions

Popular # Hahtags

Popular Questions