I have a PDF file containing mixed tables with different columns. I am reading the tables from each page using tabula. Now I need to sort these tables into separate tables based on common column names. Precisely, the steps involved would be:
- Read the tables using tabula
- Compare each table with my dataframe. If the columns don't match, create a new dataframe and add it to a list of dataframes
- Repeat this process for the table on the next page.
Please suggest code for this.
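import pandas as pd
import PyPDF2
import tabula

# `file` is the path to the PDF (defined elsewhere)
obj = PyPDF2.PdfReader(file)
NumPages = len(obj.pages)
# `j` is not defined in this snippet; assumed here to be the list of page numbers
j = list(range(1, NumPages + 1))
df_try = tabula.read_pdf(file, pages=106)[0]
ind = 2
while ind < len(j) - 1:
    df_new = tabula.read_pdf(file, pages=j[ind])[0]
    # Index.equals returns a single bool; `==` compares element-wise
    # and raises an error when used directly in an `if`
    if df_new.columns.equals(df_try.columns):
        df_try = pd.concat([df_try, df_new], axis=0, ignore_index=True)
    else:
        print("Page not included", j[ind])
    ind = ind + 1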
I am not able to create a new dataframe with a variable name. Also, I need to compare the columns against every dataframe in the list.
You can read the tables from all the pages in a single call:
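A minimal sketch of that call, assuming tabula-py (which needs a Java runtime) and its `pages="all"` option. The helper name `read_all_tables` and the `pdf_path` parameter are hypothetical names used only for illustration:

```python
def read_all_tables(pdf_path):
    """Return a list of DataFrames, one per table that tabula detects."""
    # Imported lazily because tabula-py requires a Java runtime to be installed.
    import tabula

    # pages="all" asks tabula to scan every page instead of a single one.
    return tabula.read_pdf(pdf_path, pages="all")
```

Calling `dfs = read_all_tables(file)` would then give one list holding every detected table, so no per-page loop or page counter is needed.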
Then create a dict to store the dfs with the same columns, using the tuple of column names as the key.
Note: we do not need to check whether the key already exists, because defaultdict sets the default value to [].
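The grouping step could look roughly like this. It is a sketch under assumptions: `group_by_columns` is a hypothetical helper name, and the small DataFrames below are toy stand-ins for the tables tabula would return:

```python
from collections import defaultdict

import pandas as pd


def group_by_columns(dfs):
    """Group DataFrames by their column names (order-sensitive)."""
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for df in dfs:
        # A tuple is hashable, so it can serve as a dict key.
        groups[tuple(df.columns)].append(df)
    return groups


# Toy stand-ins for tables read from different pages.
dfs = [
    pd.DataFrame({"a": [1], "b": [2]}),
    pd.DataFrame({"x": [3], "y": [4], "z": [5]}),
    pd.DataFrame({"a": [6], "b": [7]}),
]
groups = group_by_columns(dfs)
```

Because the key is a tuple of the column names in order, two tables with the same names in a different order end up in different groups. If that is not wanted, sorting the names before building the tuple is one option.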
Now we can merge similar dfs using pd.concat:
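One way to write the merge, again as a sketch: `merge_groups` is a hypothetical helper, and `groups` is assumed to map a column tuple to a list of DataFrames sharing those columns (toy data shown here):

```python
import pandas as pd


def merge_groups(groups):
    """Concatenate each list of same-column DataFrames into one DataFrame."""
    return {
        cols: pd.concat(frames, axis=0, ignore_index=True)
        for cols, frames in groups.items()
    }


# Toy input: two tables share columns (a, b), one table stands alone.
groups = {
    ("a", "b"): [
        pd.DataFrame({"a": [1], "b": [2]}),
        pd.DataFrame({"a": [6], "b": [7]}),
    ],
    ("x",): [pd.DataFrame({"x": [3]})],
}
merged = merge_groups(groups)
```

`list(merged.values())` then gives the list of combined DataFrames, one per distinct set of columns, which avoids having to create a separately named variable for each new table.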
output :