Merging two chunked dataframes

92 Views Asked by At

Working with large datasets in Python via Pandas and initially chunked the two datasets so they could load into memory but not sure how to merge them given they are turned into TextFileReader instead of dataframe types. I tried a solution from SO and it failed and I will have to merge the outputs several times with other files later as well as other data processing.

I initially tried this

A = pd.read_csv('A.csv', chunksize = 100000)
B= pd.read_csv('B.csv', chunksize = 100000)

C= pd.merge('A', 'B', how = 'left', on ='ID')

but got an error when merging as they are not data frames so I tried a suggestion from here:

n = 200000  #chunk row size
list_df = [df2[i:i+n] for i in range(0, df2.shape[0],n)]

res = pd.DataFrame() 
for chunk in list_df:
    res = pd.concat([res, df1.merge(chunk, how='left', left_on=['x','y'], right_on['x','y']) 

This returned an error that says module 'Pandas' has no attribute 'dataframe'. Looking for a proper way to merge these.

I am also wondering if I will have to do the solution merge code every time before each merge in the script? As I said in the top I will have to do lots of merges and other data manipulation with the outputs. Is there a quicker way to do this instead of doing for loops before each merge?

1

There are 1 best solutions below

1
Ömer Sezer On

Dask may help you: https://docs.dask.org/en/stable/dataframe.html

Dask is parallel computing library that can efficiently handle large datasets by parallelizing operations.

Dask Code:

import dask.dataframe as dd
A = dd.read_csv('A.csv')
B = dd.read_csv('B.csv')

C = dd.merge(A, B, how='left', on='ID')
C = C.compute()
C.to_csv('C.csv', index=False)

I also tested generating A.csv, and B.csv with chunk 100K. After generation, dask code will merge them.

Generator A.csv and B.csv:

import pandas as pd
import numpy as np

n_rows = 100000
data = {
    'ID': range(1, n_rows + 1),
    'ColumnB': ['ValueB' + str(i) for i in range(1, n_rows + 1)]
    #'ColumnA': ['ValueA' + str(i) for i in range(1, n_rows + 1)]
}
df = pd.DataFrame(data)
df.to_csv('B.csv', index=False)
# df.to_csv('A.csv', index=False)