I have a very large CSV file (40 GB) that I want to split by column into 10 DataFrames and write each one out to its own CSV file (about 4 GB each). To save time I chose multiprocessing, but it doesn't seem to work: the columns are still processed one by one. Is it simply not possible to write large files with multiprocessing? Here is my code:
import os
import pandas
from multiprocessing import Pool

def split(i, output_path, original_large_data_path):
    data = pandas.read_csv(original_large_data_path)  # read in the large data
    new_data = data[[i]].dropna(how='all', subset=[i])  # keep column i and drop empty rows
    new_data.to_csv(os.path.join(output_path, '{}.csv'.format(i)))  # write csv

pool = Pool(5)
for i in [some columns]:
    r = pool.apply_async(split, (i, output_path, original_large_data_path))
pool.close()
pool.join()
Use map, partial and a context manager as follows:
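A minimal sketch of that pattern, reusing the split function from the question; output_path, original_large_data_path and some_columns are placeholders you would replace with your own values:

import os
from functools import partial
from multiprocessing import Pool

import pandas

def split(i, output_path, original_large_data_path):
    data = pandas.read_csv(original_large_data_path)  # each worker reads the large file
    new_data = data[[i]].dropna(how='all', subset=[i])  # keep column i, drop empty rows
    new_data.to_csv(os.path.join(output_path, '{}.csv'.format(i)))  # write one CSV per column

if __name__ == '__main__':
    output_path = 'output'                 # placeholder, adjust to your paths
    original_large_data_path = 'large.csv' # placeholder
    some_columns = ['col_a', 'col_b']      # placeholder list of column names

    # bind the fixed arguments so the pool only has to pass the column name
    worker = partial(split,
                     output_path=output_path,
                     original_large_data_path=original_large_data_path)

    with Pool(5) as pool:                  # context manager cleans the pool up for you
        pool.map(worker, some_columns)

pool.map blocks until every column has been written, so there is no separate close/join bookkeeping, and because the fixed arguments are bound with partial, each worker receives only the column name it should extract.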