I am reading five huge CSV files. All of them have the same number of rows, and that number runs into the millions. Because of memory constraints, I need to read them in batches and then join the data from the different files into a single DataFrame.
Below is what I have now:
import pandas as pd
it1 = pd.read_csv('1.csv', chunksize=10)
it2 = pd.read_csv('2.csv', chunksize=10)
it3, it4, and it5 are given in a list called list_iterators. That is:
list_iterators = [it3, it4, it5]
What I want to achieve is that whenever I perform a read operation, I get the next chunk from every iterator at once, in list form.
So the first time I read them, I will have:
[first 10 rows in 1.csv, first 10 rows in 2.csv, first 10 rows in 3.csv ... first 10 rows in 5.csv]
In order to achieve the desired outcome, what I am doing now is:
ak = zip(it1, it2, list_iterators[0], list_iterators[1], list_iterators[2])
ak.__next__()  # I will call this to read the next 10 rows
I wonder if there is any way I can pass list_iterators as an argument instead of spelling out all of its elements, because I won't know how many elements are in list_iterators when I write my program.
My second question: instead of using __next__(), is there a more elegant way of retrieving the data from the pandas iterators?
Yes, you can pass the contents of list_iterators using the * operator:
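ak = zip(it1, it2, *list_iterators)
ak.__next__()  # behaves exactly as before, no matter how many iterators the list holds
As for your second question, the built-in next(ak) does the same thing as ak.__next__() and is the idiomatic spelling. If you want to consume all the batches, you can also loop over the zip object directly. A minimal sketch; pd.concat(..., axis=1) is an assumption about how you want to join the per-file chunks into one DataFrame:
for chunks in zip(it1, it2, *list_iterators):
    # chunks is a tuple holding one 10-row DataFrame per file
    batch = pd.concat(chunks, axis=1)  # assumed column-wise join on the shared row index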