I have a large data file of shape (N, 4) which I am mapping line by line. My files are about 10 GB; a simplistic implementation is given below. Though the following works, it takes a huge amount of time.
I would like to implement this logic such that the text file is read directly and I can access the elements. Thereafter, I need to sort the whole (mapped) file based on column-2 elements.
The examples I see online assume a smaller piece of data (d) and use f[:] = d[:], but I can't do that, since d is huge in my case and eats up my RAM.
PS: I know how to load the file using np.loadtxt and sort it using argsort, but that logic fails (memory error) for GB-sized files. Would appreciate any direction.
import numpy as np

nrows, ncols = 20000000, 4  # nrows is really larger than this; this is just for illustration
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

filename = "my_file.txt"
with open(filename) as file:
    for i, line in enumerate(file):
        floats = [float(x) for x in line.split(',')]
        f[i, :] = floats  # write one parsed row into the memory-mapped array
del f  # flush the memmap to disk
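For reference, the in-memory np.loadtxt + argsort approach from the PS looks roughly like this; it is fine for small files but runs out of memory at this scale:

import numpy as np

# in-memory approach: the whole array (plus the sorted copy) must fit in RAM,
# hence the MemoryError for ~10 GB inputs
d = np.loadtxt("my_file.txt", delimiter=',', dtype=np.float32)
d_sorted = d[d[:, 1].argsort()]  # sort rows by column 2 (index 1)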
EDIT: Instead of do-it-yourself chunking, it's better to use the chunking feature of pandas, which is much, much faster than numpy's loadtxt. The pd.read_csv function in chunked mode returns a special object that can be used in a loop such as for chunk in chunks:; at every iteration, it reads a chunk of the file and returns its contents as a pandas DataFrame, which can be treated as a numpy array in this case. The parameter names is needed to prevent it from treating the first line of the csv file as column names.
Old answer below
The numpy.loadtxt function works with a filename, or with anything that returns lines when iterated over in a construct like for line in f:. It doesn't even need to pretend to be a file; a list of strings will do!
We can read chunks of the file that are small enough to fit in memory and provide batches of lines to np.loadtxt.

Disclaimer: I tested this in Linux. I expect this to work in Windows, but it could be that the handling of '\r' characters causes problems.
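A sketch of that batching idea, under the same assumptions as above (comma-separated values, shape and paths from the question; batch_size is an arbitrary placeholder):

import numpy as np
from itertools import islice

nrows, ncols = 20000000, 4
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

batch_size = 1_000_000                             # lines per batch; tune to available RAM
with open("my_file.txt") as infile:
    i = 0
    while True:
        lines = list(islice(infile, batch_size))   # a plain list of strings
        if not lines:
            break
        block = np.loadtxt(lines, delimiter=',', dtype=np.float32, ndmin=2)
        f[i:i + len(block)] = block                # copy the parsed batch into the memmap
        i += len(block)
del f                                              # flush the memmap to disk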