I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into PyTables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has `read_csv` and `to_hdf` methods in its `io_tools`, but I can't load the whole dataset at one time, so that won't work. Perhaps you can help me solve the problem correctly with other tools in PyTables or pandas.
Use `append=True` in the call to `to_hdf`:
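For instance, a minimal sketch with toy data (the file name, key, and chunk layout here are assumptions for illustration); reading the key back yields one concatenated DataFrame:

```python
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'  # assumed output path

# Write three small chunks to the same key; each call appends to the table.
for i in range(3):
    df = pd.DataFrame(np.arange(10 * i, 10 * (i + 1), dtype=float), columns=['x'])
    df.to_hdf(filename, key='data',
              mode='w' if i == 0 else 'a',   # truncate on the first write, then append
              format='table', append=True)

# Reading the key back returns the concatenated data.
# Note: the default RangeIndex restarts in each chunk, so the stored index repeats.
print(pd.read_hdf(filename, key='data'))
```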
Note that you need to use `format='table'` in the first call to `df.to_hdf` to make the table appendable. Otherwise the format is `'fixed'` by default, which is faster for reading and writing but creates a table that cannot be appended to.

Thus, you can process each CSV one at a time, using `append=True` to build the HDF5 file incrementally. Then overwrite the DataFrame, or use `del df`, to allow the old DataFrame to be garbage collected; a sketch of the loop follows.
to allow the old DataFrame to be garbage collected.Alternatively, instead of calling
df.to_hdf
, you could append to a HDFStore:yields
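A minimal sketch under the same toy-data assumptions as above; `HDFStore.append` writes in table format, so the dataset stays appendable, and reading the key back yields the concatenated frame:

```python
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'  # assumed output path

with pd.HDFStore(filename, mode='w') as store:
    for i in range(3):
        df = pd.DataFrame(np.arange(10 * i, 10 * (i + 1), dtype=float), columns=['x'])
        store.append('data', df)   # appends in table format by default

print(pd.read_hdf(filename, key='data'))
```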