I have a 100M line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't build the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has `read_csv` and `to_hdf` methods in its io_tools, but I can't load the whole dataset at one time, so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
Use `append=True` in the call to `to_hdf`, as in the sketch below.
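A minimal sketch of that call pattern, using a hypothetical file name (`out.h5`), key (`'data'`), and two small example DataFrames that are not from the question: the second call with `append=True` adds rows to the existing table instead of overwriting it.

```python
import pandas as pd

# Two small example DataFrames standing in for successive pieces of the data.
df1 = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"value": [4.0, 5.0, 6.0]})

# The first write must use format='table' so the resulting table is appendable.
df1.to_hdf("out.h5", key="data", mode="w", format="table")

# Subsequent writes with append=True add rows to the same table.
df2.to_hdf("out.h5", key="data", format="table", append=True)

print(pd.read_hdf("out.h5", "data"))  # six rows: 1.0 through 6.0
```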
Note that you need to use `format='table'` in the first call to `df.to_hdf` to make the table appendable. Otherwise the format is `'fixed'` by default, which is faster for reading and writing but creates a table that cannot be appended to.

Thus, you can process each CSV one at a time, using `append=True` to build the HDF5 file, then overwrite the DataFrame or use `del df` so the old DataFrame can be garbage collected. A sketch of that loop follows below.
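A sketch of that per-file loop, under the assumption that the CSV files share a single float column and that the names here (the glob pattern, `out.h5`, `'data'`, the chunk size) are placeholders for the real ones. Reading each CSV in chunks keeps individual files from having to fit in memory either.

```python
import glob
import pandas as pd

csv_files = sorted(glob.glob("data/*.csv"))  # hypothetical location of the CSVs
first_write = True

for path in csv_files:
    # Each chunk replaces the previous one, so old data is garbage collected.
    for chunk in pd.read_csv(path, chunksize=1_000_000):
        chunk.to_hdf(
            "out.h5",
            key="data",
            mode="w" if first_write else "a",  # start a fresh file, then append
            format="table",                    # required for an appendable table
            append=not first_write,
        )
        first_write = False
```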
Alternatively, instead of calling `df.to_hdf`, you could append to an `HDFStore`, as in the sketch below.
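A minimal sketch of the `HDFStore` approach, again with assumed file name, key, and example DataFrames. `HDFStore.append` writes in the appendable `'table'` format by default, so no explicit `format` argument is needed.

```python
import pandas as pd

df1 = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"value": [4.0, 5.0, 6.0]})

# Open the store once and append each piece to the same key.
with pd.HDFStore("out.h5", mode="w") as store:
    store.append("data", df1)
    store.append("data", df2)

print(pd.read_hdf("out.h5", "data"))  # six rows, same result as with to_hdf
```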