Goal
Read the data component of an HDF5 file in R.
Problem
I am using rhdf5 to read HDF5 files in R. Out of 75 files, it read 61 successfully, but it throws a memory error for the remaining 14, even though some of those are shorter than files it has already read.
I have tried running individual files in a fresh R session, but get the same error.
Following is an example:
# Exploring the contents of the file:
library(rhdf5)
h5ls("music_0_math_0_simple_12_2022_08_08.hdf5")
  group                name       otype  dclass       dim
0     /                data   H5I_GROUP
1 /data           ACC_State H5I_DATASET INTEGER     1 x 1
2 /data    ACC_State_Frames H5I_DATASET INTEGER         1
3 /data         ACC_Voltage H5I_DATASET   FLOAT 24792 x 1
4 /data AUX_CACC_Adjust_Gap H5I_DATASET INTEGER 24792 x 1
... CONTINUES ----
# Reading the file:
rhdf5::h5read("music_0_math_0_simple_12_2022_08_08.hdf5", name = "data")
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter.
In addition: Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
Error: Error in h5checktype(). H5Identifier not valid.
I can read the file via Python:
import h5py
filename = "music_0_math_0_simple_12_2022_08_08.hdf5"
hf = h5py.File(filename, "r")
hf.keys()
data = hf.get('data')
data['SCC_Follow_Info']
#<HDF5 dataset "SCC_Follow_Info": shape (9, 24792), type "<f4">
How can I successfully read the file in R?
When you ask to read the data group, rhdf5 will read all the underlying datasets into R's memory. It's not clear from your example exactly how much data this is, but maybe for some of your files it really is more than the available memory on your computer. I don't know how Python works under the hood, but perhaps it doesn't do any reading of datasets until you run data['SCC_Follow_Info']?

One option to try is to be more selective: rather than reading the entire data group, read only the specific dataset you're interested in at that moment. In the Python example that seems to be /data/SCC_Follow_Info. You can do that by passing the dataset's full path, rather than the group name, as the name argument of h5read() and assigning the result to a variable such as follow_info.
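A minimal sketch of that selective read. The dataset path is taken from the Python output in the question (the h5ls() listing there is truncated, so I'm assuming it appears under the same name in R), read_dataset is a hypothetical helper name rather than part of rhdf5, and the file.exists() check is only there so the snippet runs without the file present:

```r
library(rhdf5)

# Hypothetical helper: read one dataset by its full path
# instead of pulling in the whole "data" group.
read_dataset <- function(file, path) {
  h5read(file, name = path)
}

h5_file <- "music_0_math_0_simple_12_2022_08_08.hdf5"

# Release handles left open by the earlier failed read, as the
# warning in the question recommends.
h5closeAll()

if (file.exists(h5_file)) {
  follow_info <- read_dataset(h5_file, "/data/SCC_Follow_Info")
  print(dim(follow_info))
}
```

If a single dataset is still too large, h5read() also accepts an index argument (the error message hints at this) for reading only part of it, e.g. index = list(1:1000, NULL) for the first 1000 entries along the first dimension of a two-dimensional dataset. The 1000 is an arbitrary illustrative value.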
Once you've finished working with that dataset, remove it from your R session, e.g. with rm(follow_info), and then read the next dataset or file you need.
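If you need to work through every dataset in the group, the same read-then-remove pattern can be put in a loop. This is only a sketch under assumptions: process_group is a hypothetical helper, it assumes all the datasets sit directly under /data as in your h5ls() output, and fun stands in for whatever per-dataset work you actually need:

```r
library(rhdf5)

# Hypothetical helper: apply `fun` to each dataset directly under `group`,
# holding only one dataset in memory at a time.
process_group <- function(file, group = "/data", fun = length) {
  contents <- h5ls(file)
  is_ds <- contents$group == group & contents$otype == "H5I_DATASET"
  ds_names <- as.character(contents$name[is_ds])

  results <- vector("list", length(ds_names))
  names(results) <- ds_names
  for (ds in ds_names) {
    x <- h5read(file, name = paste(group, ds, sep = "/"))
    results[ds] <- list(fun(x))  # keep only the (small) result
    rm(x)                        # drop the full dataset before the next read
  }
  h5closeAll()
  results
}
```

For example, process_group("music_0_math_0_simple_12_2022_08_08.hdf5", fun = dim) would return just the dimensions of each dataset, which should also tell you how much data the group really holds.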