Handling multiple large .h5 files for creating data loader objects in Pytorch


I am working with a dataset (Ninapro DB2). The whole dataset is split across 8 individual .h5 files, each about 30 GB. Here is an example for one such file:

import h5py

# Load all four arrays from one file into memory ([:] reads the full dataset)
with h5py.File('Sub1_5.h5', 'r') as f:
    train_data = f['key1'][:]
    train_labels = f['key2'][:]
    test_data = f['key3'][:]
    test_labels = f['key4'][:]

print(train_data.shape)
print(len(train_labels))
print(test_data.shape)
print(len(test_labels))

Output:

(513636, 12, 400)
513636
(256729, 12, 400)
256729

This is for one file. Each file contains approximately 700k records, including both training and testing. In total, across all 8 files, I have about 5.6 million records.

I can't load the whole dataset at once because it exceeds my CPU RAM. My goal is to create DataLoader objects in PyTorch with a batch size of, say, 512. Each batch should contain randomly selected samples from all 8 files, and each sample should be used only once per epoch. I asked ChatGPT and it gave me a custom Dataset implementation, but it did not work.

So, how do I achieve this: load the files individually (or index them lazily), draw random samples across all of them, and finally build DataLoader objects? I am hoping for PyTorch code, or at least an idea of how to deal with files this large when creating data loaders.
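
For reference, this is roughly the lazy-indexing approach I have in mind (a minimal sketch only; the file names, the 'key1'/'key2' key names, and the tensor dtypes are placeholders based on my files above, and I have not verified it on the full dataset):

import bisect
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MultiH5Dataset(Dataset):
    """Indexes the training split of several .h5 files without loading them into RAM."""

    def __init__(self, file_paths, data_key='key1', label_key='key2'):
        self.file_paths = list(file_paths)
        self.data_key = data_key
        self.label_key = label_key
        self._files = [None] * len(self.file_paths)  # opened lazily, one handle per worker

        # One cheap pass to record how many samples each file holds
        lengths = []
        for path in self.file_paths:
            with h5py.File(path, 'r') as f:
                lengths.append(f[data_key].shape[0])
        # cumulative_ends[i] = total number of samples in files 0..i
        self.cumulative_ends = np.cumsum(lengths).tolist()

    def __len__(self):
        return self.cumulative_ends[-1]

    def _file(self, file_idx):
        # Open the HDF5 file on first access inside the current process/worker
        if self._files[file_idx] is None:
            self._files[file_idx] = h5py.File(self.file_paths[file_idx], 'r')
        return self._files[file_idx]

    def __getitem__(self, idx):
        # Map the global index to (file index, index within that file)
        file_idx = bisect.bisect_right(self.cumulative_ends, idx)
        local_idx = idx - (self.cumulative_ends[file_idx - 1] if file_idx > 0 else 0)

        f = self._file(file_idx)
        x = f[self.data_key][local_idx]   # reads only this one sample from disk
        y = f[self.label_key][local_idx]
        # dtypes assumed: float32 features, integer class labels
        return torch.as_tensor(x, dtype=torch.float32), torch.as_tensor(y, dtype=torch.long)


paths = [f'Sub1_{i}.h5' for i in range(1, 9)]   # placeholder file names
train_ds = MultiH5Dataset(paths, data_key='key1', label_key='key2')
train_loader = DataLoader(train_ds, batch_size=512, shuffle=True,
                          num_workers=4, pin_memory=True)

With shuffle=True the DataLoader permutes the global indices every epoch, so each sample is used exactly once per epoch and every batch mixes samples from all 8 files. The handles are opened inside __getitem__ rather than __init__ so that each DataLoader worker gets its own h5py.File object. My worry is that reading single samples like this from 30 GB files may be slow, especially if the datasets are chunked or compressed, so I am open to better approaches.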
