I need to load a time-series dataset to train a network. The dataset was split into many chunks `train_x_0.npy`, `train_x_1.npy`, ..., `train_x_40.npy` (41 chunks) because of memory issues when I extract these `.npy` files from the raw data. However, their total size is so large (around 1000 GB) that I couldn't load everything into RAM. I have been considering two ways to solve this problem.
- Loading the data chunks using `np.load()` with the argument `mmap_mode='r+'`. The memory-mapped chunks are stored in a Python list `self.data`. In the `__getitem__(self, idx)` method of the `PytorchDataset` class, I convert `idx` to `chunk_idx` and `sample_idx`, then get the sample with `self.data[chunk_idx][sample_idx]` (see the sketch after this list).
- Extracting the `.npy` files again from the raw data and saving the data sample-by-sample, i.e. one `.npy` file is now one sample, not a data chunk. In the `__getitem__(self, idx)` method, I would get one sample by loading it with `np.load(sample_path)`.
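For concreteness, the first approach would look roughly like the minimal sketch below (the class name and the cumulative-size bookkeeping for unequal chunk lengths are just placeholders):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ChunkedMmapDataset(Dataset):
    """Approach 1: memory-map every chunk and index into it on demand."""

    def __init__(self, chunk_paths):
        # mmap_mode keeps the arrays on disk; only the requested rows are read
        self.data = [np.load(p, mmap_mode='r') for p in chunk_paths]
        # cumulative sample counts, used to map a global idx to (chunk_idx, sample_idx)
        self.cum_sizes = np.cumsum([len(d) for d in self.data])

    def __len__(self):
        return int(self.cum_sizes[-1])

    def __getitem__(self, idx):
        chunk_idx = int(np.searchsorted(self.cum_sizes, idx, side='right'))
        sample_idx = idx - (self.cum_sizes[chunk_idx - 1] if chunk_idx > 0 else 0)
        # .copy() forces the actual read from disk before handing the array to torch
        return torch.from_numpy(self.data[chunk_idx][sample_idx].copy())
```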
Assuming the PyTorch `DataLoader` will be used to iterate through all samples, which method will be faster?
If you have another suggestion for extracting the raw data or loading the `.npy` files, please share your opinion.
Both suggested approaches will be limited by your filesystem's IO, since each sample will be loaded from disk on demand (memory mapping does not speed up the actual loading once a given sample is requested).
Especially when you are planning to train for many epochs, you can achieve a strong speedup by loading your original chunks `train_x_0.npy`, `train_x_1.npy`, etc. one at a time (or as many as you can hold in RAM) and training for multiple epochs on that chunk before switching to the next.

For this, you need control over the sample indices requested by the `DataLoader`. You can get that by defining a sampler which is passed only the sample indices available in the currently cached data chunk. In pseudocode, your training loop could look something like this when caching one chunk at a time:
For this to work, your `Dataset` class needs to take care of

- loading a given chunk into RAM (`cache_chunk` method)
- returning the sample indices belonging to a given chunk (`get_chunk_sample_inds` method)

If you use a fast GPU (which is often limited by shuffling data back and forth between RAM and VRAM, even for RAM-cached data), you can expect several orders of magnitude speedup using this approach (as opposed to attempting to fill the VRAM from HDD for each sample).
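A minimal sketch of such a `Dataset` (the chunk bookkeeping via cumulative offsets and the method names are illustrative; adapt them to your data layout):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class CachedChunkDataset(Dataset):
    def __init__(self, chunk_paths):
        self.chunk_paths = chunk_paths
        self.n_chunks = len(chunk_paths)
        # read only the array headers (via mmap) to learn each chunk's length cheaply
        chunk_lens = [len(np.load(p, mmap_mode='r')) for p in chunk_paths]
        self.offsets = np.concatenate([[0], np.cumsum(chunk_lens)])
        self._cached_idx = None
        self._cache = None

    def cache_chunk(self, chunk_idx):
        # fully load one chunk into RAM, replacing the previously cached one
        self._cache = np.load(self.chunk_paths[chunk_idx])
        self._cached_idx = chunk_idx

    def get_chunk_sample_inds(self, chunk_idx):
        # global sample indices covered by this chunk
        return list(range(int(self.offsets[chunk_idx]), int(self.offsets[chunk_idx + 1])))

    def __len__(self):
        return int(self.offsets[-1])

    def __getitem__(self, idx):
        chunk_idx = int(np.searchsorted(self.offsets, idx, side='right')) - 1
        assert chunk_idx == self._cached_idx, "sample requested outside the cached chunk"
        sample = self._cache[idx - int(self.offsets[chunk_idx])]
        return torch.from_numpy(np.asarray(sample))
```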