For machine learning, I need to read data from multiple large memmap files, combine the slices, and return the result.
The number of variables (files) used is defined by the user.
At the moment I store the file paths in a list:
memmap_path = ["Folder1/file2.dat", "Folder24/file28.dat", "Folder65/file1.dat"]
The data retrieval in the dataset class itself looks like this:
def __getitem__(self, index):
    list_of_arrays = [
        np.memmap(memmap_file, dtype='float32', mode='r', shape=(24000, 300, 300))[index]
        for memmap_file in memmap_path
    ]
    x = np.stack(list_of_arrays)
    y = self.targets[index]
    return torch.from_numpy(x), y
Does anyone have a better approach for this situation? I am aware that loops should be avoided where possible, but I am not sure how to do that here. I also thought about preallocating an array of zeros and filling it in a loop, instead of using np.stack on a list comprehension, but I am not sure whether that would actually improve performance. Any suggestions are welcome.
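For reference, the preallocation variant I had in mind would look roughly like this. It is only a sketch: the function name load_sample and the parametrized file_shape argument are made up for illustration, and in the real dataset class this logic would live inside __getitem__ with the hard-coded (24000, 300, 300) shape.

```python
import numpy as np

def load_sample(index, paths, file_shape=(24000, 300, 300)):
    """Read slice `index` from each memmap file into a preallocated array,
    instead of building a list of arrays and calling np.stack."""
    # One output row per file; each row has the per-sample shape.
    x = np.empty((len(paths),) + file_shape[1:], dtype='float32')
    for i, p in enumerate(paths):
        mm = np.memmap(p, dtype='float32', mode='r', shape=file_shape)
        x[i] = mm[index]  # copies one (300, 300) slice into the buffer
    return x
```

Whether this beats np.stack in practice is exactly what I am unsure about; both versions read the same bytes from disk, so the difference would only be in the extra intermediate list and the copy that np.stack performs.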