I found that all of the Theano/Lasagne examples deal with small datasets like MNIST and CIFAR-10, which can be loaded into memory completely.
My question is: how do you write efficient code for training on large-scale datasets? Specifically, what is the best way to prepare mini-batches (including real-time data augmentation) in order to keep the GPU busy?
Maybe something like Caffe's ImageDataLayer? For example, I have a big txt file which contains all the image paths and labels. Some example code would be appreciated.
Thank you very much!
In case the data doesn't fit into memory, a good approach is to prepare the minibatches and store them in an HDF5 file, which is then read at training time.
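A minimal sketch of that idea with h5py (the file name, array shapes, and `train_fn` are placeholders, not taken from any particular library):

```python
import h5py

# --- Offline preparation (done once): write all examples to an HDF5 file ---
# 'train.h5' and the shapes below are placeholders for your own data.
n_examples = 100000
with h5py.File('train.h5', 'w') as f:
    X = f.create_dataset('X', shape=(n_examples, 3, 224, 224), dtype='float32')
    y = f.create_dataset('y', shape=(n_examples,), dtype='int32')
    # ... loop over your txt file here, decoding each image into X[i], y[i] ...

# --- Training time: slice out one minibatch at a time ---
batch_size = 128
with h5py.File('train.h5', 'r') as f:
    n = f['X'].shape[0]
    for start in range(0, n, batch_size):
        X_batch = f['X'][start:start + batch_size]  # only this slice is read into RAM
        y_batch = f['y'][start:start + batch_size]
        # train_fn(X_batch, y_batch)  # your compiled Theano/Lasagne function
```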
However, this does not suffice when doing data augmentation, as that has to happen on the fly. Because of Python's global interpreter lock, images cannot be loaded and preprocessed in a background thread while the GPU is busy. The best way around this that I know of is the Fuel library. Fuel loads and preprocesses the minibatches in a separate Python process and then streams them to the training process over a TCP socket: http://fuel.readthedocs.org/en/latest/server.html#data-processing-server
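Roughly, following the linked docs, the setup consists of two scripts. This is an untested sketch that assumes 'train.h5' was written in Fuel's H5PYDataset format with 'features' and 'targets' sources; the file name and batch size are placeholders:

```python
# server.py -- runs as a separate Python process
from fuel.datasets.hdf5 import H5PYDataset
from fuel.schemes import ShuffledScheme
from fuel.streams import DataStream
from fuel.server import start_server

dataset = H5PYDataset('train.h5', which_sets=('train',))
stream = DataStream(dataset,
                    iteration_scheme=ShuffledScheme(dataset.num_examples, 128))
start_server(stream)  # serves minibatches over TCP (port 5557 by default)
```

```python
# train.py -- the training process pulls ready-made minibatches off the socket
from fuel.streams import ServerDataStream

stream = ServerDataStream(('features', 'targets'), produces_examples=False)
for X_batch, y_batch in stream.get_epoch_iterator():
    pass  # train_fn(X_batch, y_batch) -- your compiled Theano function
```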
It additionally provides some functions to preprocess the data, such as scaling and mean subtraction: http://fuel.readthedocs.org/en/latest/overview.html#transformers-apply-some-transformation-on-the-fly
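For example, transformers can be chained onto the stream in server.py before calling start_server; the scale factor and mean image below are placeholder values:

```python
import numpy as np
from fuel.transformers import ScaleAndShift, Mapping

# Rescale pixel intensities from [0, 255] to [0, 1]
stream = ScaleAndShift(stream, scale=1 / 255.0, shift=0.0,
                       which_sources=('features',))

# Mean subtraction via a custom batch-level mapping
mean_image = np.zeros((3, 224, 224), dtype='float32')  # placeholder mean

def subtract_mean(batch):
    features, targets = batch
    return features - mean_image, targets

stream = Mapping(stream, subtract_mean)
```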
Hope this helps. Michael