I am fitting a neural network model using TFRecords and Keras. I have a relatively big dataset which is pretty heterogeneous. I already shuffle my dataset during training, as in the documentation example: https://keras.io/examples/keras_recipes/tfrecord/ (but I can't shuffle the whole dataset because that would cost too much memory), and I have also split my dataset into small shards of equal size.
However, I have reasons to think that this "approximate" shuffling is not enough, and I also think that feeding already-shuffled data would speed up training.
So now my question is: after I have split my dataset into TFRecord shards, is it possible to write efficient code that randomly picks 2 shards, loads them, shuffles them together, and then rewrites 2 shards (whose records are now mixed between the two)? I could then repeat this process many times, which should result in correctly shuffled TFRecord files.
More precisely: I take 2 shards, shard1.tfrec and shard2.tfrec, load them into one tf.data.Dataset, shuffle it, and then output 2 shards of equal size again.
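For what it's worth, here is a minimal sketch of that merge-shuffle-split step. The helper name `shuffle_shard_pair` is hypothetical, and it assumes each individual shard is small enough that a pair of them fits in memory (which is the premise of the pairwise approach):

```python
import random

import tensorflow as tf


def shuffle_shard_pair(path_a, path_b, out_a, out_b):
    """Merge two TFRecord shards, shuffle the records, and write
    two new shards of (roughly) equal size.

    Hypothetical helper: loads both shards fully into memory, so it
    only works if a pair of shards fits in RAM.
    """
    # Read the raw serialized records (bytes) from both shards.
    ds = tf.data.TFRecordDataset([path_a, path_b])
    records = [r.numpy() for r in ds]

    # Full in-memory shuffle of the combined pair.
    random.shuffle(records)

    # Split back into two shards and rewrite them.
    half = len(records) // 2
    for out_path, chunk in ((out_a, records[:half]), (out_b, records[half:])):
        with tf.io.TFRecordWriter(out_path) as writer:
            for rec in chunk:
                writer.write(rec)
```

Repeating this over randomly chosen pairs of shards progressively mixes records across the whole dataset, so after enough passes the shards approximate a global shuffle.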
That might not answer your question directly, but I would like to address this point:
The number 2048 in the shuffle call is the buffer size, i.e., the number of elements held in the memory buffer from which elements are randomly selected. You can reduce this number drastically for better memory efficiency.
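To make the buffer semantics concrete, here is a plain-Python sketch (not TensorFlow's actual implementation) of the sampling scheme that `shuffle(buffer_size=N)` uses: fill a buffer of N elements, emit a random buffered element, and replace it with the next input element.

```python
import random


def buffered_shuffle(items, buffer_size, rng=random):
    """Plain-Python sketch of tf.data.Dataset.shuffle's buffer scheme:
    keep `buffer_size` elements in memory, emit a random one, refill
    its slot with the next input element."""
    buf, out = [], []
    it = iter(items)

    # Fill the buffer with the first `buffer_size` elements.
    for x in it:
        buf.append(x)
        if len(buf) == buffer_size:
            break

    # For each remaining element: emit a random buffered element,
    # then put the new element in its place.
    for x in it:
        i = rng.randrange(len(buf))
        out.append(buf[i])
        buf[i] = x

    # Input exhausted: drain the remaining buffer in random order.
    rng.shuffle(buf)
    out.extend(buf)
    return out
```

With `buffer_size` much smaller than the dataset, an element near the start of the input can only end up near the start of the output, which is exactly why this kind of shuffle is only "approximate" and why pre-shuffling the shards on disk helps.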
Note that reducing the buffer size makes the shuffle even less random than it is with a buffer size of 2048; see the `tf.data.Dataset.shuffle` documentation.