How to shuffle TFRecord files before feeding them to the model


I am fitting a neural network model using TFRecords and Keras. I have a relatively big dataset which is pretty heterogeneous. I already shuffle my dataset during training as in the documentation example: https://keras.io/examples/keras_recipes/tfrecord/ (but I can't shuffle the whole dataset at once because it would cost too much memory), and I have also split my dataset into small shards of equal size.

However, I have reasons to think that this "approximate" shuffling is not enough, and I also think that feeding already-shuffled data would speed up training.

So my question is: after I have split my dataset into TFRecord shards, is it possible to write efficient code that randomly picks 2 shards, loads them, shuffles their records together, and then rewrites 2 shards (whose records are now mixed between the two)? I could then repeat this process many times, which would result in correctly shuffled TFRecord files.

More precisely: I take 2 shards, shard1.tfrec and shard2.tfrec, load them into one tf.data.Dataset, shuffle it, and then write out 2 shards of equal size again.
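A minimal sketch of that pairwise re-shuffling step, assuming each shard's records have already been read into memory as lists of serialized examples. (In a real pipeline you would read them by iterating a `tf.data.TFRecordDataset` over the two files and write each half back with `tf.io.TFRecordWriter`; the function name `shuffle_shard_pair` here is hypothetical, just to illustrate the logic.)

```python
import random

def shuffle_shard_pair(records_a, records_b, seed=None):
    """Merge the records of two shards, shuffle them, and split the
    result back into two equally sized shards.

    records_a, records_b: lists of serialized tf.train.Example byte
    strings (or any records). Returns two lists of equal length.
    """
    combined = list(records_a) + list(records_b)
    random.Random(seed).shuffle(combined)
    half = len(combined) // 2
    return combined[:half], combined[half:]

# Repeating this over many randomly chosen shard pairs progressively
# mixes records across the whole dataset, similar to several passes
# of a merge-based shuffle.
```

Note that this holds only two shards in memory at a time, which is the point of the question: the memory cost is bounded by the shard size, not the dataset size.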

1 Answer
This might not answer your question directly, but I would like to address this point:

can't shuffle all because it would cost too much memory

The number 2048 in the shuffle call from that example is the buffer size, i.e. the number of elements held in the in-memory buffer from which elements are randomly sampled. You can reduce this number drastically for better memory efficiency.

def get_dataset(filenames, labeled=True):
    dataset = load_dataset(filenames, labeled=labeled)
    # Much smaller shuffle buffer than the example's 2048:
    # only 32 elements are held in memory at a time.
    dataset = dataset.shuffle(32)
    dataset = dataset.batch(BATCH_SIZE)
    # Prefetch last, so fully prepared batches are produced
    # in the background while the model trains.
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    return dataset

This will not be completely random, however; even less so than with a buffer size of 2048. See the documentation of tf.data.Dataset.shuffle:

Randomly shuffles the elements of this dataset.

This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.

For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.
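The buffer mechanism described above can be simulated in plain Python. This is a sketch of the algorithm as the documentation describes it, not TensorFlow's actual implementation:

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items roughly the way tf.data.Dataset.shuffle does:
    keep a buffer of buffer_size elements, repeatedly emit a random
    one, and refill the emitted slot from the input stream."""
    rng = random.Random(seed)
    it = iter(items)
    buffer = []
    for x in it:                       # fill the initial buffer
        buffer.append(x)
        if len(buffer) >= buffer_size:
            break
    for x in it:
        i = rng.randrange(len(buffer))
        yield buffer[i]                # emit a random buffered element
        buffer[i] = x                  # replace it with the next input
    rng.shuffle(buffer)                # drain what is left at the end
    yield from buffer
```

This makes the limitation visible: the first element emitted always comes from the first buffer_size inputs, so a small buffer only shuffles locally within a sliding window of the stream.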

reshuffle_each_iteration controls whether the shuffle order should be different for each epoch.