Optimize data input pipeline with a Keras DataGenerator by using tf.data.Dataset


I want to train my autoencoder on ~100k HDF5 files. I wrote a data generator using keras.utils.Sequence. Everything works fine, but now I'm running into a data bottleneck. I read some documentation on tf.data datasets and how they can perform much faster.

import os

import h5py
import numpy as np
from tensorflow import keras


class DataGenerator(keras.utils.Sequence):

    def __init__(self, path, batch_size):
        self.path = path
        self.batch_size = batch_size
        self.ids = os.listdir(self.path)

    def __len__(self):
        return int(np.floor(len(self.ids) / self.batch_size))

    def __getitem__(self, index):
        epsilons, fields = list(), list()

        for id in self.ids[index * self.batch_size:(index + 1) * self.batch_size]:
            # read one sample pair per HDF5 file
            with h5py.File(os.path.join(self.path, id), 'r') as hf:
                epsilons.append(np.array(hf.get('epsilon')))
                fields.append(np.array(hf.get('field')))

        return np.asarray(epsilons), np.asarray(fields)

Normally I would use my generator like this:

train = DataGenerator(args.p_train, args.bs)
m.fit(train, epochs=args.ep, callbacks=[tboard_callback])

Now I'm using the Dataset.from_generator method:

autotune = tf.data.experimental.AUTOTUNE

dataset = tf.data.Dataset.from_generator(lambda: train, (tf.float64, tf.float64))
dataset = dataset.prefetch(autotune)

m.fit(dataset, epochs=args.ep, callbacks=[tboard_callback])

Unfortunately, my basic approach needs 20 s per epoch, while the from_generator approach takes 31 s. Has anyone of you had similar problems and figured out how to make the data generator much faster?
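One direction I'm experimenting with is dropping the Sequence wrapper entirely and building the dataset from the file list, so tf.data can parallelize the per-file HDF5 reads. This is just a rough sketch under my assumptions (same 'epsilon'/'field' layout per file; load_pair and make_dataset are names I made up; tf.data.experimental.AUTOTUNE on older TF versions, tf.data.AUTOTUNE on newer ones):

```python
import os

import h5py
import numpy as np
import tensorflow as tf


def load_pair(path):
    # Runs outside the graph via tf.numpy_function; path arrives as bytes.
    with h5py.File(path.decode(), "r") as hf:
        return (np.asarray(hf["epsilon"], dtype=np.float32),
                np.asarray(hf["field"], dtype=np.float32))


def make_dataset(folder, batch_size):
    files = [os.path.join(folder, f) for f in os.listdir(folder)]
    ds = tf.data.Dataset.from_tensor_slices(files)
    ds = ds.shuffle(len(files))
    # Parallelize the HDF5 reads instead of reading files one by one
    # inside __getitem__.
    ds = ds.map(
        lambda p: tf.numpy_function(load_pair, [p], [tf.float32, tf.float32]),
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    )
    ds = ds.batch(batch_size)
    return ds.prefetch(tf.data.experimental.AUTOTUNE)
```

The idea is that from_generator can only call my Sequence sequentially, while map with num_parallel_calls lets several files be read concurrently while the GPU trains on the previous batch.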

Thanks, Lukas
