I want to train my autoencoder on ~100k HDF5 files. I wrote a data generator using keras.utils.Sequence. Everything works, but data loading has become the bottleneck. I read some documentation on tf.data datasets and how they are supposed to perform much faster.
import os
import h5py
import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    def __init__(self, path, batch_size):
        self.path = path
        self.batch_size = batch_size
        self.ids = os.listdir(self.path)

    def __len__(self):
        # Number of full batches per epoch
        return int(np.floor(len(self.ids) / self.batch_size))

    def __getitem__(self, index):
        epsilons, fields = [], []
        # Each HDF5 file holds one sample with an 'epsilon' and a 'field' dataset
        for fname in self.ids[index * self.batch_size:(index + 1) * self.batch_size]:
            with h5py.File(os.path.join(self.path, fname), 'r') as hf:
                epsilons.append(hf['epsilon'][()])
                fields.append(hf['field'][()])
        return np.asarray(epsilons), np.asarray(fields)
Normally I would use my generator like this:
train = DataGenerator(args.p_train, args.bs)
m.fit(train, epochs=args.ep, callbacks=[tboard_callback])
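One thing I have not tried yet: as far as I understand, TF 2.x Keras can also parallelize a Sequence itself through the workers and use_multiprocessing arguments of fit. The worker count here is just an example, not something I have tuned:

# Untested idea: let Keras load batches in parallel worker processes
m.fit(train, epochs=args.ep, callbacks=[tboard_callback],
      workers=4, use_multiprocessing=True)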
Now I'm using the Dataset.from_generator method:
import tensorflow as tf

autotune = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older TF versions
dataset = tf.data.Dataset.from_generator(lambda: train, (tf.float64, tf.float64))
dataset = dataset.prefetch(autotune)
m.fit(dataset, epochs=args.ep, callbacks=[tboard_callback])
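Since passing only the dtypes leaves the element shapes unknown, I also considered the newer output_signature form of from_generator (TF >= 2.4). The shapes below are placeholders for my real sample shapes:

# Placeholder shapes -- my actual (epsilon, field) batch shapes would go here
signature = (tf.TensorSpec(shape=(None, None, None), dtype=tf.float64),
             tf.TensorSpec(shape=(None, None, None), dtype=tf.float64))
dataset = tf.data.Dataset.from_generator(lambda: train, output_signature=signature)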
Unfortunately, my basic Sequence approach needs 20 s per epoch, while the from_generator approach takes 31 s. Has anyone of you had similar problems, or does anyone know how to make the data generator significantly faster?
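The only alternative I can think of is to drop the Sequence entirely and let tf.data parallelize the per-file reads itself, roughly like the sketch below. This is only a guess: load_sample is a name I made up, I assume each file holds exactly one epsilon/field pair, and py_function loses the static shape information, so I might still need to set shapes explicitly for the model.

def load_sample(path):
    # tf.py_function hands us the path as a byte-string tensor
    with h5py.File(path.numpy().decode(), 'r') as hf:
        return hf['epsilon'][()], hf['field'][()]

paths = [os.path.join(args.p_train, f) for f in os.listdir(args.p_train)]
dataset = (tf.data.Dataset.from_tensor_slices(paths)
           .map(lambda p: tf.py_function(load_sample, [p], (tf.float64, tf.float64)),
                num_parallel_calls=autotune)
           .batch(args.bs)
           .prefetch(autotune))
m.fit(dataset, epochs=args.ep, callbacks=[tboard_callback])

Would that be the right direction, or am I missing something simpler?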
Thanks, Lukas