Model training is running out of data

I am trying to perform distributed training on TPUs in Google Colab using TensorFlow, so I am distributing the dataset with strategy.experimental_distribute_dataset(). The issue appears when I call model.fit(): I have to pass the steps_per_epoch parameter, and by the usual convention I set it equal to the number of batches I have, but this leads to a "ran out of data" error. Kindly look at the code below:

import os
import tensorflow as tf

# Detect hardware
try:
  tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_address) # TPU detection
  tf.config.experimental_connect_to_cluster(tpu)
  tf.tpu.experimental.initialize_tpu_system(tpu)
  strategy = tf.distribute.TPUStrategy(tpu)
  # Going back and forth between TPU and host is expensive.
  # Better to run 128 batches on the TPU before reporting back.
  print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
  print("Number of accelerators: ", strategy.num_replicas_in_sync)
except ValueError:
  print('TPU failed to initialize.')


from tensorflow import keras

with strategy.scope():
  IMG_SIZE = 25
  BUFFER_SIZE = 2000

  def prepare_data(x, y, aug=False, BATCH_SIZE_PER_REPLICA=1):
    GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
    x_scaled = keras.layers.Rescaling(1./255)(x)
    x_scaled = keras.layers.Resizing(IMG_SIZE, IMG_SIZE)(x_scaled)
    dataset = tf.data.Dataset.from_tensor_slices((x_scaled, y)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)

    def preprocess(image, label):
      # Augmentation layers applied per batch when aug=True
      aug_layers = keras.Sequential([
          keras.layers.RandomRotation(0.1),
          keras.layers.RandomFlip('horizontal_and_vertical'),
      ])
      return aug_layers(image), label

    if aug:
      dataset = dataset.map(preprocess)
    number_of_batches = len(dataset)
    return number_of_batches, strategy.experimental_distribute_dataset(dataset)

number_of_batches_train, Train_dataset = prepare_data(x_train, y_train, aug=False, BATCH_SIZE_PER_REPLICA=10)
number_of_batches_test, Test_dataset = prepare_data(x_test, y_test, aug=False, BATCH_SIZE_PER_REPLICA=5)

from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

with strategy.scope():
  model = MyModel(10, (IMG_SIZE, IMG_SIZE, 1))

model.compile(loss=SparseCategoricalCrossentropy(), metrics=['accuracy'],
              optimizer=Adam(learning_rate=0.01), steps_per_execution=80)
history = model.fit(Train_dataset, epochs=100, verbose=1, steps_per_epoch=number_of_batches_train)

The error is

Epoch 1/100
750/750 [==============================] - 11s 15ms/step - loss: 0.9443 - accuracy: 0.6234
Epoch 2/100
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 75000 batches). You may need to use the repeat() function when building your dataset.

The same code runs perfectly fine when I don't distribute the data, but because of this issue I can never train beyond a certain number of epochs.

For reference, this is the Fashion-MNIST dataset with 60000 training samples. The global batch size I am using is 80 (10 per replica × 8 replicas), so there are 750 batches in total.
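
Based on that warning, my understanding is that the dataset would need a repeat() call before being distributed, with steps_per_epoch computed from the sample count instead of len(dataset). A rough sketch of what I mean is below (untested on my side; prepare_data_repeated is just an illustrative name):

def prepare_data_repeated(x, y, BATCH_SIZE_PER_REPLICA=10):
  GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
  x_scaled = keras.layers.Rescaling(1./255)(x)
  x_scaled = keras.layers.Resizing(IMG_SIZE, IMG_SIZE)(x_scaled)
  dataset = (tf.data.Dataset.from_tensor_slices((x_scaled, y))
             .shuffle(BUFFER_SIZE)
             .batch(GLOBAL_BATCH_SIZE)
             .repeat())  # endless stream of batches, as the warning suggests
  # len() is not defined on a repeated dataset, so derive the step count from the sample count
  steps_per_epoch = len(x) // GLOBAL_BATCH_SIZE  # 60000 // 80 = 750
  return steps_per_epoch, strategy.experimental_distribute_dataset(dataset)

If adding repeat() like this is really the expected fix, I would still like to understand why it is only needed when the dataset is distributed.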
