save the best model based on criteria in custom_training loop

Question

286 Views Asked by pro At 17 August 2025 at 21:26

I wrote a custom training loop following the TensorFlow tutorials. It trains and produces the following output:

Start of epoch 0
Training loss (for one batch) at step 0: 15.9249
Seen so far: 16 samples
Training loss (for one batch) at step 2: 14.9462
Seen so far: 48 samples
Training loss (for one batch) at step 4: 14.6554
Seen so far: 80 samples
Training loss (for one batch) at step 6: 14.1741
Seen so far: 112 samples
Training acc over epoch: 15.1999
Validation acc: 14.5266
Time taken: 8.02s

In a custom training loop, I do not know how to compile the model or how to save the best model based on this criterion: "if the loss on the validation set fails to decrease or remains constant for 10 consecutive epochs, the model is saved to model.h5 and training is stopped". I also want to save the training loss and validation loss of each epoch to a CSV file, similar to what the following Keras callbacks do.

#save_model_name = 'model_name' +'.h5'
#early_stopping = EarlyStopping(monitor='val_loss', patience=30, verbose=1)
#model_checkpoint = ModelCheckpoint(save_model_name,monitor='val_R2_score',
#                                   save_best_only=True, verbose=1, mode='max')
#reduce_lr = ReduceLROnPlateau(factor=0.5, monitor='val_loss',
#                              patience=15, min_lr=0.000001, verbose=1)

#csv_logger = CSVLogger(model_name +".csv", append=True)

My code is

import numpy as np
import tensorflow as tf


@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value


@tf.function
def test_step(x, y):
    val_logits = model(x, training=False)
    val_acc_metric.update_state(y, val_logits)


optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

batch_size = 16



# dataset.

x_train = np.load('x_train_data.npy') 
x_valid = np.load('x_valid_data.npy') 
y_train = np.load('y_train_data.npy') 
y_valid = np.load('y_valid_data.npy') 


#prepare the data for training
x_train = np.expand_dims(x_train, axis=2)
x_valid = np.expand_dims(x_valid, axis=2)
y_train = np.expand_dims(y_train, axis=2)
y_valid = np.expand_dims(y_valid, axis=2)


#prepare the training datasets based on tensorflow
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)



# Prepare the validation dataset based on tensorflow
val_dataset = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))
val_dataset = val_dataset.batch(batch_size)


train_acc_metric = tf.keras.metrics.MeanSquaredError()
val_acc_metric = tf.keras.metrics.MeanSquaredError()


#model
model = test_model(im_width=1, im_height=80, neurons=16, kern_sz = 20) 
model.summary()

###### custom training loop ######

import time
epochs = 2
losses = []  # per-batch training losses, appended to in the loop below
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    start_time = time.time()

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        loss_value = train_step(x_batch_train, y_batch_train)
        losses.append(float(loss_value))

        # Log every 2 batches.
        if step % 2 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %d samples" % ((step + 1) * batch_size))
    print(losses)
    # Display metrics at the end of each epoch.
    train_acc = train_acc_metric.result()
    print("Training acc over epoch: %.4f" % (float(train_acc),))
    train_acc_metric.reset_states()

     
    # Run a validation loop at the end of each epoch.
    for x_batch_val, y_batch_val in val_dataset:
        test_step(x_batch_val, y_batch_val)

    val_acc = val_acc_metric.result()
    val_acc_metric.reset_states()
    print("Validation acc: %.4f" % (float(val_acc),))
    print("Time taken: %.2fs" % (time.time() - start_time))


**frozen_slot** · Answer 1

I'm not quite sure what you are trying to achieve. Callbacks let you keep track of records during both model training and model inference. A callback by itself cannot decide which model "is the best". So the question is what you define as a good model. Is it a model that converges well? Do you have a certain metric in mind? Should it generalize well (to in-distribution or out-of-distribution samples)?

Also, if you want to keep track of your results and compare them, I would suggest using Weights & Biases. It is easy to integrate with Keras. First initialize a run:

# Initialize a new wandb run
wandb.init(entity="wandb", project="keras-intro")

# Default values for hyper-parameters
config = wandb.config # Config is a variable that holds and saves hyperparameters and inputs
config.learning_rate = 0.01
config.batch_size = 128
...
config.activation = 'relu'
config.optimizer = 'nadam'

Afterwards, once your model is defined, add the Weights & Biases callback:

# Fit the model to the training data
# (model.fit accepts generators; fit_generator is deprecated in TF 2.x)
model.fit(datagen.flow(X_train, y_train, batch_size=config.batch_size),
          # steps_per_epoch must be an integer and match the batch size used above
          steps_per_epoch=len(X_train) // config.batch_size, epochs=config.epochs,
          validation_data=(X_test, y_test), verbose=0,
          callbacks=[WandbCallback(data_type="image", validation_data=(X_test, y_test), labels=character_names),
                     tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)])

I hope this is what you intended to do. For more information see https://wandb.ai/site/articles/intro-to-keras-with-weights-biases
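If an external service is more than you need, the per-epoch losses from the question can also be written to a CSV file with the standard library alone. The following is only a minimal sketch: the function name `append_epoch_to_csv` and the column names are my own assumptions, not part of Keras or wandb. It appends one row per epoch, roughly like `CSVLogger(..., append=True)`.

```python
import csv
import os


def append_epoch_to_csv(path, epoch, train_loss, val_loss):
    """Append one row of per-epoch losses, writing the header only once."""
    fieldnames = ["epoch", "train_loss", "val_loss"]  # assumed column names
    # Write the header only when the file is new or still empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(
            {"epoch": epoch, "train_loss": train_loss, "val_loss": val_loss}
        )
```

In the loop from the question this would be called once per epoch, after the validation pass, for example with `float(train_acc)` and `float(val_acc)` as the two loss values.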

Edit: to get your callbacks executed during a custom training run, see the Keras documentation [2].

If you want to manage multiple callbacks at once, you can use

callbacks = tf.keras.callbacks.CallbackList([...list of callbacks go here...])

which wraps all of the callbacks together into a container.

Afterwards, within your training loop, you decide when to execute the callbacks. For example, to run them at the end of each epoch you could use something like this:

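early_stopping = EarlyStopping(monitor='val_loss', patience=30, verbose=1)
model_checkpoint = ModelCheckpoint(save_model_name, monitor='val_R2_score',
                                   save_best_only=True, verbose=1, mode='max')
csv_logger = CSVLogger(model_name + ".csv", append=True)

_callbacks = [early_stopping, model_checkpoint, csv_logger]

# `model` is the Keras model that the custom loop is training
callbacks = tf.keras.callbacks.CallbackList(
    _callbacks, add_history=True, model=model)

logs = {}
callbacks.on_train_begin(logs=logs)

for epoch in range(epochs):
    callbacks.on_epoch_begin(epoch, logs=logs)

    # ...your code....

    # The callbacks only see what is in `logs`, so every monitored key
    # (here 'val_loss' and 'val_R2_score') has to be filled in by your code.
    callbacks.on_epoch_end(epoch, logs=logs)

    # EarlyStopping requests a stop by setting this flag on the model
    if model.stop_training:
        break

callbacks.on_train_end(logs=logs)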

[2] https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/Callback
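If you would rather not go through the Keras callbacks at all, the patience logic is small enough to write by hand. Below is a rough sketch, not the Keras implementation: `BestModelTracker`, its arguments and the `save_fn` hook are names I made up for illustration. It keeps the best value of one monitored metric, calls `save_fn` whenever that value improves (similar in spirit to `save_best_only=True`), and reports when there has been no improvement for `patience` consecutive epochs.

```python
import math


class BestModelTracker:
    """Track a monitored metric, save on improvement, signal when to stop."""

    def __init__(self, patience=10, mode="min", min_delta=0.0, save_fn=None):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.patience = patience
        self.mode = mode
        self.min_delta = min_delta
        # Hypothetical hook, e.g. lambda: model.save("model.h5")
        self.save_fn = save_fn
        self.best = math.inf if mode == "min" else -math.inf
        self.best_epoch = None
        self.wait = 0  # epochs since the last improvement

    def _improved(self, value):
        # An equal value does not count as an improvement.
        if self.mode == "min":
            return value < self.best - self.min_delta
        return value > self.best + self.min_delta

    def update(self, epoch, value):
        """Record one epoch's metric; return True when training should stop."""
        if self._improved(value):
            self.best = value
            self.best_epoch = epoch
            self.wait = 0
            if self.save_fn is not None:
                self.save_fn()
            return False
        self.wait += 1
        return self.wait >= self.patience
```

In the loop from the question you would create the tracker once before training and then, after the validation pass of each epoch, do something like `if tracker.update(epoch, float(val_acc)): break`. Note that this saves the model each time the validation loss improves, so the file on disk already holds the best model when the loop stops, rather than saving only at the moment training is stopped.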
