Is it possible to resume training from a checkpoint model in Tensorflow?


I am doing auto-segmentation, and I was training a model over the weekend when the power went out. I had trained the model for 50+ hours and saved it every 5 epochs using the line:

model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period=5)
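
For context, that callback is passed to fit, so checkpoints such as test_0005.h5, test_0010.h5, ... get written as training runs. A minimal sketch (the batch size and epoch count here are placeholders, not the actual values used):

model.fit(train_x, train_y,
          batch_size=2,      # placeholder
          epochs=100,        # placeholder
          callbacks=[model_checkpoint])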

I'm loading the saved model using the line:

model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})
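
Concretely, that checkpoint pattern writes files such as test_0045.h5 (the epoch number here is hypothetical), so my plan for resuming is roughly the following, with initial_epoch so the epoch counter and checkpoint filenames carry on from where the run stopped (the totals are placeholders):

from tensorflow.keras.models import load_model

last_epoch = 45  # hypothetical: the last checkpoint actually written to disk
model = load_model(f'test_{last_epoch:04}.h5',
                   custom_objects={'dice_coef_loss': dice_coef_loss,
                                   'dice_coef': dice_coef})

# Resume training: epochs is the total target, initial_epoch is where to pick up.
model.fit(train_x, train_y,
          epochs=100,                # placeholder total
          initial_epoch=last_epoch,
          callbacks=[model_checkpoint])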

I have loaded all of my data, with the training data split into train_x for the scans and train_y for the labels. When I run the line:

loss, dice_coef = model.evaluate(train_x, train_y, verbose=1)

I get the error:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_3673]

Function call stack:
distributed_function

Accepted answer:

This basically means you are running out of GPU memory, so you need to run evaluate in smaller batches. The default batch size is 32; try passing a smaller batch_size:

loss, dice_coef = model.evaluate(train_x, train_y, batch_size=<batch size>)

From the Keras documentation:

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
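
For example (the batch size of 4 here is just an illustrative starting point; halve it again if the OOM persists):

# Smaller batches lower peak GPU memory during evaluation;
# the loss/metrics are still computed over the whole dataset.
loss, dice_coef = model.evaluate(train_x, train_y, batch_size=4, verbose=1)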