Is it possible to resume training from a checkpoint model in Tensorflow?


I am doing auto-segmentation, and I was training a model over the weekend when the power went out. I had trained the model for 50+ hours and saved it every 5 epochs using the line:

model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period=5)
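
For context, that callback is passed to fit, so checkpoints such as test_0005.h5, test_0010.h5, ... get written as training runs. A minimal sketch (the batch size and epoch count here are placeholders, not the actual values used):

model.fit(train_x, train_y,
          batch_size=2,      # placeholder
          epochs=100,        # placeholder
          callbacks=[model_checkpoint])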

I'm loading the saved model using the line:

model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})
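
Concretely, that checkpoint pattern writes files such as test_0045.h5 (the epoch number here is hypothetical), so my plan for resuming is roughly the following, with initial_epoch so the epoch counter and checkpoint filenames carry on from where the run stopped (the totals are placeholders):

from tensorflow.keras.models import load_model

last_epoch = 45  # hypothetical: the last checkpoint actually written to disk
model = load_model(f'test_{last_epoch:04}.h5',
                   custom_objects={'dice_coef_loss': dice_coef_loss,
                                   'dice_coef': dice_coef})

# Resume training: epochs is the total target, initial_epoch is where to pick up.
model.fit(train_x, train_y,
          epochs=100,                # placeholder total
          initial_epoch=last_epoch,
          callbacks=[model_checkpoint])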

I have loaded all of my data, with the training data split into train_x for the scans and train_y for the labels. When I run the line:

loss, dice_coef = model.evaluate(train_x, train_y, verbose=1)

I get the error:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_3673]

Function call stack:
distributed_function

Accepted answer:

This basically means you are running out of GPU memory, so you need to run evaluate in smaller batches. The default batch size is 32; try passing a smaller batch_size:

loss, dice_coef = model.evaluate(train_x, train_y, batch_size=<batch size>)

From the Keras documentation:

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
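
For example (the batch size of 4 here is just an illustrative starting point; halve it again if the OOM persists):

# Smaller batches lower peak GPU memory during evaluation;
# the loss/metrics are still computed over the whole dataset.
loss, dice_coef = model.evaluate(train_x, train_y, batch_size=4, verbose=1)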