I am doing automatic segmentation and was training a model over the weekend when the power went out. I had trained the model for 50+ hours, saving it every 5 epochs using the line:
model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period = 5)
I'm loading the saved model using the line:
model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})
I have split my training data into train_x for the scans and train_y for the labels. When I run the line:
loss, dice_coef = model.evaluate(train_x, train_y, verbose=1)
I get the error:
ResourceExhaustedError: OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_3673]
Function call stack:
distributed_function
This error means you are running out of GPU memory, so you need to run the evaluation in smaller batches. Per the Keras documentation, evaluate uses a default batch_size of 32; try passing a smaller value.
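A minimal sketch of the fix, using a toy model as a stand-in (the real 3D segmentation network, dice_coef loss, and scan data from the question are assumed unavailable here); the only relevant part is passing batch_size to evaluate:

```python
import numpy as np
from tensorflow.keras import layers, models

# Toy stand-in model; the question's network is a much larger 3D conv model.
model = models.Sequential([layers.Dense(1, input_shape=(4,))])
model.compile(loss='mse', optimizer='adam')

x = np.random.rand(64, 4).astype('float32')
y = np.random.rand(64, 1).astype('float32')

# batch_size controls how many samples are put on the GPU at once.
# The default of 32 is what produced the [32, 8, 128, 128, 128] tensor
# in the OOM error; a smaller value (even 1) shrinks that allocation.
loss = model.evaluate(x, y, batch_size=4, verbose=0)
```

In the question's code this would become model.evaluate(train_x, train_y, batch_size=1, verbose=1), increasing batch_size until just before memory runs out.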