I am training multiple Keras MLP models simultaneously on a 64-core CPU workstation. To do this, I use a Python multiprocessing pool so that each CPU core trains one model. For each model I use an EarlyStopping and a ModelCheckpoint callback, defined like this:
es = EarlyStopping(monitor='val_mse', mode='min', verbose=VERBOSE_ALL, patience=10)
mc = ModelCheckpoint('best_model.h5', monitor='val_mse', mode='min', verbose=VERBOSE_ALL, save_best_only=True)
When I train a single model, training runs without any problems. When I use the multiprocessing pool, however, the callbacks fail with an HDF5 model-saving error:
Traceback (most recent call last):
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1029, in _save_model
self.model.save(filepath, overwrite=True)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1008, in save
signatures, options)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\saving\save.py", line 112, in save_model
model, filepath, overwrite, include_optimizer)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\saving\hdf5_format.py", line 92, in save_model_to_hdf5
f = h5py.File(filepath, mode='w')
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\h5py\_hl\files.py", line 394, in __init__
swmr=swmr)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\h5py\_hl\files.py", line 176, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5f.pyx", line 105, in h5py.h5f.create
OSError: Unable to create file (file signature not found)
The error occurs sporadically. I can catch the exception and repeat the training of the affected model, but is there a way to avoid the issue altogether, for example by setting flags or by using a different file format for the checkpoint?
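For reference, the catch-and-retry workaround mentioned above can be sketched roughly as follows. This is a simplified illustration, not my exact code: the names train_with_retry, train_fn and max_attempts are hypothetical, and train_fn stands for whatever function builds and fits one model.

```python
def train_with_retry(train_fn, max_attempts=3):
    """Call train_fn, repeating it if the checkpoint save raises OSError.

    train_fn is a hypothetical zero-argument callable that trains one model
    and returns its result. The last OSError is re-raised once the
    attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return train_fn()
        except OSError:
            # h5py reports the failed file creation as an OSError.
            if attempt == max_attempts:
                raise
```

This only hides the symptom, which is why I am looking for a cleaner solution.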
Tensorflow version: 2.1.0
Keras version: 2.3.1
Imports:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
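One thing I notice is that every worker in the pool writes its checkpoint to the same 'best_model.h5'. I have not confirmed that this is the cause, but if concurrent writes to one file are the problem, giving each model its own checkpoint file might avoid it. A minimal sketch under that assumption follows. The helper names checkpoint_path and make_callbacks and the model_id argument are hypothetical; the callback arguments are the ones from my definitions above.

```python
import os
import tempfile


def checkpoint_path(model_id, directory=None):
    """Return a checkpoint path unique to this model and worker process."""
    directory = directory or tempfile.gettempdir()
    filename = f"best_model_{model_id}_pid{os.getpid()}.h5"
    return os.path.join(directory, filename)


def make_callbacks(model_id, directory=None, verbose=0):
    """Build the two callbacks with a per-model checkpoint file."""
    # Imported inside the function so each worker process loads TensorFlow itself.
    from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

    es = EarlyStopping(monitor='val_mse', mode='min', verbose=verbose, patience=10)
    mc = ModelCheckpoint(checkpoint_path(model_id, directory), monitor='val_mse',
                         mode='min', verbose=verbose, save_best_only=True)
    return [es, mc]
```

Each worker would then call make_callbacks with the index of the model it trains and pass the result to model.fit(..., callbacks=...).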