I'm working with TensorFlow-GPU version 2.10.0, and I'm facing challenges in achieving deterministic training results. I have set all the relevant random seeds, including random.seed(42), np.random.seed(42), and tf.random.set_seed(42). Despite these seed settings, the accuracy and loss still differ between runs, and the gap grows with each epoch.
Method 1: While searching for a solution, I came across the suggestion to use tf.config.experimental.enable_op_determinism(). However, when I try it, I get the error below.
Here's a summary of the seed-setting code snippet:
import os
import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
os.environ['PYTHONHASHSEED'] = str(42)
tf.random.set_seed(42)
And the attempt to enable op determinism:
tf.config.experimental.enable_op_determinism()
Error message:
File "C:\Users\anny\anaconda3\envs\ex\lib\site-packages\tensorflow\python\keras\optimizer_v2\optimizer_v2.py", line 467, in _get_gradients
grads = tape.gradient(loss, var_list, grad_loss)
Node: 'gradient_tape/model/FPN2.upsample/resize/ResizeNearestNeighborGrad'
A deterministic GPU implementation of ResizeNearestNeighborGrad is not currently available.
[[{{node gradient_tape/model/FPN2.upsample/resize/ResizeNearestNeighborGrad}}]] [Op:__inference_train_function_38689]
2023-11-19 23:53:15.256109: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
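If I read the error correctly, enable_op_determinism() raises an error for any op that has no deterministic GPU implementation, and ResizeNearestNeighborGrad (the backward pass of the nearest-neighbor upsampling in my FPN) is such an op. One workaround I am considering, sketched below, is to switch that layer's interpolation to bilinear. The layer definition shown is only my guess at what the FPN2.upsample layer looks like, not the actual model code, and I have not verified that the bilinear gradient is deterministic in 2.10:

from tensorflow.keras import layers

# Sketch only: swap the interpolation of the upsampling layer so the backward
# pass no longer goes through ResizeNearestNeighborGrad.
# Before (presumably what the error points at):
# upsample = layers.UpSampling2D(size=2, interpolation='nearest', name='FPN2.upsample')
# After:
upsample = layers.UpSampling2D(size=2, interpolation='bilinear', name='FPN2.upsample')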
Method 2:

import os
import random
import tensorflow as tf
from tensorflow import keras

def random_seed(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)   # Python hashing
    random.seed(seed)                          # Python random
    tf.random.set_seed(seed)
    keras.utils.set_random_seed(seed)          # also seeds Python, NumPy and TF
    tf.compat.v1.set_random_seed(seed)
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'

random_seed(42)
Setting TF_CUDNN_DETERMINISTIC to '1' tells TensorFlow to select deterministic algorithms in cuDNN (the CUDA Deep Neural Network library), which should make cuDNN-backed operations such as convolutions reproducible.
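One detail I am unsure about: as far as I understand, cuDNN-related environment variables are read when TensorFlow initializes, and PYTHONHASHSEED only affects hash randomization if it is set before the Python process starts. So as a precaution I now set the variables at the very top of the script, before importing TensorFlow (a sketch, not my full script):

import os
# Set before importing TensorFlow so the value is visible when cuDNN is initialized.
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
# Note: for PYTHONHASHSEED to influence hash randomization it has to be set
# before the interpreter starts, e.g. PYTHONHASHSEED=42 python train.py
import tensorflow as tf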
history = model.fit(get_train_batch('train', training_images, bs),
                    steps_per_epoch=len(training_images)//bs, epochs=epoch, verbose=1,
                    validation_data=get_train_batch('valid', validing_images, 1), validation_steps=800,
                    callbacks=[checkpoint, logger, early_stop], use_multiprocessing=False)
I also set use_multiprocessing=False so the generator runs in a single process.
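Another thing I checked is the shuffling inside my get_train_batch generator, since an unseeded shuffle there would change the batch order between runs. Below is only a rough sketch of what I mean by seeding it, not my real generator; load_batch is a placeholder for my actual loading/augmentation code:

import random

def get_train_batch(split, image_list, batch_size, seed=42):
    rng = random.Random(seed)              # own RNG, independent of the global random state
    while True:
        order = list(image_list)
        rng.shuffle(order)                 # identical batch order in every run
        for i in range(0, len(order) - batch_size + 1, batch_size):
            yield load_batch(order[i:i + batch_size], split)   # placeholder for my real code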
The initial weights are consistently the same each time, but discrepancies in training results still emerge after the first epoch.
Experiment 1:
Epoch 1/300
2023-11-21 14:51:49.923497: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
52/52 [==============================] - 63s 920ms/step - loss: 1.1201 - accuracy: 0.3029 - val_loss: 1.1028 - val_accuracy: 0.3363
Epoch 2/300
52/52 [==============================] - 45s 863ms/step - loss: 1.0943 - accuracy: 0.3252 - val_loss: 1.1095 - val_accuracy: 0.2812
Epoch 00002: val_loss did not improve from 1.10278
Epoch 3/300
52/52 [==============================] - 44s 861ms/step - loss: 1.0322 - accuracy: 0.4854 - val_loss: 1.1669 - val_accuracy: 0.4475
Epoch 00003: val_loss did not improve from 1.10278
Epoch 4/300
52/52 [==============================] - 45s 877ms/step - loss: 0.8544 - accuracy: 0.6068 - val_loss: 1.4057 - val_accuracy: 0.4025
Epoch 00004: val_loss did not improve from 1.10278
Experiment 2:
Epoch 1/300
2023-11-21 14:56:31.573103: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
52/52 [==============================] - 63s 930ms/step - loss: 1.1201 - accuracy: 0.3029 - val_loss: 1.1028 - val_accuracy: 0.3363
Epoch 2/300
52/52 [==============================] - 45s 869ms/step - loss: 1.0943 - accuracy: 0.3301 - val_loss: 1.1096 - val_accuracy: 0.2812
Epoch 00002: val_loss did not improve from 1.10279
Epoch 3/300
52/52 [==============================] - 44s 855ms/step - loss: 1.0320 - accuracy: 0.4951 - val_loss: 1.1674 - val_accuracy: 0.4475
Epoch 00003: val_loss did not improve from 1.10279
Epoch 4/300
52/52 [==============================] - 47s 921ms/step - loss: 0.8537 - accuracy: 0.6068 - val_loss: 1.4036 - val_accuracy: 0.4588
When I train the model on the CPU, I get identical results every time (identical accuracy and loss), but training is slow. Since my application requires running the training process many times, I would still like to use the GPU. How can I achieve run-to-run reproducible training results with TensorFlow-GPU 2.10.0? Are there alternative approaches or solutions to this issue?
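One idea I have read about but not yet verified on 2.10: keep the whole model on the GPU and pin only the problematic upsampling op to the CPU with tf.device, so the non-deterministic GPU kernel is never used. A sketch of what I mean (CPUUpsample is a hypothetical wrapper of mine, and I am not certain the gradient op is placed on the CPU as well):

import tensorflow as tf

class CPUUpsample(tf.keras.layers.Layer):
    # Hypothetical wrapper: runs nearest-neighbor upsampling under /CPU:0 so the
    # ResizeNearestNeighbor kernel (and, hopefully, its gradient) avoids the GPU.
    def __init__(self, size=2, **kwargs):
        super().__init__(**kwargs)
        self.up = tf.keras.layers.UpSampling2D(size=size, interpolation='nearest')

    def call(self, inputs):
        with tf.device('/CPU:0'):
            return self.up(inputs)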