I tried to use Keras multi-GPU training (tf.distribute.MirroredStrategy) to train my model with the following code:
tf.keras.backend.clear_session()
strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
......
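# CategoricalAccuracy() is logged as 'categorical_accuracy', so monitor the matching validation key
early_stop = EarlyStopping(monitor='val_categorical_accuracy', min_delta=0.001, patience=20, mode='max', restore_best_weights=True)
with strategy.scope():
    # Build and compile inside the scope so the variables are mirrored across replicas
    model = Transformer()
    model.compile(
        optimizer=optimizers.Adam(1e-4),
        loss=CategoricalCrossentropy(),
        metrics=[CategoricalAccuracy()]
    )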
......
train_dataset = tf.data.Dataset.from_tensor_slices(
    ({"encoder_input": encoder_input_train, "decoder_input": decoder_input_train},
     decoder_output_train)).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices(
    ({"encoder_input": encoder_input_test, "decoder_input": decoder_input_test},
     decoder_output_test)).batch(batch_size)
# The datasets are already batched, so batch_size is not passed to fit() again
model.fit(
    train_dataset,
    epochs=epochs,
    verbose=1,
    callbacks=[early_stop],
    validation_data=val_dataset
)
Inside the distribution strategy scope, the model compiles without errors. But when execution reaches model.fit, the following error is reported:
2023-11-27 13:36:26.033383: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at xla_ops.cc:289 : INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
Node: 'replica_1/StatefulPartitionedCall_64'
5 root error(s) found.
(0) INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
[[{{node replica_1/StatefulPartitionedCall_64}}]]
[[update_2/AssignAddVariableOp/_855]]
(1) INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
[[{{node replica_1/StatefulPartitionedCall_64}}]]
[[div_no_nan_1/_847]]
(2) INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
[[{{node replica_1/StatefulPartitionedCall_64}}]]
[[div_no_nan_1/_843]]
(3) INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
[[{{node replica_1/StatefulPartitionedCall_64}}]]
[[div_no_nan/ReadVariableOp_2/_796]]
(4) INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
[[{{node replica_1/StatefulPartitionedCall_64}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_11278]
It seems that a resource (a variable placed on GPU:0) cannot be accessed from another GPU. Can anyone help me with this? Thanks!
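To double-check which devices are involved, a small helper like the one below can pull the resource name and the two devices out of the messages above. This is only a sketch: the function name summarize_device_mismatches is made up for illustration, and the pattern assumes the exact message wording shown in this log.

```python
import re

# Pattern for the message wording seen in the log above (an assumption:
# other TensorFlow versions may phrase the message differently).
PATTERN = re.compile(
    r"Trying to access resource (?P<resource>\S+) .*?"
    r"located in device (?P<owner>\S+) from device (?P<accessor>\S+)"
)


def summarize_device_mismatches(log_text):
    """Return the distinct (resource, owning device, accessing device) triples."""
    return {
        (m["resource"], m["owner"], m["accessor"])
        for m in PATTERN.finditer(log_text)
    }


# One root-error line copied from the log above.
sample = (
    r"(0) INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 "
    r"(defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) "
    r"located in device /job:localhost/replica:0/task:0/device:GPU:0 "
    r"from device /job:localhost/replica:0/task:0/device:GPU:1"
)

for resource, owner, accessor in sorted(summarize_device_mismatches(sample)):
    print(f"{resource}: lives on {owner}, read from {accessor}")
```

Running it over the full log would show whether every root error refers to the same resource and the same pair of devices, which appears to be the case here (one resource on GPU:0, accessed from GPU:1).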
2023-11-27 13:35:58.465328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14779 MB memory: -> device: 0, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:06:00.0, compute capability: 7.0
2023-11-27 13:35:58.470966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14779 MB memory: -> device: 1, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:2f:00.0, compute capability: 7.0
2023-11-27 13:35:58.475054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14779 MB memory: -> device: 2, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:86:00.0, compute capability: 7.0
2023-11-27 13:35:58.478613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14779 MB memory: -> device: 3, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0
4 Physical GPUs, 4 Logical GPUs
Number of devices: 4
The GPUs are V100s, and all four are detected and initialized properly, as the log above shows. Could this be a GPU hardware interconnect issue on the server?
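Before suspecting hardware, it may be worth checking whether XLA compilation is switched on anywhere, since the failure is raised in xla_ops.cc and the message links to the XLA known-issues page about a tf.Variable on a different device. Below is a minimal diagnostic sketch, not a confirmed fix. The helper name xla_settings is hypothetical; it only reads the TF_XLA_FLAGS environment variable and the global setting returned by tf.config.optimizer.get_jit(). It cannot see XLA that is requested inside the model code, for example a @tf.function(jit_compile=True) somewhere in the Transformer class (which is not shown in this post, so that is an assumption to verify by hand).

```python
import os


def xla_settings():
    """Collect the global settings that can turn on XLA compilation."""
    report = {"TF_XLA_FLAGS": os.environ.get("TF_XLA_FLAGS")}
    try:
        import tensorflow as tf
    except ImportError:
        # Lets the script run on a machine without TensorFlow installed.
        report["tensorflow"] = None
        return report
    report["tensorflow"] = tf.__version__
    # Empty/False means global XLA auto-clustering is off.
    report["global_jit"] = tf.config.optimizer.get_jit()
    return report


for key, value in xla_settings().items():
    print(f"{key}: {value!r}")
```

If XLA turns out to be enabled, things to try (assuming the installed TensorFlow version supports them) are passing jit_compile=False to model.compile, or removing any jit_compile=True decorators in the model code, and then re-running fit under the same strategy scope.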