Keras multi-GPU training: GPU resource can't be shared between devices, causing training to fail

I tried to train my model on multiple GPUs with Keras, using the following code:

import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy

tf.keras.backend.clear_session()
strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
......
early_stop = EarlyStopping(monitor='val_accuracy', min_delta=0.001, patience=20, mode='max', restore_best_weights=True)
with strategy.scope():
    # The model is built and compiled inside the strategy scope
    model = Transformer()
    model.compile(
        optimizer=optimizers.Adam(1e-4),
        loss=CategoricalCrossentropy(),
        metrics=[CategoricalAccuracy()]
    )
......
train_dataset = tf.data.Dataset.from_tensor_slices(
        ({"encoder_input": encoder_input_train, "decoder_input": decoder_input_train},
         decoder_output_train)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices(
        ({"encoder_input": encoder_input_test, "decoder_input": decoder_input_test},
         decoder_output_test)).batch(batch_size)

model.fit(
    train_dataset,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    callbacks=[early_stop],
    validation_data=val_dataset
)
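
In case it is relevant: both datasets are batched with a single fixed batch_size. My understanding from the tf.distribute guide is that with MirroredStrategy the dataset's batch size is the global batch size, split across the replicas, so it is usually scaled by the number of GPUs. A minimal sketch of what that would look like (the per-replica value of 64 is only illustrative, not from my code):

# Sketch: scale the global batch size by the number of replicas, as the
# tf.distribute guide recommends (per_replica_batch_size is a hypothetical value).
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

train_dataset = tf.data.Dataset.from_tensor_slices(
        ({"encoder_input": encoder_input_train, "decoder_input": decoder_input_train},
         decoder_output_train)).batch(global_batch_size)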

Inside the distributed strategy's scope, the model compiles properly. But when execution reaches model.fit, the following error is reported:

2023-11-27 13:36:26.033383: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at xla_ops.cc:289 : INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
2023-11-27 13:36:26.034620: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at xla_ops.cc:289 : INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
2023-11-27 13:36:26.036020: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at xla_ops.cc:289 : INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
2023-11-27 13:36:26.037382: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at xla_ops.cc:289 : INVALID_ARGUMENT: Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
Node: 'replica_1/StatefulPartitionedCall_64'
5 root error(s) found.
  (0) INVALID_ARGUMENT:  Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
     [[{{node replica_1/StatefulPartitionedCall_64}}]]
     [[update_2/AssignAddVariableOp/_855]]
  (1) INVALID_ARGUMENT:  Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
     [[{{node replica_1/StatefulPartitionedCall_64}}]]
     [[div_no_nan_1/_847]]
  (2) INVALID_ARGUMENT:  Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
     [[{{node replica_1/StatefulPartitionedCall_64}}]]
     [[div_no_nan_1/_843]]
  (3) INVALID_ARGUMENT:  Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
     [[{{node replica_1/StatefulPartitionedCall_64}}]]
     [[div_no_nan/ReadVariableOp_2/_796]]
  (4) INVALID_ARGUMENT:  Trying to access resource Resource-282-at-0x13338e62a70 (defined @ C:\anaconda3\envs\keras\lib\site-packages\tensorflow\python\ops\gen_resource_variable_ops.py:1226) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
     [[{{node replica_1/StatefulPartitionedCall_64}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_11278]

It seems that some kind of resource cannot be shared between the GPUs. Can anyone help me with this? Thanks! For reference, TensorFlow detects all four GPUs correctly at startup:

2023-11-27 13:35:58.465328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14779 MB memory:  -> device: 0, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:06:00.0, compute capability: 7.0
2023-11-27 13:35:58.470966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14779 MB memory:  -> device: 1, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:2f:00.0, compute capability: 7.0
2023-11-27 13:35:58.475054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14779 MB memory:  -> device: 2, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:86:00.0, compute capability: 7.0
2023-11-27 13:35:58.478613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14779 MB memory:  -> device: 3, name: NVIDIA Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0
4 Physical GPUs, 4 Logical GPUs
Number of devices: 4
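
One way to check whether the model's variables are actually mirrored across the GPUs would be to print their types after compiling. This is only a diagnostic sketch using the model built above; with MirroredStrategy I would expect MirroredVariable rather than a plain variable pinned to GPU:0:

# Diagnostic sketch: under MirroredStrategy each trainable variable should be a
# MirroredVariable with one component per GPU. A plain ResourceVariable that
# lives only on GPU:0 would match the cross-device access error above.
for v in model.trainable_variables[:5]:
    print(v.name, type(v).__name__)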

The GPUs are Tesla V100s and all four are detected correctly, so could this be a GPU hardware interconnect issue on the server?
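
If it were a device-to-device communication problem, would explicitly choosing the cross-device reduction implementation make a difference? Something like the following (a sketch based on the tf.distribute API; since this runs on Windows, where NCCL is not available, HierarchicalCopyAllReduce is sometimes suggested as the alternative):

# Sketch: force a specific cross-device all-reduce implementation instead of
# letting MirroredStrategy pick one. HierarchicalCopyAllReduce avoids NCCL,
# which is not supported on Windows.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)

Checking the interconnect topology with nvidia-smi topo -m might also show whether peer-to-peer access between the cards is available.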
