TensorFlow with 2 GPUs ignores one of them


I have a problem running TensorFlow on two GPUs connected via SLI: only one of them does any work and the second one does not, although TF recognizes both GPUs.

Setup:

  - Ubuntu 18.04
  - Python 3
  - TensorFlow 2.1
  - CUDA 10.1
  - Official Nvidia drivers 440.64
  - AMD Ryzen 2700
  - Asus x470 prime
  - Two GTX 1070 GPUs connected via SLI

I have already tried many things that I found on the internet. Specifically:

  1. I started with TensorFlow 2.0, which did not work, so I updated to TF 2.1. The problem remains.

  2. I purged and reinstalled the Nvidia drivers (430.50), then updated them to 440.64. The problem remains.

  3. I tested each GPU separately: I physically removed one of them and ran the code on the remaining one. It worked, so both GPUs seem to be OK.

  4. I tested each of the GPU slots on my motherboard separately. It worked, so each slot is fine.

  5. I installed both GPUs, with and without the hardware SLI bridge, and ran the following code:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import Xception
import numpy as np

num_samples = 100
height = 224
width = 224
num_classes = 50

strategy = tf.distribute.MirroredStrategy(devices=['/GPU:0', '/GPU:1'])
with strategy.scope():
    parallel_model = Xception(weights=None,
                              input_shape=(height, width, 3),
                              classes=num_classes)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

### Works only for the first GPU of the 
# parallel_model = Xception(weights=None,
#                           input_shape=(height, width, 3),
#                           classes=num_classes)
# parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

parallel_model.summary()
# This `fit` call is distributed across the GPUs listed in the strategy (2 here).
# Since the global batch size is 16, each GPU processes 8 samples per step.
parallel_model.fit(x, y, epochs=20, batch_size=16)
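For reference, the batch arithmetic for this script can be sketched in plain Python. It assumes, as with Keras `fit` under `MirroredStrategy`, that `batch_size` is the global batch that gets divided among the replicas. The helper names `per_replica_batch_size` and `steps_per_epoch` are hypothetical and only illustrate the numbers; they are not TensorFlow functions.

```python
import math


def per_replica_batch_size(global_batch_size, num_replicas):
    """Samples each replica (GPU) gets per step when the global batch splits evenly."""
    quotient, remainder = divmod(global_batch_size, num_replicas)
    if remainder:
        # An uneven split is outside the scope of this sketch.
        raise ValueError("global batch size is not divisible by the replica count")
    return quotient


def steps_per_epoch(num_samples, global_batch_size):
    """Number of training steps per epoch, counting a final partial batch."""
    return math.ceil(num_samples / global_batch_size)


# Values taken from the script above: batch_size=16, two GPUs, 100 samples.
print(per_replica_batch_size(16, 2))
print(steps_per_epoch(100, 16))
```

With two devices in the strategy this gives 8 samples per GPU per step; with a single device the whole batch of 16 goes to that GPU.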

As a result, with strategy = tf.distribute.MirroredStrategy(devices=['/GPU:0']) the code runs fine. However, with devices=['/GPU:1'] or devices=['/GPU:0', '/GPU:1'], nvidia-smi shows a process on the second GPU, but execution gets stuck at the line

2020-03-28 21:51:14.891325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7162 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-03-28 21:51:14.891805: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-28 21:51:14.892399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7624 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:09:00.0, compute capability: 6.1),

so I have to reboot the computer, because it is completely frozen.
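The per-GPU check from step 3 can possibly be repeated without physically removing a card by hiding one GPU from CUDA with the `CUDA_VISIBLE_DEVICES` environment variable. Below is a minimal sketch using only the standard library. It assumes the training code is saved in a separate file (the name `train.py` in the usage comment is hypothetical), and the helpers `env_for_gpu` and `run_on_gpu` are illustrative names, not part of TensorFlow or CUDA. Note that when only one GPU is visible, the script sees it as `/GPU:0` regardless of its physical index, so the strategy in that script would need `devices=['/GPU:0']`.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes a single GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES restricts which physical GPUs the process can see;
    # the one visible card is then enumerated as device 0 inside the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(script_path, gpu_index, timeout_s=600):
    """Run a script in a child process limited to one GPU.

    Returns the child's exit code, or None if it did not finish in time.
    """
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            env=env_for_gpu(gpu_index),
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return None
    return result.returncode


# Example usage (hypothetical script name):
#   for gpu in (0, 1):
#       print(gpu, run_on_gpu("train.py", gpu))
```

The timeout only helps if the hang is confined to the child process; if the whole machine freezes, as described above, it will not recover it.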

  6. Initially, my X11 configuration (xorg.conf) was not configured for SLI:
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

After some searching, I experimented with sudo nvidia-xconfig -sli=on, sudo nvidia-xconfig -sli=auto, etc.

As a result, after a reboot I get a boot loop showing two lines:

recovering journal
/dev/nvme0n1p2: clean, XXX/XXX files, XXX/XXX blocks

Every ~3 seconds the screen goes black and then these two lines appear again. It is impossible to get to a TTY, because that is stuck in the boot loop as well. I tried everything I could find on this subject, and nothing worked. So I went back to the previous X11 config without SLI.

If you have experienced this kind of problem, please share. Any advice would help.

Thanks!
