Issues with GCP instance and T4 using TensorFlow


I've created a GCP instance with a Tesla T4, using the following image: projects/deeplearning-platform-release/global/images/tf-2-8-cu113-v20220516-ubuntu-2004

Everything seems fine from nvidia-smi:

| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |    105MiB / 15109MiB |      0%      Default |

And from inside Python, using tf.sysconfig.get_build_info() and tf.config.list_physical_devices('GPU'):

gpu: ['Tesla T4'] memory: ['15109 MiB', '256 MiB'] pci: gen ['1'] ['16x'] architecture: [] driver: ['470.129.06'] cuda: ['11.4']
tensorflow: 2.8.1 cuda 11.2 cudnn 8
physical devices: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') {'compute_capability': (7, 5), 'device_name': 'Tesla T4'}
logical  devices: LogicalDevice(name='/device:GPU:0', device_type='GPU')
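For reference, the diagnostics above can be gathered with a snippet along these lines (a minimal sketch; on a CPU-only build the GPU list is simply empty):

```python
import tensorflow as tf

# Build info: which CUDA/cuDNN versions this TensorFlow wheel was built against.
info = tf.sysconfig.get_build_info()
print("tensorflow:", tf.__version__,
      "cuda", info.get("cuda_version"),
      "cudnn", info.get("cudnn_version"))

# Devices TensorFlow can actually see, plus per-device details
# (compute capability, device name).
gpus = tf.config.list_physical_devices("GPU")
print("physical devices:", gpus)
for gpu in gpus:
    print(tf.config.experimental.get_device_details(gpu))
```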

But model load and inference are slow to the point of being unusable, and the T4 is supposed to be pretty good!

For example, on my home RTX 3060 model load takes ~13 sec, while on GCP with the T4 it's ~67 sec (the model is stored on local disk; no network transfers are involved).

A single inference on the RTX 3060 takes <1 sec, while on GCP with the T4 it's ~37 sec when it completes at all (most of the time it hangs and the process gets killed).
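The load/inference timings were measured along these lines (a sketch; the tiny Dense model and .h5 path here are stand-ins for the real model):

```python
import os
import tempfile
import time

import tensorflow as tf

# Stand-in model: replace with your real model path to reproduce the numbers.
model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,))])
path = os.path.join(tempfile.mkdtemp(), "model.h5")
model.save(path)

# Time model load from local disk.
t0 = time.perf_counter()
loaded = tf.keras.models.load_model(path)
load_s = time.perf_counter() - t0

# Time a single inference on a random batch.
x = tf.random.uniform((1, 4))
t0 = time.perf_counter()
loaded.predict(x, verbose=0)
infer_s = time.perf_counter() - t0

print(f"load: {load_s:.2f} sec  infer: {infer_s:.2f} sec")
```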

The same holds for a simple run of tf.test.Benchmark().run_op_benchmark:

  • RTX 3060 is <9 sec
  • Tesla T4 is >20 sec
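The benchmark run looked roughly like this (a sketch: the matmul is a hypothetical stand-in for the real model op; run_op_benchmark needs graph mode and a v1 session):

```python
import tensorflow as tf

# run_op_benchmark runs against a graph-mode session, so disable eager first.
tf.compat.v1.disable_eager_execution()

with tf.compat.v1.Session() as sess:
    # Hypothetical workload standing in for the real model op.
    a = tf.random.uniform((512, 512))
    op = tf.linalg.matmul(a, a)

    bench = tf.test.Benchmark()
    results = bench.run_op_benchmark(sess, op, min_iters=5)

print("wall_time per iteration: %.6f sec" % results["wall_time"])
```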

This seems like too big a gap to be explained by T4 performance alone.

Any ideas?


UPDATE:

First, confirming that the GPU is definitely being used, as nvidia-smi dmon shows clearly.

Second, the occasional hangs/aborts were due to excessive OS swapping.
It seems that having less system memory (8 GB) than GPU memory (16 GB) is a no-go; this is now resolved.
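A quick sanity check for this failure mode (a sketch, assuming Linux; the 16 GiB constant is the T4's memory size and should be adjusted per card):

```python
import os

# Condition that caused the hangs here: host RAM smaller than GPU memory,
# which drove the OS into heavy swapping under load.
GPU_MEM_GIB = 16  # Tesla T4 has 16 GiB; adjust for your card

host_gib = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30
if host_gib < GPU_MEM_GIB:
    print(f"warning: host RAM {host_gib:.1f} GiB < GPU memory {GPU_MEM_GIB} GiB; "
          "expect swapping under load")
else:
    print(f"host RAM {host_gib:.1f} GiB >= GPU memory {GPU_MEM_GIB} GiB")
```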

But... overall performance of the T4 is pretty underwhelming: about 2x slower than a low-end RTX 3060:

  • Tesla T4: benchmark: 19.4 sec / load: 26.2 sec / infer: 13.24 sec
  • RTX 3060: benchmark: 10.1 sec / load: 9.1 sec / infer: 7.7 sec

It seems that's about what should be expected given its older architecture (Turing vs. Ampere) and slightly fewer CUDA cores.