I've created a GCP instance with a Tesla T4, using the following image: projects/deeplearning-platform-release/global/images/tf-2-8-cu113-v20220516-ubuntu-2004
Everything looks fine from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |    105MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
and from inside Python, using tf.sysconfig.get_build_info() and tf.config.list_physical_devices('GPU'):
gpu: ['Tesla T4'] memory: ['15109 MiB', '256 MiB'] pci: gen ['1'] ['16x'] architecture: [] driver: ['470.129.06'] cuda: ['11.4']
tensorflow: 2.8.1 cuda 11.2 cudnn 8
physical devices: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') {'compute_capability': (7, 5), 'device_name': 'Tesla T4'}
logical devices: LogicalDevice(name='/device:GPU:0', device_type='GPU')
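For reference, a minimal sketch of how the info above can be queried, using the public TF APIs mentioned (the exact print formatting is my own):

```python
import tensorflow as tf

# Compare the CUDA/cuDNN versions TF was built against with the
# devices it can actually see at runtime.
build = tf.sysconfig.get_build_info()
print("tensorflow:", tf.__version__,
      "cuda", build.get("cuda_version"), "cudnn", build.get("cudnn_version"))

gpus = tf.config.list_physical_devices("GPU")
print("physical devices:", gpus)
for gpu in gpus:
    # get_device_details reports e.g. compute_capability and device_name
    print(tf.config.experimental.get_device_details(gpu))
print("logical devices:", tf.config.list_logical_devices("GPU"))
```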
but model load and inference are slow to the point of being unusable, and the T4 is supposed to be pretty good!
for example, model load on my home RTX 3060 takes ~13 sec, while on GCP with the T4 it's ~67 sec (the model is stored on local disk; no network transfers are involved)
and a single inference on the RTX 3060 takes <1 sec, while on GCP with the T4 it's ~37 sec when it completes at all (most of the time it hangs and the process gets killed)
same with a simple run of tf.test.Benchmark().run_op_benchmark:
- RTX 3060 is <9 sec
- Tesla T4 is >20 sec
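The benchmark run was roughly like the sketch below (the matmul op is an assumed stand-in for whatever op was actually timed; run_op_benchmark is TF1-style, so it needs a v1 Session):

```python
import tensorflow as tf

# run_op_benchmark runs the op repeatedly in a session and reports
# average wall time per iteration.
tf.compat.v1.disable_eager_execution()
with tf.compat.v1.Session() as sess:
    a = tf.random.normal([2048, 2048])
    op = tf.linalg.matmul(a, a)
    result = tf.test.Benchmark().run_op_benchmark(sess, op, min_iters=10)
    print("iters:", result["iters"], "wall_time per run:", result["wall_time"])
```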
this seems like too big a slowdown to be just T4 performance
any ideas?
UPDATE:
First, confirming that the GPU is definitely being used, as
nvidia-smi dmon
shows clearly.
Second, the occasional hangs/aborts were due to excessive OS swapping.
It seems that having less system memory (8 GB) than GPU memory (16 GB) is a no-go; this is now resolved.
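In case it helps anyone hitting the same hangs, this is one way to spot the swapping described above (standard Linux tools, nothing GCP-specific):

```shell
# Compare total RAM vs swap in use
free -h
# Watch the si/so columns for 5 seconds: sustained nonzero values
# (pages swapped in/out per second) mean the OS is actively swapping
vmstat 1 5
```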
But... the overall performance of the T4 is still pretty underwhelming: about 2x slower than a low-end RTX 3060.
That seems to be about what's to be expected, given its older architecture (Turing vs Ampere) and slightly fewer CUDA cores.