I've created a GCP instance with a Tesla T4, using the following image: projects/deeplearning-platform-release/global/images/tf-2-8-cu113-v20220516-ubuntu-2004
Everything looks fine from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |    105MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
and from inside Python, using tf.sysconfig.get_build_info() and tf.config.list_physical_devices('GPU'):
gpu: ['Tesla T4'] memory: ['15109 MiB', '256 MiB'] pci: gen ['1'] ['16x'] architecture: [] driver: ['470.129.06'] cuda: ['11.4']
tensorflow: 2.8.1 cuda 11.2 cudnn 8
physical devices: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') {'compute_capability': (7, 5), 'device_name': 'Tesla T4'}
logical devices: LogicalDevice(name='/device:GPU:0', device_type='GPU')
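For reference, a minimal sketch of how the info above can be queried, using the public TF APIs mentioned (the exact print formatting is my own):

```python
import tensorflow as tf

# Compare the CUDA/cuDNN versions TF was built against with the
# devices it can actually see at runtime.
build = tf.sysconfig.get_build_info()
print("tensorflow:", tf.__version__,
      "cuda", build.get("cuda_version"), "cudnn", build.get("cudnn_version"))

gpus = tf.config.list_physical_devices("GPU")
print("physical devices:", gpus)
for gpu in gpus:
    # get_device_details reports e.g. compute_capability and device_name
    print(tf.config.experimental.get_device_details(gpu))
print("logical devices:", tf.config.list_logical_devices("GPU"))
```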
but model load and inference are slow to the point of being unusable, and the T4 is supposed to be pretty good!
for example, model load on my home RTX 3060 takes ~13 sec, while on GCP with the T4 it's ~67 sec (the model is stored on local disk; no network transfers are involved)
and a single inference on the RTX 3060 takes <1 sec, while on GCP with the T4 it's ~37 sec when it completes at all (most of the time it hangs and the process gets killed)
same with a simple run of tf.test.Benchmark().run_op_benchmark:
- RTX 3060 is <9 sec
- Tesla T4 is >20 sec
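The benchmark run was roughly like the sketch below (the matmul op is an assumed stand-in for whatever op was actually timed; run_op_benchmark is TF1-style, so it needs a v1 Session):

```python
import tensorflow as tf

# run_op_benchmark runs the op repeatedly in a session and reports
# average wall time per iteration.
tf.compat.v1.disable_eager_execution()
with tf.compat.v1.Session() as sess:
    a = tf.random.normal([2048, 2048])
    op = tf.linalg.matmul(a, a)
    result = tf.test.Benchmark().run_op_benchmark(sess, op, min_iters=10)
    print("iters:", result["iters"], "wall_time per run:", result["wall_time"])
```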
this seems like too big a slowdown to be just T4 performance
any ideas?
UPDATE:
First, confirming that the GPU is definitely being used, as
nvidia-smi dmon
shows clearly.
Second, the occasional hangs/aborts were due to excessive OS swapping.
It seems that having less system memory (8 GB) than GPU memory (16 GB) is a no-go; this is now resolved.
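In case it helps anyone hitting the same hangs, this is one way to spot the swapping described above (standard Linux tools, nothing GCP-specific):

```shell
# Compare total RAM vs swap in use
free -h
# Watch the si/so columns for 5 seconds: sustained nonzero values
# (pages swapped in/out per second) mean the OS is actively swapping
vmstat 1 5
```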
But... the overall performance of the T4 is still pretty underwhelming: about 2x slower than a low-end RTX 3060.
That seems to be about what's to be expected, given its older architecture (Turing vs Ampere) and slightly fewer CUDA cores.