@tensorflow/tfjs-node-gpu works with NVIDIA P4 but fails with V100 on GKE


My tfjs-node-gpu code works well on an NVIDIA P4 on GKE (and with WebGL in a browser), but it fails on a V100 and a T4.

Node crashes on the first predict call inside my warmup. I'm using small 128x128 tiles to predict a 4x image upscale using the idealo-gans. The V100 initializes fine, shows up in nvidia-smi, is listed as a TF device, and the NUMA setup looks fine. The predict call just hard-crashes my Node.js Express server. I'm having trouble finding the crash stack, since the server is started in a Docker container and my last attempt to capture the crash from stderr failed.
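To narrow down a hard crash like this without a stack trace, one option might be to run the warmup in a separate child process, so the parent survives and can report how the child ended. The sketch below uses only Node's built-in child_process module. The name runIsolated and the inline stand-in script are hypothetical. In the real setup the script would load the model and make the failing predict call.

```javascript
// Sketch only: run a risky step in a child Node process and report how it ended.
// runIsolated and the stand-in script below are hypothetical names for illustration.
const { spawnSync } = require('child_process');

function runIsolated(scriptSource) {
  // process.execPath is the path of the currently running node binary.
  const result = spawnSync(process.execPath, ['-e', scriptSource], {
    encoding: 'utf8', // return stdout/stderr as strings instead of Buffers
  });
  return {
    status: result.status, // exit code, or null if the child was killed by a signal
    signal: result.signal, // signal name if the child was killed by one, else null
    stdout: result.stdout,
    stderr: result.stderr,
  };
}

// Stand-in for the real warmup: write to stderr and exit with a non-zero code.
const demo = runIsolated('console.error("warmup failed"); process.exitCode = 3;');
console.log('status:', demo.status, 'signal:', demo.signal);
console.log('stderr:', demo.stderr.trim());
```

If the child dies from a native crash, signal should name the signal that killed it, and whatever the native layer wrote to stderr before dying ends up in the stderr field, which the parent can then log normally.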

I've tried both the latest tfjs-node-gpu 3.0 and 2.8.5. GKE is configured to install the NVIDIA drivers, currently 410.104, and CUDA 10.0.

I've tried enabling debug mode and passing {verbose: true} to the failing model.predict() call in my warmup function. Neither produced any output for the warmup call, which is odd, since I do see output from the actual, non-warmup call to model.predict().

Any suggestions on how to debug further?
