NVIDIA error on Azure DSVM/DLVM

I have been creating several Ubuntu DSVMs and DLVMs with GPUs on Azure, and I keep getting intermittent errors. Either nvidia-smi is very slow, or it fails with the following error:

2018/01/11 19:42:33 Error: nvml: Driver/library version mismatch

The error appears when I run nvidia-smi or nvidia-docker. A reboot usually fixes it, but it can come back.

Does this sound like an intermittent error? Is there something that I can do to mitigate this?

There is 1 best solution below

NVIDIA just released a new version of the GPU driver for the GPUs used in Azure. The Ubuntu DSVM is configured to install updates automatically, so the new driver is installed for you in the background. The catch is that the driver runs as a kernel module, and the old module stays loaded until you reboot. The message "Driver/library version mismatch" means that the driver version loaded in the kernel no longer matches the installed user-space libraries, because the libraries were upgraded underneath it. This is why rebooting usually fixes it.
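To see whether you are in this state without relying on nvidia-smi, you can check which driver version the running kernel has loaded and whether Ubuntu has flagged a pending reboot. The following is a minimal sketch, not part of the DSVM tooling. It assumes the loaded module reports itself in /proc/driver/nvidia/version with the classic "NVRM version: ... Kernel Module  <version>" line, and that the update tooling creates /var/run/reboot-required, as stock Ubuntu does. The function names are hypothetical.

```shell
#!/bin/sh
# Print the driver version found in an NVRM version file.
# On a machine with the NVIDIA module loaded, pass /proc/driver/nvidia/version.
# Assumes the classic "NVRM version: ... Kernel Module  <version> ..." format.
loaded_driver_version() {
    sed -n 's/^NVRM version:.*Kernel Module *\([0-9][0-9.]*\).*/\1/p' "$1"
}

# Succeed when the reboot flag file exists.
# The path is a parameter; it defaults to Ubuntu's usual flag file.
reboot_pending() {
    [ -e "${1:-/var/run/reboot-required}" ]
}

# Report the current state; each check is skipped if the file is absent.
if [ -r /proc/driver/nvidia/version ]; then
    echo "Loaded kernel module: $(loaded_driver_version /proc/driver/nvidia/version)"
fi
if reboot_pending; then
    echo "A reboot is pending; a driver update may be waiting for it."
fi
```

If the loaded version is older than the driver package that was just installed, the mismatch error is expected until the next reboot.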

There is a second issue you might be facing: Azure released a new kernel a few days ago that is incompatible with version 387 of the GPU driver. The DSVM does not ship this driver by default, but you might have picked it up by installing other packages. This error is different: it reads something like "nvidia-smi could not communicate with the nvidia module". The only way to fix it is to (1) get the very latest kernel with apt update and apt upgrade, then reboot, and (2) install a different driver with apt install nvidia-384.
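The two steps above can be sketched as shell functions. This is an illustration only: the apt commands and the nvidia-384 package name come from the answer, while the use of sudo, the function names, and the DRY_RUN switch are assumptions added here. By default the functions only print what they would run; set DRY_RUN=0 to execute the commands for real.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: pull in the latest kernel, then reboot so it is actually running.
upgrade_kernel() {
    run sudo apt update
    run sudo apt upgrade
    run sudo reboot
}

# Step 2: after the reboot, install the 384 driver in place of 387.
install_384_driver() {
    run sudo apt install nvidia-384
}
```

Call upgrade_kernel first, log back in once the VM has rebooted, then call install_384_driver. Running nvidia-smi afterwards is a quick way to confirm that the driver and libraries agree again.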