Communicate with the NVIDIA driver after kernel update


I'm running Ubuntu 20.04. I updated my kernel and rebooted and now nvidia-smi returns:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The kernel version is 5.13.0-35-generic.

nvidia-driver is managed by DKMS, which I'm not super familiar with - though I am under the impression that it is meant to stop this kind of problem from happening.

dkms status returns:

    nvidia, 455.45.01, 5.4.0-58-generic, x86_64: installed
    nvidia, 455.45.01, 5.8.0-36-generic, x86_64: installed
    nvidia, 455.45.01, 5.8.0-38-generic, x86_64: installed

So it's looking like there isn't an entry for the current kernel.
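That check can also be scripted. The following is a minimal sketch, assuming the comma-separated `dkms status` format shown above (other DKMS versions may print `module/version` instead); `dkms_has_kernel` is a hypothetical helper name, not part of DKMS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dkms status` text on stdin and succeed if
# module $1 is listed as installed for kernel $2.
# Assumes lines of the form: "nvidia, 455.45.01, 5.4.0-58-generic, x86_64: installed"
dkms_has_kernel() {
    local module="$1" kernel="$2"
    awk -v m="$module" -v k="$kernel" -F', ' '
        $1 == m && $3 == k && /: installed/ { found = 1 }
        END { exit !found }
    '
}

# Only query DKMS if it is actually available on this machine.
if command -v dkms >/dev/null 2>&1; then
    if dkms status | dkms_has_kernel nvidia "$(uname -r)"; then
        echo "nvidia module is installed for $(uname -r)"
    else
        echo "no nvidia module installed for $(uname -r)"
    fi
fi
```

The exact string comparison in awk avoids treating the dots in the kernel version as regex wildcards.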

So far I've tried to rebuild nvidia-driver with the current kernel by running sudo dpkg-reconfigure nvidia-driver-455. This runs, but doesn't change anything (including after rebooting).
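DKMS can also be asked directly to build and install one module version for one kernel. The sketch below only prints the command so it can be reviewed first; `dkms_install_cmd` is a hypothetical helper, and the module name and version are the ones reported by `dkms status`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the dkms command that builds and
# installs one module version for one kernel.
# The kernel defaults to the one currently running.
dkms_install_cmd() {
    local module="$1" version="$2" kernel="${3:-$(uname -r)}"
    printf 'dkms install -m %s -v %s -k %s\n' "$module" "$version" "$kernel"
}

# Module name and version taken from the `dkms status` output.
dkms_install_cmd nvidia 455.45.01
```

The printed command needs to be run with sudo. If the module source does not compile against the target kernel, this is expected to fail with the same build error as any other DKMS rebuild.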

I also tried rebuilding all DKMS modules for all installed kernels with ls /var/lib/initramfs-tools | sudo xargs -n1 /usr/lib/dkms/dkms_autoinstaller start as suggested here: https://askubuntu.com/questions/53364/command-to-rebuild-all-dkms-modules-for-all-installed-kernels. This returns the following error:

    Kernel preparation unnecessary for this kernel.  Skipping...
    applying patch disable_fstack-clash-protection_fcf-protection.patch...patching file Kbuild
    Hunk #1 succeeded at 84 (offset 13 lines).

    Building module:
    cleaning build area...
    unset ARCH; [ ! -h /usr/bin/cc ] && export CC=/usr/bin/gcc; env NV_VERBOSE=1 'make' -j16 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.13.0-35-generic IGNORE_XEN_PRESENCE=1 IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/5.13.0-35-generic/build LD=/usr/bin/ld.bfd modules.....(bad exit status: 2)
    ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-dkms-455.0.crash'
    Error! Bad return status for module build on kernel: 5.13.0-35-generic (x86_64)
    Consult /var/lib/dkms/nvidia/455.45.01/build/make.log for more information.
    Module nvidia/455.45.01 already installed on kernel 5.4.0-58-generic/x86_64
    Module nvidia/455.45.01 already installed on kernel 5.8.0-36-generic/x86_64
    Module nvidia/455.45.01 already installed on kernel 5.8.0-38-generic/x86_64

I think this error might be something to do with the unset ARCH, but I'm not sure what that is?
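As far as the output shows, `unset ARCH` is just the start of the build command line that DKMS echoes, and the "(bad exit status: 2)" applies to the whole `make` run; the underlying compiler error should be in the make.log the output points to. Here is a minimal sketch for pulling the first real error lines out of that log. `first_build_errors` is a hypothetical helper, and the patterns assume gcc-style `error:` messages and GNU make `*** ... Error N` lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first N (default 5) lines of a build log
# that look like compiler or make errors, prefixed with their line numbers.
# Matching "error: " rather than "error" skips -Werror=... compiler flags,
# which appear on almost every line of a verbose build.
first_build_errors() {
    local log="$1" count="${2:-5}"
    grep -n -E -m "$count" 'error: |\*\*\*.* Error [0-9]+' "$log"
}

# Path taken from the DKMS output; adjust for other modules or versions.
log=/var/lib/dkms/nvidia/455.45.01/build/make.log
if [ -r "$log" ]; then
    first_build_errors "$log"
fi
```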

Finally I've tried the switch-it-on-and-off-again equivalent sudo apt-get remove nvidia-driver-455; sudo apt-get install nvidia-driver-455, which runs, but doesn't solve the problem.

Any help would be amazing - thanks!


There are 2 best solutions below


I had trouble with the drivers provided by my distribution, so I resorted to installing the drivers from NVIDIA directly, which is a bit cumbersome if Secure Boot is enabled on your machine. You can read about how to do that here. I was also facing the issue of the driver not being loaded after kernel updates, so I wrote a script that automatically installs the latest driver, which you can find here. The driver's README states:

If you upgrade your kernel, then the simplest solution is to reinstall the driver.


I was receiving the following error, "NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.19.0-20-amd64 IGNORE_CC_MISMATCH", while installing the NVIDIA vGPU driver on Debian 10 with a 4.x kernel. I managed to fix it by doing the following:

1. I installed Proxmox. After this, the NVIDIA driver error changes (check that vfio is configured). Then I rebooted the server.
2. Next I got an error about the PVE headers, so I downloaded the .deb header file that was causing the error.
3. That finally fixed the error for me, but now I'm stuck on another error :-) I'm working on that too.
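A header error like the one mentioned above can be checked for before starting a build. The build log in the question passes `SYSSRC=/lib/modules/<kernel>/build`, so a missing or dangling `build` entry suggests the matching headers package is absent. This is a minimal sketch; `kernel_headers_present` and its optional root-prefix argument are hypothetical and only there to make the function easy to try out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the kernel build tree that the module
# build uses (/lib/modules/<kernel>/build) exists.
# $2 is an optional root prefix, e.g. a scratch directory for experiments.
kernel_headers_present() {
    local kernel="$1" root="${2:-}"
    # -e follows symlinks, so a dangling "build" link counts as missing.
    [ -e "${root}/lib/modules/${kernel}/build" ]
}

if kernel_headers_present "$(uname -r)"; then
    echo "build tree found for $(uname -r)"
else
    echo "no build tree for $(uname -r); the matching headers package may be missing"
fi
```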