I have used a Compute Engine VM with a T4 GPU on COS for quite some time, and it worked fine until recently, when cos-extensions install gpu stopped working as it did before.
I0830 07:32:58.419130 987 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0830 07:32:58.427417 987 install.go:74] Running on COS build id 16108.470.16
I0830 07:32:58.427566 987 installer.go:187] Getting the default GPU driver version
I0830 07:32:58.427911 987 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548403 987 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548594 987 install.go:85] Installing GPU driver version 450.119.04
I0830 07:32:58.549646 987 cache.go:72] map[BUILD_ID:16108.470.11 DRIVER_VERSION:450.119.04]
I0830 07:32:58.549674 987 install.go:120] Did not find cached version, installing the drivers...
I0830 07:32:58.549681 987 installer.go:82] Configuring driver installation directories
I0830 07:32:58.563327 987 installer.go:196] Updating container's ld cache
I0830 07:32:58.793692 987 signature.go:30] Downloading driver signature for version 450.119.04
I0830 07:32:58.793721 987 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.470.16/extensions/gpu/450.119.04.signature.tar.gz
E0830 07:32:58.828902 987 artifacts.go:106] Failed to download extensions/gpu/450.119.04.signature.tar.gz from public GCS: failed to download 450.119.04.signature.tar.gz, status: 404 Not Found
E0830 07:32:58.829401 987 install.go:175] failed to download driver signature: failed to download driver signature for version 450.119.04: failed to download extensions/gpu/450.119.04.signature.tar.gz
It seems the installer could not find the driver signature. I looked into this and tried the workaround by running
/usr/bin/docker run --rm \
--privileged \
--net=host \
--pid=host \
--volume /dev:/dev \
--volume /:/root \
--volume /var/lib/toolbox/nvidia:/usr/local/nvidia \
--env NVIDIA_DRIVER_VERSION=450.119.04 \
gcr.io/cos-cloud/cos-gpu-installer:latest
but got this instead
+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.119.04
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO 2021-08-30 07:36:38 UTC] PRELOAD: false
[INFO 2021-08-30 07:36:38 UTC] Running on COS build id 16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO 2021-08-30 07:36:38 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO 2021-08-30 07:36:38 UTC] Checking cached version
/entrypoint.sh: line 172: CACHE_BUILD_ID: unbound variable
It seems like some changes are going on with COS and the COS GPU driver (maybe?), but I just want to know whether there is a workaround for this problem other than waiting for GCP to sort things out.
This is the same case as the one Jan Vansteenlandt linked to.
This happens in some versions of COS. For example, on the latest stable COS version available now (89-16108), there's no driver listed under [gpu], and running cos-extensions install gpu ends the same way as in your case. Running the docker container you mentioned also yielded the same results.
This is a known issue and has already been raised on IssueTracker. You can follow the link and click the +1 button; you can also comment and post your own findings in the thread. There's also a workaround in the thread, so you may give it a go.
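To tell quickly whether a given COS build is affected, you can check whether the signature artifact the installer asks for actually exists. The sketch below is only an illustration: the URL pattern is copied from the installer log in the question, the function names signature_url and signature_exists are made up for this example, and it assumes curl is available on the machine you run it from.

```shell
#!/bin/bash
# Hypothetical helpers (not part of cos-extensions): build the signature URL
# using the pattern seen in the installer log from the question.
signature_url() {
  local build_id="$1" driver_version="$2"
  printf 'https://storage.googleapis.com/cos-tools/%s/extensions/gpu/%s.signature.tar.gz\n' \
    "$build_id" "$driver_version"
}

# Exit status 0 if the artifact can be fetched, non-zero otherwise (e.g. 404).
signature_exists() {
  # --head requests headers only; --fail makes curl return non-zero on HTTP errors.
  curl --silent --head --fail --output /dev/null "$(signature_url "$1" "$2")"
}

# Example call (needs network access), using the values from the log:
#   signature_exists 16108.470.16 450.119.04 && echo "found" || echo "missing"
```

If the check fails for your build id and driver version, cos-extensions install gpu will most likely stop at the same "failed to download driver signature" step, so it may be easier to pick a COS version where the check passes.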
If you can use an older version of COS (85-13310, for example), the driver is listed, and running cos-extensions install gpu results in a successful installation of the NVIDIA drivers.