How to check for which CUDA compute capabilities kernels are available?


Is there a way to check at runtime for which CUDA compute capabilities the current program was compiled? Or do the arch=compute_xx,code=sm_xx flags set any defines which could be checked?

Background is that I cannot make sure that users have a "correct" setup for a deployed binary. For that, I'd like to compare their device's CC (obtained via cudaGetDeviceProperties) against the CCs the binary was compiled for before starting a lengthy initialization. Up until now, all I can do is launch a kernel and use cudaPeekAtLastError to exit when no kernel image is available for execution on the device.
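A minimal sketch of that probe (the kernel name probeKernel is just a placeholder):

#include <cstdlib>
#include <iostream>

// Empty probe kernel; its only purpose is to test whether a matching
// kernel image exists in the fatbinary for the current device.
__global__ void probeKernel() {}

int main()
{
    probeKernel<<<1, 1>>>();
    // Peek rather than get, so the last-error state is not reset.
    cudaError_t err = cudaPeekAtLastError();
    if (err != cudaSuccess) {
        std::cout << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}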

EDIT: I've tried using cudaFuncGetAttributes like so (main.cu):

#include <cstdlib>
#include <iostream>

__global__ void cudaKernel() {}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, cudaKernel);
    if (err != cudaSuccess) {
        std::cout << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

This works fine on a Linux machine with CUDA 11.2 and a GTX 1080 (CC 6.1). However, running on my local Windows machine (CUDA 11.8, GT 730 with CC 3.5) this returns CUDA Error: invalid device function, which the documentation describes as:

The requested device function does not exist or is not compiled for the proper device architecture.

Using cuobjdump on the exe tells me:

Fatbin ptx code:
================
arch = sm_35
code version = [7,8]
host = windows
compile_size = 64bit
compressed

Fatbin elf code:
================
arch = sm_35
code version = [1,7]
host = windows
compile_size = 64bit

I'm scratching my head on why cuobjdump is able to determine the arch whereas cudaFuncGetAttributes fails, even though it should be able to execute that kernel on my CC 3.5 device. I'm running driver version 456.71, which should work with CUDA 11.x.

1 Answer

The only way is what you are already doing, i.e. try to examine the function using cudaFuncGetAttributes, and catch the error if the function can't load because of an architecture mismatch or the lack of a JIT path to execute.
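As a rough sketch of that pattern (the kernelImageAvailable helper is illustrative, not a library API), with the architecture-mismatch error handled explicitly:

#include <cstdlib>
#include <iostream>

__global__ void cudaKernel() {}

// Illustrative helper: returns true if a usable kernel image
// (SASS for this device, or JIT-able PTX) could be loaded.
bool kernelImageAvailable()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, cudaKernel);
    if (err == cudaSuccess)
        return true;
    // "invalid device function" is what an architecture mismatch produces.
    if (err != cudaErrorInvalidDeviceFunction)
        std::cerr << "Unexpected CUDA error: " << cudaGetErrorString(err) << std::endl;
    cudaGetLastError(); // reset the error state before continuing
    return false;
}

int main()
{
    if (!kernelImageAvailable())
        return EXIT_FAILURE; // skip the lengthy initialization entirely
    return EXIT_SUCCESS;
}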

I'm scratching my head on why cuobjdump is able to determine the arch whereas cudaFuncGetAttributes fails, even though it should be able to execute that kernel on my CC 3.5 device.

The underlying reason is that the only way the runtime or driver API can examine the payload of an object or cubin file is to try to load it into the context, and the act of loading is what fails when you have an incompatible architecture. It isn't that cudaFuncGetAttributes doesn't work; it is that the API fails while trying to load a cubin that can't be loaded into your context because of the architecture mismatch.

cuobjdump, on the other hand, is showing you the ELF headers in the files, using an ELF parsing library (NVIDIA's own, which I don't believe is public, but I could be wrong). If I remember correctly, there is a proprietary ELF section called nvinfo which contains all the CUDA-specific metadata. cuobjdump doesn't use the driver or runtime API at all. If you were really desperate to do this, you might be able to use some kind of ELF parsing library to extract that metadata and replicate the functionality yourself, although that would be a lot of reverse engineering to do well.
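A much cruder alternative, sketched below, is to shell out to cuobjdump itself at runtime and scan its output for the arch = sm_XX header lines shown in the question. Everything here is an assumption: cuobjdump must be on PATH, the --dump-ptx flag choice and the compiledArchs helper are mine, and the header format is not a documented stable interface.

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <set>
#include <string>

#ifdef _WIN32
#define POPEN _popen
#define PCLOSE _pclose
#else
#define POPEN popen
#define PCLOSE pclose
#endif

// Sketch: run cuobjdump on a binary and collect the "arch = sm_XX"
// header lines from its output.
std::set<std::string> compiledArchs(const std::string& binaryPath)
{
    std::set<std::string> archs;
    std::string cmd = "cuobjdump --dump-ptx \"" + binaryPath + "\"";
    FILE* pipe = POPEN(cmd.c_str(), "r");
    if (!pipe)
        return archs;
    char line[256];
    while (std::fgets(line, sizeof(line), pipe)) {
        std::string s(line);
        std::size_t pos = s.find("arch = ");
        if (pos != std::string::npos) {
            std::string arch = s.substr(pos + 7);
            arch.erase(arch.find_last_not_of(" \r\n") + 1); // trim line ending
            archs.insert(arch); // e.g. "sm_35"
        }
    }
    PCLOSE(pipe);
    return archs;
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <binary>" << std::endl;
        return EXIT_FAILURE;
    }
    for (const std::string& arch : compiledArchs(argv[1]))
        std::cout << arch << std::endl;
    return EXIT_SUCCESS;
}

The parsed names could then be compared against the major.minor value reported by cudaGetDeviceProperties before committing to the lengthy initialization, though the cudaFuncGetAttributes probe above remains the more robust check.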