Can I combine a "static" CUDA kernel launch with PTX code and get a working binary?


Suppose I take a CUDA program - for example, the CUDA vectorAdd sample - and cut out the kernel's implementation, but still have the launch command:

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

and suppose that I write my own PTX, since I'm a DIY kind of guy, so that I now have vectorAdd.cu without the kernel's CUDA code, plus a vectorAdd.ptx.

Can I now generate an executable which will work like the unmodified vectorAdd would, but running the code in the PTX?

(Assume for the sake of discussion that the PTX doesn't try anything funny or do anything wrong.)
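
To make the setup concrete, the stripped vectorAdd.cu might look roughly like this (a sketch; the signature is the one from the CUDA sample, host-side setup abbreviated). As written it will of course not link on its own - supplying the device code from the PTX is the whole point of the question:

// vectorAdd.cu with the kernel body cut out - only the declaration remains
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements);

int main()
{
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    // (host allocation, initialization, cudaMemcpy and result checks omitted)

    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaDeviceSynchronize();

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}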

Notes:

  • This question is a variant on:

    How can I create an executable to run a kernel in a given PTX file?

    Except that, in that question, the poster was willing to dynamically load and compile a PTX file at run time using the driver API. Here, that's not an option: the C++ code uses a triple-chevron CUDA runtime launch, and this must not change. (The driver-API route being ruled out is sketched after these notes.)

  • I don't mind the process of creating the executable involving the generation of other files, e.g. a cubin.
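
For contrast, this is roughly the driver-API route that the linked question allows and that is ruled out here (a sketch; cuInit/context setup and error checking omitted, and the function name passed to cuModuleGetFunction must match the .entry name in the PTX):

// Driver-API route (NOT what I want): load and JIT the PTX at run time
#include <cuda.h>

void launchFromPtxWithDriverApi(CUdeviceptr d_A, CUdeviceptr d_B, CUdeviceptr d_C,
                                int numElements, int blocksPerGrid, int threadsPerBlock)
{
    CUmodule   module;
    CUfunction kernel;
    cuModuleLoad(&module, "vectorAdd.ptx");            // JIT-compiles the PTX at run time
    cuModuleGetFunction(&kernel, module, "vectorAdd"); // name as it appears in the PTX
    void *args[] = { &d_A, &d_B, &d_C, &numElements };
    cuLaunchKernel(kernel, blocksPerGrid, 1, 1,
                   threadsPerBlock, 1, 1,
                   0, nullptr, args, nullptr);
}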

1 Answer

EDIT/Update

I had to change some of the procedure for a newer toolkit; see further below for the old description.


Some details seem to have changed with newer toolkit versions, and I created a bit more documentation along the way.

Tested on Windows with VS 2022 (compiler version 19.31.31104), an sm_75 Turing GPU and nvcc 11.7, doing a debug build. It should also work on Linux and/or with other host compilers.

Steps 1 and 2

We can start with full kernels and patch the PTX or SASS, or we can create stub kernels.

Here I use the example project from the VS CUDA plugin: a main() and an addWithCuda() helper function for the host, plus an addKernel() __global__ device kernel, all in a single file kernel.cu (a rough sketch follows below).
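
A rough sketch of that kernel.cu (modelled on the VS template project; error handling and some details abbreviated):

// kernel.cu (sketch of the VS CUDA template project)
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];   // full kernel; alternatively leave the body empty as a stub
}

// helper that allocates device memory, copies, and launches addKernel<<<1, size>>>
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc(&dev_c, size * sizeof(int));
    cudaMalloc(&dev_a, size * sizeof(int));
    cudaMalloc(&dev_b, size * sizeof(int));
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);   // triple-chevron runtime launch

    cudaError_t status = cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_c); cudaFree(dev_a); cudaFree(dev_b);
    return status;
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };
    addWithCuda(c, a, b, arraySize);
    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]);
    return 0;
}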

Step 3

We copy the command line from Visual Studio, remove the paths, and add the --keep option; we also remove -x cu and the output-file option -o kernel.cu.obj so that less has to be changed later on:

nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include"  -G  --keep -maxrregcount=0  --machine 64 --compile -cudart static  -g  -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd" kernel.cu

This command should be run from a command prompt with the x64 Visual Studio compiler environment set up (e.g. the x64 Native Tools Command Prompt).

The following files are generated:

kernel.cpp1.ii (raw code after the preprocessor, one set of preprocessor macros)
kernel.cpp1.ii.res (command-line parameters for the preprocessor)
kernel.cpp4.ii (raw code after the preprocessor, a different set of preprocessor macros)
kernel.cpp4.ii.res (command-line parameters for the preprocessor)
kernel.cu.obj (compiled code from C++ and CUDA)
kernel.cu.obj.res (command-line parameters)
kernel.cudafe1.c (runtime stub, includes kernel.cudafe1.stub.c)
kernel.cudafe1.cpp (C++ code after the preprocessor; the CUDA code remains only as declarations (the rest is #if 0'd out) and launches go through __cudaPushCallConfiguration)
kernel.cudafe1.gpu (CUDA code after the preprocessor, C++ code removed)
kernel.cudafe1.stub.c (runtime stub to register and launch the kernel)
kernel.fatbin (fatbin of the kernel)
kernel.fatbin.c (dump of kernel.fatbin as a C array)
kernel.module_id (ID identifying the kernel)
kernel.ptx (the PTX from the original .cu)
kernel.sm_75.cubin (the PTX compiled to a cubin, from the original .cu)
vc143.pdb (Visual Studio program database - host stuff)

The important files to keep are kernel.cudafe1.cpp and kernel.cudafe1.stub.c, which enable the registration and runtime launch of the kernel.
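
These two files matter because the stub's registration code is what ties the host-side addKernel symbol to the matching entry in the embedded fatbin, so that a plain runtime launch finds the device code. As a rough illustration (not the literal generated code), the triple-chevron launch amounts to a call through the public runtime API:

#include <cuda_runtime.h>

__global__ void addKernel(int *c, const int *a, const int *b);   // registered by the stub

// Roughly what addKernel<<<1, size>>>(dev_c, dev_a, dev_b) boils down to
void launchExplicitly(int *dev_c, const int *dev_a, const int *dev_b, unsigned int size)
{
    void *args[] = { &dev_c, &dev_a, &dev_b };
    cudaLaunchKernel((const void *)addKernel,   // host symbol, mapped to the fatbin entry
                     dim3(1), dim3(size),       // grid and block dimensions
                     args,
                     0,                         // dynamic shared memory
                     nullptr);                  // default stream
}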

Step 4

The kernel.ptx can be used as a skeleton for recreating the PTX code; alternatively the kernel.sm_75.cubin, or directly the kernel.fatbin.c, can be used for patching the SASS code. A hand-written PTX sketch follows below.
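
For orientation, hand-written PTX for addKernel might look roughly like the following. This is only a sketch modelled on non-debug compiler output; your own kernel.ptx is the authoritative template for the .version/.target directives and parameter layout, and the .entry name must match the mangled name that kernel.cudafe1.stub.c registers (assumed here to be _Z9addKernelPiPKiS1_):

//
// Hand-written PTX sketch for: __global__ void addKernel(int *c, const int *a, const int *b)
//
.version 7.7
.target sm_75
.address_size 64

.visible .entry _Z9addKernelPiPKiS1_(
    .param .u64 _Z9addKernelPiPKiS1__param_0,   // c
    .param .u64 _Z9addKernelPiPKiS1__param_1,   // a
    .param .u64 _Z9addKernelPiPKiS1__param_2    // b
)
{
    .reg .b32   %r<5>;
    .reg .b64   %rd<11>;

    ld.param.u64        %rd1, [_Z9addKernelPiPKiS1__param_0];
    ld.param.u64        %rd2, [_Z9addKernelPiPKiS1__param_1];
    ld.param.u64        %rd3, [_Z9addKernelPiPKiS1__param_2];
    cvta.to.global.u64  %rd4, %rd1;             // c
    cvta.to.global.u64  %rd5, %rd2;             // a
    cvta.to.global.u64  %rd6, %rd3;             // b
    mov.u32             %r1, %tid.x;            // i = threadIdx.x
    mul.wide.s32        %rd7, %r1, 4;           // byte offset i * sizeof(int)
    add.s64             %rd8, %rd5, %rd7;
    ld.global.u32       %r2, [%rd8];            // a[i]
    add.s64             %rd9, %rd6, %rd7;
    ld.global.u32       %r3, [%rd9];            // b[i]
    add.s32             %r4, %r2, %r3;
    add.s64             %rd10, %rd4, %rd7;
    st.global.u32       %r4, [%rd10];           // c[i] = a[i] + b[i]
    ret;
}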

Step 5

Now we assemble the .ptx file, replacing --compile with -fatbin -dlink and changing the input file from kernel.cu to kernel.ptx:

nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include"  -G  --keep -maxrregcount=0  --machine 64 -cudart static  -g  -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd " -fatbin -dlink kernel.ptx

We get:

a_dlink.fatbin
a_dlink.sm_75.cubin
a_dlink.sm_75.cubin.optf
kernel.fatbin
kernel.fatbin.c (fatbin dump from new ptx)
kernel.obj
kernel.obj.res
kernel.sm_75.cubin
vc143.pdb

Note: instead of a .ptx file, a .cubin file can also be turned into a .fatbin.c with -fatbin -dlink, but then I had to specify a non-virtual compute architecture, e.g. with --gpu-architecture=sm_75.
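
An untested sketch of that variant (keep the remaining options from the full command lines above):

nvcc --gpu-architecture=sm_75 --keep -fatbin -dlink kernel.sm_75.cubin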

Now we have the host-side files together: kernel.fatbin.c (from the new PTX), plus kernel.cudafe1.cpp and kernel.cudafe1.stub.c (from step 3).

Steps 6 to 8

We can put those through host compilation:

nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "D:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX64\x64" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include" -I"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include"  -G  --keep -maxrregcount=0  --machine 64 --compile -cudart static  -g  -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdvc143.pdb /FS /Zi /RTC1 /MDd" --compile kernel.cudafe1.cpp input\kernel.fatbin.c

Step 9

and through the host linker:

link /OUT:"kernel.exe" /MANIFEST /NXCOMPAT /DYNAMICBASE "cudart_static.lib" "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" "cudart.lib" "cudadevrt.lib" /DEBUG /MACHINE:X64 /INCREMENTAL /SUBSYSTEM:CONSOLE /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /ERRORREPORT:PROMPT /NOLOGO /LIBPATH:"D:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\x64" /TLBID:1 kernel.cudafe1.obj kernel.fatbin.obj

Finished

This creates kernel.exe.


Old Description

Perhaps still relevant with other toolkit versions.

  1. Define your function as
    __global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements);
    in a header visible to the caller
  2. Create a file vectorAdd.cu with a stub definition that has an empty body
    __global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements) {}
    (see the sketch after this list)
  3. Call
    nvcc --keep vectorAdd.cu
    with suitable options
  4. Replace vectorAdd.ptx with your version
  5. Call
    nvcc -fatbin -dlink
    to create the fatbin and cubin files
  6. Call nvcc -link to link the .cubin file and the .cudafe1.cpp or .cudafe1.c file (depending on the language); these in turn include the .cudafe1.stub.c and .fatbin.c files
  7. Use the resulting .obj or .o file (Windows/Linux) in your project
  8. Call vectorAdd<<<>>> in the CUDA runtime way
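
A minimal sketch of steps 1, 2 and 8 put together (file names and the void* signature as above; blocksPerGrid/threadsPerBlock are placeholders):

// vectorAdd.h - declaration visible to the caller (step 1)
__global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements);

// vectorAdd.cu - stub whose generated PTX gets replaced by the hand-written one (steps 2-4)
#include "vectorAdd.h"
__global__ void vectorAdd(void* d_A, void* d_B, void* d_C, int numElements) {}

// caller.cu - launches the kernel in the CUDA runtime way (step 8)
#include "vectorAdd.h"
void launch(void* d_A, void* d_B, void* d_C, int numElements,
            int blocksPerGrid, int threadsPerBlock)
{
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
}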

(As an advanced DIY guy you will want to write SASS code in the future, which is the device-specific, lower-level assembly language.)