Suppose I take a CUDA program, for example the CUDA vectorAdd sample, and cut out the kernel's implementation, but keep the launch command:
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
and suppose that I write my own PTX, since I'm a DIY kind of a guy, so that now I have vectorAdd.cu without the kernel's CUDA code, and vectorAdd.ptx.
Can I now generate an executable which will work like the unmodified vectorAdd would, but running the code in the PTX?
(Assume for the sake of discussion that the PTX doesn't try anything funny or do anything wrong.)
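To make the setup concrete, here is a minimal sketch of what the stripped-down vectorAdd.cu might look like. It assumes the kernel is kept as an empty-bodied stub rather than deleted outright, so that nvcc still has something to register and launch. The helper blocksFor and the host buffer setup are illustrative and not taken from the sample verbatim.

```cuda
// vectorAdd.cu (sketch): host code kept, kernel body cut out.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Empty stub so nvcc still generates the host-side launch stub and the
// registration code. A replacement PTX entry would need the matching name;
// for this C++ signature the mangled name is _Z9vectorAddPKfS0_Pfi.
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {}

// Illustrative helper (an assumption, not part of the sample):
// number of blocks needed to cover n elements, rounded up.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int numElements = 50000;
    const size_t size = numElements * sizeof(float);
    std::vector<float> h_A(numElements, 1.0f), h_B(numElements, 2.0f), h_C(numElements, 0.0f);

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice);
    cudaMemset(d_C, 0, size);  // defined contents even if the kernel writes nothing

    const int threadsPerBlock = 256;
    const int blocksPerGrid = blocksFor(numElements, threadsPerBlock);

    // The triple-chevron runtime launch stays exactly as in the original program.
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    cudaError_t err = cudaGetLastError();             // launch-time errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "vectorAdd failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

Built as-is, the stub does nothing and C stays zeroed. The question is how to build this same file so that the launch runs the code in vectorAdd.ptx instead.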
Notes:
This question is a variant on:
How can I create an executable to run a kernel in a given PTX file?
Except that, in that question, the poster was willing to use the driver API to dynamically load and compile a PTX file. Here, that's not an option: the C++ code uses a triple-chevron CUDA runtime launch, and this must not change.
I don't mind the process of creating the executable involving the generation of other files, e.g. a cubin.
EDIT/Update
I had to change some of the procedures for a newer toolkit; see below for the old description.
Some details seem to have changed with new toolkit versions, and I added a bit more documentation along the way.
Tested on Windows with VS 2022 19.31.31104 with a sm_75 Turing GPU and nvcc 11.7, doing a debug build. It should also work on Linux and/or with other host compilers.
Step 1 and 2
We can start with full kernels and patch the PTX or SASS or we can create stub kernels.
Using the example project from the VS CUDA plugin here, the main() function, an addWithCuda() helper function for the host and an addKernel() global device kernel are all included in a single file, kernel.cu.
Step 3
Copying the command line from Visual Studio, we remove the paths and add the -keep option. We also remove -x cu and the output file option -o kernel.cu.obj, so that less has to change later on:
This command should be run on the command line with x64 paths for the Visual Studio compiler.
The following files are generated:
Important to keep are kernel.cudafe1.cpp and kernel.cudafe1.stub.c, which enable the registering and runtime launch of the kernel.
Step 4
The kernel.ptx can be used as a skeleton for recreating the PTX code. Alternatively, the kernel.sm_75.cubin, or directly the kernel.fatbin.c, can be used for patching the SASS code.
Step 5
Now we assemble the .ptx file with -fatbin -dlink and change the input file from kernel.cu to kernel.ptx:
We get
Note: Instead of a .ptx file, a .cubin file can also be turned into a .fatbin.c with -fatbin -dlink, but then I had to specify a non-virtual compute architecture, e.g. with --gpu-architecture=sm_75.
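If it is unclear which non-virtual architecture applies to a given machine, a small runtime query can report it. The following is a minimal sketch; the helper archName and the choice of device 0 are assumptions made for illustration.

```cuda
// Prints the real (non-virtual) architecture name of a GPU, e.g. sm_75.
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Illustrative helper (an assumption): builds the sm_XY name from the
// compute capability, e.g. (7, 5) -> "sm_75".
std::string archName(int major, int minor) {
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

int main() {
    cudaDeviceProp prop;
    // Device 0 is assumed here; multi-GPU machines may need another index.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: --gpu-architecture=%s\n", prop.name,
           archName(prop.major, prop.minor).c_str());
    return 0;
}
```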
Now we have the host files together: kernel.fatbin.c, kernel.cudafe1.cpp and kernel.cudafe1.stub.c.
Steps 6 to 8
We can put those through host compilation
Step 9
and through the host linker
Finished
creating kernel.exe.
Old Description
Perhaps still relevant with other toolkit versions.
(As an advanced DIY guy, you will want to write SASS code in the future, which is the device-specific lower-level assembly language.)