I'm trying to get some information about register usage in my CUDA kernels using the NVCC option --ptxas-options=-v. With __global__ functions everything is OK, but I'm having some difficulties with the __device__ ones, because the "ptxas info : Used N registers" line is missing from the output for them. I tried using the __noinline__ keyword and keeping the device functions in a separate file from the calling __global__ function. I thought NVCC was reporting the register usage of the __global__ function including the device functions inlined into it, but nothing changes. The only way I can get the register usage of the device functions is to define them as __global__.
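A minimal sketch of the single-file __noinline__ attempt described above. It assumes the CUDA toolkit's nvcc is on the PATH. The file name kernel.cu, the kernel function and the body of dot are hypothetical, chosen only for illustration:

```shell
#!/bin/sh
# Hypothetical reproduction: a __global__ kernel calling a __device__
# function marked __noinline__, compiled as a single file.
cat > kernel.cu <<'EOF'
// __noinline__ asks nvcc to keep dot() as a separate function
__device__ __noinline__ float dot(float a, float b)
{
    return a * b;
}

__global__ void kernel(const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = dot(x[i], y[i]);
}
EOF

if command -v nvcc >/dev/null 2>&1; then
    # -v is forwarded to ptxas, which prints a "Used N registers" line
    # for the kernel entry point but, as described above, not for dot()
    nvcc --ptxas-options=-v -c kernel.cu -o kernel.o
else
    echo "nvcc not found on PATH; install the CUDA toolkit to run this sketch" >&2
fi
```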
Do you have any suggestions?
Thanks!
As I understand it, ptxas (the device assembler) only outputs a register count for code that it links. Standalone __device__ functions are not linked by the assembler; they are only compiled. Therefore, the assembler won't emit a register count for device functions, and I don't believe there is a workaround for this.
However, it is still possible to get the register footprint of a __device__ function by dumping the ELF data from the assembler output using cuobjdump (cuobjdump -elf). To do this, compile a set of __device__ functions separately into a device object file, then run cuobjdump on that object file. It will emit a lot of output, but in it you will find a register count for each function. In my test, the second line of the output for the device function dot(float, float) showed that the function uses 6 registers. This is the only way I am aware of to examine device function register footprints.
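A minimal sketch of that workflow. It assumes a CUDA toolkit with nvcc and cuobjdump on the PATH. The file name device.cu and the three functions (dot, axpy, sq) are hypothetical stand-ins, not the original example. The -dc flag compiles relocatable device code into an object file without a device link step. Because of that, externally visible __device__ functions are kept even though no kernel calls them. The layout of the ELF dump can differ between toolkit versions, so search it for the mangled name (_Z3dotff for dot(float, float)) rather than relying on a fixed position:

```shell
#!/bin/sh
# Sketch: compile standalone __device__ functions as relocatable device code,
# then dump the embedded ELF to read per-function register counts.
# device.cu and the function bodies below are hypothetical examples.
cat > device.cu <<'EOF'
__device__ float dot(float a, float b)           { return a * b; }
__device__ float axpy(float a, float x, float y) { return a * x + y; }
__device__ float sq(float x)                     { return x * x; }
EOF

if command -v nvcc >/dev/null 2>&1; then
    # -dc: compile to relocatable device code (object file, no device linking)
    nvcc -dc device.cu -o device.o
    # Dump the ELF; look for the entries that mention _Z3dotff
    cuobjdump -elf device.o
else
    echo "nvcc not found on PATH; install the CUDA toolkit to run this sketch" >&2
fi
```

Add -arch=sm_XX to the nvcc line if you want the register count for a specific GPU architecture rather than the toolkit's default.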