I'm trying to get information about register usage in my CUDA kernels using the NVCC option
--ptxas-options=-v. With __global__ functions everything works, but I'm having difficulties with __device__ functions, since the
ptxas info : Used N registers
line is missing from the output. I tried the __noinline__ qualifier, and I also tried keeping the device functions in a separate file from the calling __global__ function, since I thought NVCC was reporting the full register usage of the __global__ function, including the inlined __device__ functions, but nothing changes. The only way I can get register usage for the device functions is to define them as __global__.
Do you have any suggestions?
Thanks!
As I understand it, ptxas (the device assembler) only outputs a register count for code which it links. Standalone __device__ functions are not linked by the assembler, they are only compiled. Therefore, the assembler won't emit a register count for device functions. I don't believe there is a workaround for this.
However, it is still possible to get the register footprint of a __device__ function by dumping the ELF data from the assembler output using cuobjdump. You can do this as follows:
Here we have a separately compiled set of three __device__ functions in a device object file. Running cuobjdump on it will emit a lot of output, but in it you will get a register count for each function:
In the second line of the output for the device function dot(float, float) you can see the function uses 6 registers. This is the only way I am aware of to examine device function register footprints.
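
For reference, here is a minimal sketch of the kind of setup described above. It is not the original code: the file name devfuncs.cu and the functions madd and lerp are hypothetical, chosen only to give the compiler something to assign registers to. The commands in the header comment assume nvcc and cuobjdump from the CUDA toolkit are on your PATH; the register counts you see will depend on the toolkit version and target architecture.

```cuda
// devfuncs.cu (hypothetical file name): standalone __device__ functions only.
//
// Compile to a relocatable device object, then dump its ELF sections:
//   nvcc -dc -Xptxas -v devfuncs.cu -o devfuncs.o
//   cuobjdump -elf devfuncs.o
//
// In the ELF dump, look for the section belonging to each function's
// mangled name; the per-function register count is reported there.

// Hypothetical example: fused multiply-add, a * b + c.
__device__ __noinline__ float madd(float a, float b, float c)
{
    return fmaf(a, b, c);
}

// Hypothetical example: linear interpolation between x and y by t.
__device__ __noinline__ float lerp(float x, float y, float t)
{
    // x + t * (y - x), expressed through madd above
    return madd(t, y - x, x);
}
```

Newer toolkits also provide cuobjdump -res-usage, which should print a shorter per-function resource summary instead of the full ELF dump, if your version supports it.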