NVCC register usage report in __device__ function

2.4k Views Asked by At

I'm trying to get some information about register usage in my CUDA kernels using NVCC option
--ptxas-options=v and while with global functions everything is ok, I'm having some difficulties with the device ones since the

ptxas info : Used N registers

line is missing in the output. I tried to use the noinline keyword and to keep them in another file, with respect to the calling global function, since I thought that NVCC was reporting the full register usage of the global function including the called device ones after the inline but nothing changes. I can get the information about register usage of the device functions only defining them as global.

Do you have any suggestions?

Thanks!

2

There are 2 best solutions below

2
On

As I understand it, ptxas (the device assembler) only outputs a register count on code which it links. Standalone __device__ functions are not linked by the assembler, they are only compiled. Therefore, the assembler won't emit a register count value for device functions. I don't believe there is a workaround for this.

However, it is still possible to get the register footprint of a __device__ function by dumping the elf data from the assembler output using cuobjdump. You can do this as follows:

$ cat vdot.cu
__device__  __noinline__ float vdot(float v1, float v2) {
    return (v1 * v2);
}

__device__ __noinline__  float vdot(float2 v1, float2 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y);
}

__device__ __noinline__ float vdot(float4 v1, float4 v2) {
    return (v1.x * v2.x) + (v1.y * v2.y) + (v1.z * v2.z) + (v1.w * v2.w);
}

$ nvcc -std=c++11 -arch=sm_52 -dc -Xptxas="-v" vdot.cu
ptxas info    : 0 bytes gmem
ptxas info    : Function properties for cudaDeviceGetAttribute
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z4vdotff
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessor
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z4vdot6float4S_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaMalloc
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaGetDevice
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z4vdot6float2S_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaFuncGetAttributes
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

Here we have a separately compiled set of three __device__ functions in a device object file. Running cuobjdump on it will emit a lot of output, but in it you will get a register count for each function:

$ cuobjdump -elf ./vdot.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
compressed

<---Snipped--->


.text._Z4vdotff
bar = 0 reg = 6 lmem=0  smem=0
0xfec007f1  0x001fc000  0x00570003  0x5c980780  
0x00470000  0x5c980780  0x00370004  0x5c680000  
0xffe007ff  0x001f8000  0x0007000f  0xe3200000  
0xff87000f  0xe2400fff  0x00070f00  0x50b00000

In the second line of the output for the device function dot(float, float) you can see the function uses 6 registers. This is the only way I am aware of to examine device function register footprints.

0
On

I don't know when it was added but my CUDA 10 cuobjdump has the -res-usage flag which shows something like this:

$ cuobjdump -res-usage .../cuda_compile_1_generated_VisualOdometry.cu.o

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
identifier = /home/mad/automy-system/vision/src/VisualOdometry.cu

Resource usage:
 Common:
  GLOBAL:0 CONSTANT[3]:24
 Function _Z17vo_compute_systemPfS_P6float4S_jS0_S0_f:
  REG:39 STACK:32 SHARED:168 LOCAL:0 CONSTANT[0]:404 CONSTANT[2]:80 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function _Z13vo_pre_filterP6float4PfPjPK5uint2iijff:
  REG:16 STACK:0 SHARED:8 LOCAL:0 CONSTANT[0]:372 TEXTURE:0 SURFACE:0 SAMPLER:0