How to interpret the number shown in square brackets after the kernel name in nvprof output?


The nvprof documentation (section "GPU-Trace and API-Trace Modes") says: "The number shown in the square brackets after the kernel name correlates to the CUDA API that launched that kernel."

In my output below, the numbers shown in the square brackets after the kernel name are:

  • 94,
  • 105,
  • 2191,
  • 2198.

So what exactly is CUDA API [94] (and the others) in the NVIDIA CUDA Runtime API?


==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
133.81ms  135.78us                    -               -         -         -         -  409.60KB  3.0167GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.62ms  270.66us                    -               -         -         -         -  819.20KB  3.0267GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.90ms  3.7037ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [94]
138.71ms  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [105]
<...more output...>
1.24341s  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2191]
1.24711s  3.7046ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2198]
1.25089s  248.13us                    -               -         -         -         -  819.20KB  3.3015GB/s  GeForce GT 640M         1         2  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

Best answer:

The documentation might be clearer if it said:

The number shown in the square brackets after the kernel name correlates to the CUDA API call that launched that kernel.

If you run an application under nvprof with the --print-api-trace option, you get a sequential list of all the CUDA API calls issued by that application. If you number those calls in order (counting from 1 in the example below), the number associated with a particular kernel launch is the one shown in the square brackets in the --print-gpu-trace output.
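
One way to do that numbering mechanically is sketched below. The helper name number_api_trace is hypothetical (it is not part of nvprof), and the sketch assumes the trace is laid out as in this answer: banner lines starting with "==", a "Start  Duration  Name" header, then one API call per line.

```shell
# number_api_trace: hypothetical helper, not an nvprof feature.
# Reads --print-api-trace text on stdin and prefixes each API call line
# with its 1-based position in the trace.
number_api_trace() {
  awk '
    /^==/ || NF == 0 { next }                                    # skip nvprof banner lines and blank lines
    !seen && $1 == "Start" && $2 == "Duration" { seen = 1; next } # the header marks where the call list begins
    seen { sub(/^[ \t]+/, ""); printf "[%d] %s\n", ++n, $0 }      # number every line after the header
  '
}
```

A possible use is `nvprof --print-api-trace ./t1 2>&1 | number_api_trace | grep cudaLaunchKernel`. The `2>&1` is there on the assumption that nvprof writes its report to stderr. Anything the application itself prints after the header would also be counted, so this is only reliable for programs that print nothing, such as the example in this answer.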

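Here is a fully-worked example. Note that [105], [106], and [108] appear in both the api-trace output and the gpu-trace output. There is no [107] in the gpu-trace output because the 107th API call is cudaDeviceSynchronize, not a kernel launch: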

$ cat t1.cu
__global__ void k(){}

int main(){

  k<<<1,1>>>();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==7206== NVPROF is profiling process 7206, command: ./t1
==7206== Profiling application: ./t1
==7206== Profiling result:
   Start  Duration  Name
116.17ms  3.0990us  cuDeviceGetPCIBusId
130.20ms     800ns  cuDeviceGetCount
130.20ms     251ns  cuDeviceGetCount
130.41ms  1.0500us  cuDeviceGet
130.41ms     705ns  cuDeviceGetAttribute
130.42ms     539ns  cuDeviceGetAttribute
130.42ms     547ns  cuDeviceGetAttribute
130.46ms     525ns  cuDeviceGetCount
130.46ms     277ns  cuDeviceGet
130.46ms  59.680us  cuDeviceGetName
130.52ms  63.802us  cuDeviceTotalMem
130.59ms     497ns  cuDeviceGetAttribute
130.59ms     226ns  cuDeviceGetAttribute
130.59ms     282ns  cuDeviceGetAttribute
130.59ms     234ns  cuDeviceGetAttribute
130.59ms     229ns  cuDeviceGetAttribute
130.59ms  34.628us  cuDeviceGetAttribute
130.62ms     372ns  cuDeviceGetAttribute
130.63ms     220ns  cuDeviceGetAttribute
130.63ms     284ns  cuDeviceGetAttribute
130.63ms     237ns  cuDeviceGetAttribute
130.63ms     222ns  cuDeviceGetAttribute
130.63ms     231ns  cuDeviceGetAttribute
130.63ms     288ns  cuDeviceGetAttribute
130.63ms     219ns  cuDeviceGetAttribute
130.63ms  3.1870us  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     275ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     213ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     336ns  cuDeviceGetAttribute
130.64ms     210ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     210ns  cuDeviceGetAttribute
130.64ms     216ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     216ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     213ns  cuDeviceGetAttribute
130.65ms     212ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     210ns  cuDeviceGetAttribute
130.65ms     215ns  cuDeviceGetAttribute
130.65ms     212ns  cuDeviceGetAttribute
130.65ms  320.65us  cuDeviceGetAttribute
130.97ms     322ns  cuDeviceGetAttribute
130.97ms     206ns  cuDeviceGetAttribute
130.97ms     218ns  cuDeviceGetAttribute
130.97ms     212ns  cuDeviceGetAttribute
130.97ms     212ns  cuDeviceGetAttribute
130.98ms     226ns  cuDeviceGetAttribute
130.98ms     220ns  cuDeviceGetAttribute
130.98ms     212ns  cuDeviceGetAttribute
130.98ms     210ns  cuDeviceGetAttribute
130.98ms     206ns  cuDeviceGetAttribute
130.98ms     207ns  cuDeviceGetAttribute
130.98ms     209ns  cuDeviceGetAttribute
130.98ms     211ns  cuDeviceGetAttribute
130.98ms     208ns  cuDeviceGetAttribute
130.98ms     208ns  cuDeviceGetAttribute
130.98ms     229ns  cuDeviceGetAttribute
130.98ms     215ns  cuDeviceGetAttribute
130.98ms     216ns  cuDeviceGetAttribute
130.98ms     209ns  cuDeviceGetAttribute
130.98ms  316.59us  cuDeviceGetAttribute
131.30ms     266ns  cuDeviceGetAttribute
131.30ms     252ns  cuDeviceGetAttribute
131.30ms     212ns  cuDeviceGetAttribute
131.30ms     235ns  cuDeviceGetAttribute
131.30ms     209ns  cuDeviceGetAttribute
131.30ms     272ns  cuDeviceGetAttribute
131.30ms     207ns  cuDeviceGetAttribute
131.30ms     735ns  cuDeviceGetAttribute
131.30ms     254ns  cuDeviceGetAttribute
131.30ms     208ns  cuDeviceGetAttribute
131.30ms     208ns  cuDeviceGetAttribute
131.30ms     610ns  cuDeviceGetAttribute
131.31ms     273ns  cuDeviceGetAttribute
131.31ms     412ns  cuDeviceGetAttribute
131.31ms     216ns  cuDeviceGetAttribute
131.31ms     211ns  cuDeviceGetAttribute
131.31ms     205ns  cuDeviceGetAttribute
131.31ms  59.911ms  cudaLaunchKernel (k(void) [105])
191.23ms  11.222us  cudaLaunchKernel (k(void) [106])
191.24ms  5.7860us  cudaDeviceSynchronize
191.25ms  9.2890us  cudaLaunchKernel (k(void) [108])
191.26ms  5.1790us  cudaDeviceSynchronize
$ nvprof --print-gpu-trace ./t1
==7224== NVPROF is profiling process 7224, command: ./t1
==7224== Profiling application: ./t1
==7224== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
191.20ms  1.6000us              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [105]
191.22ms     896ns              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [106]
191.23ms     928ns              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [108]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$