How to measure bank conflicts per warp using NVIDIA Visual Profiler?

I am doing a detailed code analysis for which I want to measure the total number of bank conflicts per warp.

The nvvp documentation lists this metric, which was the only one I could find related to bank conflicts:

shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed

When I profile the metric using nvprof (or nvvp) I get a result like this:

Invocations            Metric Name                        Metric Description                Min         Max         Avg
Device "Tesla K20m (0)"
Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
301                    shared_replay_overhead             Shared Memory Replay Overhead    0.089730    0.089730    0.089730

I need to either use this value (0.089730) or devise some other method to arrive at the number of bank conflicts.

I understand that this value is the 'average' taken across all the warps that are executing. If I had to measure the total number of bank conflicts per warp, is there a way to do it using the nvprof results?

Possible approaches that came to my mind:

  • By plugging the shared_replay_overhead result into a formula that yields the number of bank conflicts. I am guessing it is something like shared_replay_overhead * total number of warps launched (I know the total number of warps launched in advance), but I can't figure out what the correct formula is.
  • By first detecting that it's a four-way bank conflict, eight-way bank conflict, etc, and then multiplying 4/8 by the number of times the shared memory operation takes place (how to measure that?).
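The "n-way" part of the second approach can be worked out offline from the access pattern. The sketch below is only a model, not a profiler measurement: it assumes 32 banks of 4-byte words (Kepler can also be configured for 8-byte banks, which this does not model) and that threads reading the same word are served by a broadcast and do not conflict. The helper names `conflict_degree` and `strided_access` are made up for illustration.

```python
from collections import defaultdict


def conflict_degree(byte_addresses, num_banks=32, bank_width=4):
    """Return n for an 'n-way bank conflict' of one warp-wide shared memory access.

    Assumed model: consecutive bank_width-byte words map to consecutive banks,
    and lanes touching the same word do not conflict with each other.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // bank_width
        words_per_bank[word % num_banks].add(word)
    # The access is serialized by the bank that must deliver the most distinct words.
    return max(len(words) for words in words_per_bank.values())


def strided_access(stride_words, warp_size=32, bank_width=4):
    """Byte addresses touched by a warp where lane i reads word i * stride_words."""
    return [lane * stride_words * bank_width for lane in range(warp_size)]


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 32):
        print(f"stride {stride:2d}: {conflict_degree(strided_access(stride))}-way")
```

Under this model an n-way conflict costs n - 1 extra passes (replays) for that one instruction, so the degree still has to be multiplied by how often the instruction executes per warp.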

This probably requires fairly good technical knowledge of the GPU architecture in addition to the nvprof results, which I don't think I have yet. For the record, my GPU is a Kepler device (SM 3.5).

Even if I can measure the number of bank conflicts per block instead of per warp, it will suffice. After that I can do the necessary calculations to get the value on a per-warp basis.
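As a rough sketch of the arithmetic involved: if the metric description quoted above is taken literally (replays per instruction executed), multiplying it by an executed-instruction count gives an estimated number of replays, which can then be averaged over the warps of a launch. Everything except 0.089730 is a placeholder here. The instruction count, grid size and block size are hypothetical, and the sketch assumes the instruction count is a warp-level count for the same kernel launch. Note also that a replay count is not exactly a conflict count, since one n-way conflict produces several replays.

```python
def estimate_shared_replays(shared_replay_overhead, inst_executed):
    """Estimated shared memory replays for one kernel launch.

    Assumes shared_replay_overhead = replays / instructions executed,
    as its description suggests.
    """
    return shared_replay_overhead * inst_executed


def average_per_warp(total, num_blocks, threads_per_block, warp_size=32):
    """Spread a per-launch total evenly over all warps of the launch."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    return total / (num_blocks * warps_per_block)


if __name__ == "__main__":
    overhead = 0.089730        # value reported by nvprof in the question
    inst_executed = 1_000_000  # hypothetical warp-level instruction count
    num_blocks = 200           # hypothetical grid size
    threads_per_block = 1024   # hypothetical block size (32 x 32 threads)

    replays = estimate_shared_replays(overhead, inst_executed)
    print("estimated replays per launch:", replays)
    print("average replays per warp:",
          average_per_warp(replays, num_blocks, threads_per_block))
```

This only gives an average per warp. It cannot show how the replays are distributed across individual warps or blocks.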

There is 1 best solution below

I think you should look at the CUPTI (CUDA Profiling Tools Interface) documentation. There are also a few examples shipped with your CUDA SDK inside the /extras/CUPTI directory. I'm not very familiar with this library, but it looks like you can write your own profiler and measure what you want, or collect the metrics you're interested in. It will be low level, but that is what you need to get a precise answer.