CUDA: How to detect shared memory bank conflicts on a device with compute capability >= 7.2?


On devices with compute capability <= 7.2, I always use

nvprof --events shared_st_bank_conflict

but when I run it on an RTX 2080 Ti with CUDA 10, it returns

Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability greater than 7.2

So how can I detect whether there are shared memory bank conflicts on this device?

I've installed Nvidia Nsight Systems and Nsight Compute, but can find no such profiling report...

Thanks.


There are 3 best solutions below


As others have pointed out, nvprof has been replaced by Nsight Compute; check their metrics equivalence mapping.

In particular, shared_efficiency gets mapped to smsp__sass_average_data_bytes_per_wavefront_mem_shared (cryptic!).
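If you need to hunt for other metric names yourself, the Nsight Compute CLI can list every metric it knows about; the grep filter below is just a convenience (a sketch, assuming ncu is on your PATH):

ncu --query-metrics | grep -i shared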

I have a feeling that more metrics suffered during this transition. Jokes aside, let's demonstrate how to use it. To this end, take a kernel that intentionally causes bank conflicts:

__global__ void kernel(int offset)
{
    __shared__ unsigned int sharedMem[4096];

    int threadId = threadIdx.x;

    // init shared memory
    if (threadId == 0)
    {
        for (int i = 0; i < 4096; i++) sharedMem[i] = 0;
    }
    __syncthreads();

    // repeatedly read and write to shared memory
    unsigned int index = (threadId * offset) % 4096; // start in bounds even for large offsets
    for (int i = 0; i < 10000; i++)
    {
        sharedMem[index] += index * i;
        index += 32;
        index %= 4096;
    }
}

This kernel should cause conflicts unless the offset is relatively prime to 32. Call the kernel:

#include <cstdlib> // for atoi

int main(int argc, char* argv[])
{
    int offset = atoi(argv[1]);

    // set the shared memory bank width to 4 bytes, just in case
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);

    kernel<<<1, 32>>>(offset);
    cudaDeviceSynchronize();
}

Compile with nvcc bank_conflicts.cu -o bank_conflicts, and we are ready to demonstrate conflict detection as follows:

ncu --metrics smsp__sass_average_data_bytes_per_wavefront_mem_shared bank_conflicts 1 # ~ 99% efficiency = no conflicts :-) 
ncu --metrics smsp__sass_average_data_bytes_per_wavefront_mem_shared bank_conflicts 22 # ~ 49% efficiency = 2-way conflicts :-(
ncu --metrics smsp__sass_average_data_bytes_per_wavefront_mem_shared bank_conflicts 24 # ~ 12.5% efficiency = 8-way conflicts

As a bonus, let's establish the following fact: with k = GCD(offset, 32), every bank that gets accessed is accessed by exactly k threads, hence the reported efficiency equals 1/k. Why is that? A single int takes 32 bits = 4 bytes and thus fits a single bank (the bank width is 4 bytes); thread x therefore requests bank number bank = (x * offset) % 32. The image of this map is the set of multiples of k, and each of those 32/k banks is requested by exactly k of the 32 threads (x * offset ≡ y * offset (mod 32) exactly when x ≡ y (mod 32/k), elementary number theory), so every warp access splits into k-way conflicts and the efficiency is 1/k. :-)
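If you want to sanity-check that arithmetic without touching the profiler, here is a minimal host-side sketch (plain C++17, compiled with e.g. g++ -std=c++17) that tabulates which bank each of the 32 threads of a warp requests for a given offset and reports the worst conflict degree; the offsets tested are the same ones used above:

#include <cstdio>
#include <numeric> // std::gcd

int main()
{
    for (int offset : {1, 22, 24})
    {
        int hits[32] = {0}; // how many threads of the warp hit each bank
        for (int thread = 0; thread < 32; ++thread)
            ++hits[(thread * offset) % 32]; // bank = (threadId * offset) % 32

        int worst = 0;
        for (int h : hits)
            if (h > worst) worst = h;

        std::printf("offset %2d: gcd(offset,32) = %2d, worst conflict degree = %d, expected efficiency ~ %.1f%%\n",
                    offset, std::gcd(offset, 32), worst, 100.0 / worst);
    }
}

For offsets 1, 22 and 24 this prints conflict degrees 1, 2 and 8, matching the ~99%, ~49% and ~12.5% efficiencies measured above.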


It seems this is a known problem, and it is addressed in this post on the NVIDIA forums. Apparently it should be supported using one of the Nsight tools (either the CLI or the UI).


You can use --metrics:

Either

nv-nsight-cu-cli --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum

for conflicts when reading (loading) from shared memory, or

nv-nsight-cu-cli --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum

for conflicts when writing (storing) to shared memory.
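If you want both counters in a single run, the newer Nsight Compute CLI (ncu, the successor name for nv-nsight-cu-cli) accepts a comma-separated metric list; the ./bank_conflicts 24 target below is just the example program from the first answer and could be any application:

ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum ./bank_conflicts 24

A nonzero .sum on either metric indicates bank conflicts on shared memory loads or stores, respectively.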