CUDA: Erroneous lmem statistics displayed for sm_20?

Asked by Ashwin Nanjappa on 24 February 2011 at 06:52 (812 views)

A CUDA kernel compiled with the option --ptxas-options=-v seems to display erroneous lmem (local memory) statistics when the sm_20 GPU architecture is specified. The same kernel gives meaningful lmem statistics with the sm_10 / sm_11 / sm_12 / sm_13 architectures.

Can someone clarify whether the sm_20 lmem statistics need to be read differently or whether they are plain wrong?

Here is the kernel:

__global__ void fooKernel( int* dResult )
{
        const int num = 1000;
        int val[num];

        for ( int i = 0; i < num; ++i )
                val[i] = i * i;

        int result = 0;

        for ( int i = 0; i < num; ++i )
                result += val[i];

        *dResult = result;

        return;
}

--ptxas-options=-v and sm_20 report:

1>ptxas info    : Compiling entry function '_Z9fooKernelPi' for 'sm_20'
1>ptxas info    : Used 5 registers, 4+0 bytes lmem, 36 bytes cmem[0]

--ptxas-options=-v and sm_10 / sm_11 / sm_12 / sm_13 report:

1>ptxas info    : Compiling entry function '_Z9fooKernelPi' for 'sm_10'
1>ptxas info    : Used 3 registers, 4000+0 bytes lmem, 4+16 bytes smem, 4 bytes cmem[1]

sm_20 reports 4 bytes of lmem, which does not seem possible given the 1000-element int array (4 x 1000 bytes) used in the kernel. The older GPU architectures report the expected 4000 bytes of lmem.

This was tried with CUDA 3.2. I have referred to the Printing Code Generation Statistics section of the NVCC manual (v3.2), but it does not help explain this anomaly.

There is 1 best solution below

LumpN On 24 February 2011 at 12:39

The compiler is correct. Thanks to optimization, the array never needs to be stored. Your kernel essentially computes result += i * i in a single loop, without ever writing the temporaries to val.
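The reduction described above can be sketched by hand. This is only an illustration, not the code the compiler actually emits: fooKernelReduced and the host helper runOnDevice are hypothetical names that do not appear in the question. The sketch places the original kernel next to a version with the array removed, and the helper launches either one with a single thread so the two results can be compared.

```cuda
#include <cuda_runtime.h>

// Original kernel from the question: fills a 1000-element array, then sums it.
__global__ void fooKernel(int* dResult)
{
    const int num = 1000;
    int val[num];

    for (int i = 0; i < num; ++i)
        val[i] = i * i;

    int result = 0;

    for (int i = 0; i < num; ++i)
        result += val[i];

    *dResult = result;
}

// Hypothetical hand-reduced version (an assumption for illustration):
// the two loops are fused, so no per-thread array is needed at all.
__global__ void fooKernelReduced(int* dResult)
{
    const int num = 1000;
    int result = 0;

    for (int i = 0; i < num; ++i)
        result += i * i;

    *dResult = result;
}

// Hypothetical host helper: runs one of the two kernels with a single
// thread and returns the value it wrote, or -1 if a CUDA call fails.
int runOnDevice(bool useReduced)
{
    int* dResult = NULL;
    int hResult = -1;

    if (cudaMalloc((void**)&dResult, sizeof(int)) != cudaSuccess)
        return -1;

    if (useReduced)
        fooKernelReduced<<<1, 1>>>(dResult);
    else
        fooKernel<<<1, 1>>>(dResult);

    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(&hResult, dResult, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        hResult = -1;

    cudaFree(dResult);
    return hResult;
}
```

Calling runOnDevice(false) and runOnDevice(true) from a main function should return the same sum. Compiling the file with --ptxas-options=-v, as in the question, lets you compare the lmem figure reported for each kernel. The exact numbers may depend on the toolkit version and target architecture.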

Looking at the generated PTX code won't show any difference between sm_10 and sm_20. Decompiling the generated cubins with decuda will reveal the optimization.

BTW: Try to avoid local memory! It resides in device memory, so it is as slow as global memory.
