Effect of enabling the L1/texture cache with nvcc (CUDA 10.2) on Jetson Nano (Maxwell architecture)

40 Views Asked by rooody At 08 January 2024 at 16:21

I am trying to understand how texture memory compares with global memory for bilinear interpolation.

I implemented a simple bilinear interpolation kernel using global memory as follows:

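// alpha, beta, uint2float, typeImg and typeOutImg are defined elsewhere (not shown).
__global__ void GlobalBilinear(typeImg * inPtr, typeOutImg* outPtr, int width, int height)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ float val[1024];

    if(ix < width && iy < height)
    {
        // Weighted sum of the 2x2 neighbourhood starting at (ix, iy).
        // The ix + 1 and iy + 1 reads are not bounds-checked at the last column/row,
        // and the result is stored in shared memory only (outPtr is not written).
        val[threadIdx.x + blockDim.x*threadIdx.y]
                =   (1.0f - alpha) * (1.0f - beta) * uint2float(inPtr[ix + width * iy]) +
                    alpha * (1.0f - beta) * uint2float(inPtr[ix + 1 + width * iy])      +
                    (1.0f - alpha) * (beta)*uint2float(inPtr[ix + width * (iy + 1)])     +
                    alpha * (beta)*uint2float(inPtr[ix + 1 + width * (iy + 1)]);
    }
}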
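
For checking a kernel like this on the host, the same weighted sum can be written as a plain C++ reference. This is only a sketch under assumptions the question does not state: alpha and beta are taken to be fixed fractional offsets in [0, 1], the image is 8-bit and row-major, and neighbours past the last row or column are clamped to the edge (the kernel above does not clamp). The name bilinearRef is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference for the weighted sum used in GlobalBilinear.
// Assumption: neighbours past the last row/column are clamped to the edge.
float bilinearRef(const std::vector<std::uint8_t>& img, int width, int height,
                  int ix, int iy, float alpha, float beta)
{
    const int x1 = std::min(ix + 1, width - 1);   // right neighbour, clamped
    const int y1 = std::min(iy + 1, height - 1);  // lower neighbour, clamped

    const float p00 = static_cast<float>(img[ix + width * iy]);
    const float p10 = static_cast<float>(img[x1 + width * iy]);
    const float p01 = static_cast<float>(img[ix + width * y1]);
    const float p11 = static_cast<float>(img[x1 + width * y1]);

    return (1.0f - alpha) * (1.0f - beta) * p00 +
           alpha * (1.0f - beta) * p10 +
           (1.0f - alpha) * beta * p01 +
           alpha * beta * p11;
}
```

Comparing a few pixels of the GPU output against such a reference is one way to confirm that the global-memory and texture versions compute the same thing before comparing their timings.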

and one using texture memory as follows:

__global__ void MemoryBilinear2DTexture(cudaTextureObject_t texObj, typeOutImg* outPtr, int width, int height)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ float val[1024];
    if(ix < width && iy < height)
    {
        val[threadIdx.x + blockDim.x*threadIdx.y] = (tex2D<typeOutImg>(texObj, ix + 1.0f, iy + 1.0f));
    }
}
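
For reference, with unnormalized coordinates and linear filtering (cudaFilterModeLinear), the CUDA Programming Guide describes a 2D fetch at (x, y) as a bilinear blend of the four texels around (x - 0.5, y - 0.5). The host-side sketch below emulates that rule to show which texels and weights a fetch at (ix + 1.0f, iy + 1.0f) would use. The question does not show the texture descriptor, so the settings here are assumptions: clamp addressing, no normalization of the 8-bit texels, and full float precision for the weights (the hardware uses lower-precision fixed-point weights). The name emulateTex2DLinear is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical host-side emulation of a linearly filtered 2D texture fetch
// with unnormalized coordinates. Assumptions: clamp addressing, float-precision
// weights, texel values returned as unnormalized floats.
float emulateTex2DLinear(const std::vector<std::uint8_t>& img, int width, int height,
                         float x, float y)
{
    const float xB = x - 0.5f;
    const float yB = y - 0.5f;
    const float fx = std::floor(xB);
    const float fy = std::floor(yB);
    const float alpha = xB - fx;  // fractional part along x
    const float beta  = yB - fy;  // fractional part along y

    // Clamp texel indices to the image (clamp addressing mode, assumed).
    auto clampIdx = [](int v, int n) { return std::max(0, std::min(v, n - 1)); };
    const int i0 = clampIdx(static_cast<int>(fx), width);
    const int i1 = clampIdx(static_cast<int>(fx) + 1, width);
    const int j0 = clampIdx(static_cast<int>(fy), height);
    const int j1 = clampIdx(static_cast<int>(fy) + 1, height);

    auto texel = [&](int i, int j) { return static_cast<float>(img[i + width * j]); };

    return (1.0f - alpha) * (1.0f - beta) * texel(i0, j0) +
           alpha * (1.0f - beta) * texel(i1, j0) +
           (1.0f - alpha) * beta * texel(i0, j1) +
           alpha * beta * texel(i1, j1);
}
```

Under these assumptions, a fetch at (ix + 1.0f, iy + 1.0f) gives xB = ix + 0.5, so the four texels from (ix, iy) to (ix + 1, iy + 1) are each weighted 0.25. That matches the global-memory kernel only when alpha = beta = 0.5.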

A simple timing comparison over 50 runs reports:

Global Kernel Time Bilinear Interpolation : 37.87073 ms

Texture Bilinear Interpolation Time : 25.75578 ms

These results were obtained with the default compiler flags; for some reason, my compiler's default is to disable the L1 cache (-Xptxas -dlcm=cg).

When I manually enable the L1 cache (-Xptxas -dlcm=ca), I get the following results:

Global Kernel Time Bilinear Interpolation : 48.90469 ms

Texture Bilinear Interpolation Time : 33.65562 ms

Why does enabling the L1 cache affect performance like this?
And why does the nvcc compiler disable the L1 cache for global memory accesses by default?

P.S.: The reported results were obtained with an 8-bit image of 1000x1000 pixels.

nvprof shows the following results when the L1 cache is enabled:

**Kernel: GlobalBilinear(void*, void*, int, long)**
global_hit_rate           Global Hit Rate in unified l1/tex    72.48%    72.48%    72.48%
tex_cache_transactions    Unified Cache Transactions           504000    504000    504000
**Kernel: MemoryBilinear2DTexture(__int64, void*, int, long)**
global_hit_rate           Global Hit Rate in unified l1/tex     0.00%     0.00%     0.00%
tex_cache_transactions    Unified Cache Transactions           250000    250000    250000
