Effect of enabling the L1/texture cache with nvcc (CUDA 10.2) on Jetson Nano (Maxwell architecture)

To understand how texture memory compares with global memory for bilinear interpolation, I implemented a simple bilinear interpolation kernel using global memory as follows:

__global__ void GlobalBilinear(typeImg * inPtr, typeOutImg* outPtr, int width, int height)
{

    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ float val[1024];

    if(ix < width && iy < height)
    {
        val[threadIdx.x + blockDim.x*threadIdx.y] 
                =   (1.0f - alpha) * (1.0f - beta) * uint2float(inPtr[ix + width * iy]) + 
                    alpha * (1.0f - beta) * uint2float(inPtr[ix + 1 + width * iy])      +
                    (1.0f - alpha) * (beta)*uint2float(inPtr[ix + width * (iy + 1)])     +
                    alpha * (beta)*uint2float(inPtr[ix + 1 + width * (iy + 1)]);
    }
}

and using texture memory as follows:

__global__ void MemoryBilinear2DTexture(cudaTextureObject_t texObj, typeOutImg* outPtr, int width, int height)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ float val[1024];
    if(ix < width && iy < height)
    {
        val[threadIdx.x + blockDim.x*threadIdx.y] = (tex2D<typeOutImg>(texObj, ix + 1.0f, iy + 1.0f));
    }
}
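The texture setup is not shown, so the following is a sketch under assumptions: the texture object uses linear filtering with unnormalized coordinates. In that mode the CUDA Programming Guide describes the fetch at coordinate x as blending texels i and i + 1, with i = floor(x - 0.5) and weight frac(x - 0.5) on texel i + 1. The hypothetical helper `linearFetchWeight` below reproduces that arithmetic on the host, to show which texels a coordinate such as `ix + 1.0f` would touch.

```cpp
#include <cmath>

// Hypothetical helper: which texel pair a linearly filtered fetch at
// unnormalized coordinate x blends, following the xB = x - 0.5 rule.
struct TexelWeight {
    int   index;  // left (or top) texel index i
    float frac;   // weight applied to texel i + 1
};

TexelWeight linearFetchWeight(float x)
{
    const float xB = x - 0.5f;       // shift so texel centres sit on integers
    const float i  = std::floor(xB); // index of the lower texel
    return { static_cast<int>(i), xB - i };
}
```

Under these assumptions, `linearFetchWeight(ix + 1.0f)` gives index `ix` with a weight of 0.5, so the texture kernel would sample midway between texels `ix` and `ix + 1` (and likewise in y), while `ix + 0.5f` would return texel `ix` unblended. The hardware stores these weights in fixed point with 8 fractional bits, so results can differ slightly from a full-precision float computation.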

A simple timing comparison over 50 runs reports:

Global Kernel Time Bilinear Interpolation : 37.87073 ms

Texture Bilinear Interpolation Time : 25.75578 ms

These results were obtained with the default compiler flags; for some reason, my compiler disables the L1 cache for global loads by default (-Xptxas -dlcm=cg).

When I manually enable the L1 cache (-Xptxas -dlcm=ca), I get the following results:

Global Kernel Time Bilinear Interpolation : 48.90469 ms

Texture Bilinear Interpolation Time : 33.65562 ms

  1. Why does enabling the L1 cache affect performance like this?
  2. Why does nvcc disable the L1 cache for global memory accesses by default?

P.S.: The reported results are for an 8-bit 1000x1000 image.

nvprof shows the following results when L1 cache is enabled:

**Kernel: GlobalBilinear(void*, void*, int, long)**
global_hit_rate         Global Hit Rate in unified l1/tex      72.48%      72.48%      72.48%
tex_cache_transactions  Unified Cache Transactions      504000      504000      504000
**Kernel: MemoryBilinear2DTexture(__int64, void*, int, long)**
global_hit_rate         Global Hit Rate in unified l1/tex       0.00%       0.00%       0.00%
tex_cache_transactions  Unified Cache Transactions      250000      250000      250000
