To understand how texture memory compares with global memory for bilinear interpolation, I implemented a simple bilinear interpolation kernel using global memory as follows:

    __global__ void GlobalBilinear(typeImg * inPtr, typeOutImg* outPtr, int width, int height)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        __shared__ float val[1024];
        if(ix < width && iy < height)
        {
            val[threadIdx.x + blockDim.x*threadIdx.y]
                = (1.0f - alpha) * (1.0f - beta) * uint2float(inPtr[ix + width * iy]) +
                  alpha * (1.0f - beta) * uint2float(inPtr[ix + 1 + width * iy]) +
                  (1.0f - alpha) * (beta)*uint2float(inPtr[ix + width * (iy + 1)]) +
                  alpha * (beta)*uint2float(inPtr[ix + 1 + width * (iy + 1)]);
        }
    }

and one using texture memory as follows:

    __global__ void MemoryBilinear2DTexture(cudaTextureObject_t texObj, typeOutImg* outPtr, int width, int height)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        __shared__ float val[1024];
        if(ix < width && iy < height)
        {
            val[threadIdx.x + blockDim.x*threadIdx.y] = (tex2D<typeOutImg>(texObj, ix + 1.0f, iy + 1.0f));
        }
    }

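For readers who want to reproduce the comparison, here is a minimal sketch of how the two kernels relate. It is not the code that was benchmarked. It assumes `typeImg` is `unsigned char` (the image is 8-bit) and `typeOutImg` is `float`, that the texture object uses `cudaFilterModeLinear`, `cudaReadModeNormalizedFloat`, clamp addressing and unnormalized coordinates, and that `alpha = beta = 0.5f`; none of these settings are shown in the question. The names `GlobalBilinearSketch`, `TextureBilinearSketch` and `maxAbsDiffGlobalVsTexture` are hypothetical. With linear filtering the hardware subtracts 0.5 from each coordinate before splitting it into an integer texel index and a fractional weight, so fetching at `ix + 1.0f` blends texels `ix` and `ix + 1` with weight 0.5. Unlike the kernels above, the sketch clamps the `+ 1` neighbour at the image border and writes its result to global memory so the outputs can be compared.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumed stand-ins for the question's typeImg / typeOutImg.
typedef unsigned char typeImg;
typedef float typeOutImg;

// Global-memory bilinear blend of the 2x2 neighbourhood starting at (ix, iy).
__global__ void GlobalBilinearSketch(const typeImg* in, typeOutImg* out,
                                     int width, int height, float alpha, float beta)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= width || iy >= height) return;
    int ix1 = min(ix + 1, width - 1);    // clamp so the last column stays in bounds
    int iy1 = min(iy + 1, height - 1);   // clamp so the last row stays in bounds
    const float s = 1.0f / 255.0f;       // same scaling as cudaReadModeNormalizedFloat
    float v00 = in[ix  + width * iy ] * s;
    float v10 = in[ix1 + width * iy ] * s;
    float v01 = in[ix  + width * iy1] * s;
    float v11 = in[ix1 + width * iy1] * s;
    out[ix + width * iy] = (1.0f - alpha) * (1.0f - beta) * v00 + alpha * (1.0f - beta) * v10
                         + (1.0f - alpha) * beta * v01 + alpha * beta * v11;
}

// Hardware-filtered fetch at the same sample position as the question's kernel.
__global__ void TextureBilinearSketch(cudaTextureObject_t tex, typeOutImg* out,
                                      int width, int height)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= width || iy >= height) return;
    out[ix + width * iy] = tex2D<float>(tex, ix + 1.0f, iy + 1.0f);
}

// Runs both kernels on a synthetic image and returns the largest absolute difference.
float maxAbsDiffGlobalVsTexture(int W, int H)
{
    std::vector<typeImg> h(W * H);
    for (int i = 0; i < W * H; ++i) h[i] = (typeImg)((i * 7) % 256);

    typeImg* dIn = nullptr;
    typeOutImg *dG = nullptr, *dT = nullptr;
    cudaMalloc(&dIn, W * H * sizeof(typeImg));
    cudaMalloc(&dG, W * H * sizeof(typeOutImg));
    cudaMalloc(&dT, W * H * sizeof(typeOutImg));
    cudaMemcpy(dIn, h.data(), W * H * sizeof(typeImg), cudaMemcpyHostToDevice);

    // Copy the same image into a CUDA array and bind it to a texture object.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &desc, W, H);
    cudaMemcpy2DToArray(arr, 0, 0, h.data(), W * sizeof(typeImg),
                        W * sizeof(typeImg), H, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;        // hardware bilinear filtering
    td.readMode = cudaReadModeNormalizedFloat;   // 8-bit texels returned as [0, 1] floats
    td.normalizedCoords = 0;                     // coordinates are in texel units
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    GlobalBilinearSketch<<<grid, block>>>(dIn, dG, W, H, 0.5f, 0.5f);
    TextureBilinearSketch<<<grid, block>>>(tex, dT, W, H);

    std::vector<float> g(W * H), t(W * H);
    cudaMemcpy(g.data(), dG, W * H * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(t.data(), dT, W * H * sizeof(float), cudaMemcpyDeviceToHost);

    float maxDiff = 0.0f;
    for (int i = 0; i < W * H; ++i) maxDiff = fmaxf(maxDiff, fabsf(g[i] - t[i]));

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dIn); cudaFree(dG); cudaFree(dT);
    return maxDiff;
}

int main()
{
    printf("max |global - texture| = %g\n", maxAbsDiffGlobalVsTexture(64, 48));
    return 0;
}
```

The texture unit stores its interpolation weights in low-precision fixed point, so a small nonzero difference between the two outputs is expected rather than an exact match.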
A simple timing comparison over 50 runs reports:
Global Kernel Time Bilinear Interpolation : 37.87073 ms
Texture Bilinear Interpolation Time : 25.75578 ms
These results were obtained with the default compiler flags; for some reason, on my setup the default disables L1 caching of global loads (equivalent to -Xptxas -dlcm=cg).
When I manually enable the L1 cache (-Xptxas -dlcm=ca), I get the following results:
Global Kernel Time Bilinear Interpolation : 48.90469 ms
Texture Bilinear Interpolation Time : 33.65562 ms
- Why does enabling the L1 cache affect performance like this?
- Why does nvcc disable the L1 cache for global memory accesses by default?
P.S.: The reported results were obtained with an 8-bit image of 1000x1000 pixels.
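The question does not say how the 50 runs were timed, or whether the reported figures are totals or per-run averages. One common approach is to bracket the launches with CUDA events, as in the sketch below. `DummyKernel` and `timeKernelMs` are hypothetical placeholders; the kernel under test would take the place of `DummyKernel`, and the buffer size only mirrors the 1000x1000 image.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder; substitute the kernel being measured.
__global__ void DummyKernel(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Returns the total elapsed time in milliseconds for `runs` launches,
// measured after one warm-up launch that is excluded from the timing.
float timeKernelMs(int runs)
{
    const int n = 1000 * 1000;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    dim3 block(256), grid((n + 255) / 256);

    DummyKernel<<<grid, block>>>(d, n);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        DummyKernel<<<grid, block>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);           // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main()
{
    printf("50 runs: %f ms total\n", timeKernelMs(50));
    return 0;
}
```

Building the same source once with `-Xptxas -dlcm=cg` and once with `-Xptxas -dlcm=ca` gives the two configurations compared above.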
nvprof shows the following results when the L1 cache is enabled:
**Kernel: GlobalBilinear(void*, void*, int, long)**
global_hit_rate Global Hit Rate in unified l1/tex 72.48% 72.48% 72.48%
tex_cache_transactions Unified Cache Transactions 504000 504000 504000
**Kernel: MemoryBilinear2DTexture(__int64, void*, int, long)**
global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
tex_cache_transactions Unified Cache Transactions 250000 250000 250000