I am writing CUDA C++ code for image filtering. CUDA splits the image data into blocks for parallel processing. For regular pixel-wise processing this is of course fast. However, in image filtering each pixel must be convolved with the filter mask (filter kernel), which requires its neighboring pixels.
Now, for the input pixels that lie on the border of a CUDA block, the adjacent pixels belong to the neighboring block, and fetching them requires communication between different blocks, which makes the processing speed drop drastically!
As I understand it, one way to optimize this case is to use shared memory and keep all the pixels needed to process the block in a __shared__ array.
I have therefore considered a padded __shared__ array that also holds the adjacent pixels of each block, and to carry on with the filtering from there.
Here is a piece of the code I wrote:
/// in the header:
#define MASK_WIDTH 5
#define TILE_SIZE 8
/// ...
__global__ void local_filt
(
const unsigned char* inputImage,
unsigned char* outputImage,
const int * filterKernel,
int height,
int width
)
{
__shared__ unsigned char tile[TILE_SIZE + MASK_WIDTH - 1][TILE_SIZE + MASK_WIDTH - 1];
int tx = threadIdx.x;
int ty = threadIdx.y;
int bx = blockIdx.x * TILE_SIZE;
int by = blockIdx.y * TILE_SIZE;
int row = by + ty;
int col = bx + tx;
int cx = MASK_WIDTH / 2;
int cy = MASK_WIDTH / 2;
if (row < height && col < width) {
tile[ty + cy][tx + cx] = inputImage[row * width + col];
if (ty < cy)
{
///----------------------------
// Some code to fill the border pixels of the tile
///----------------------------
}
}
__syncthreads(); /// Barrier AFTER the tile is filled, so every thread sees the complete tile
if (row < height && col < width)
{
unsigned char tmp_ext = 0; /// For max val
for (int i = 0; i < MASK_WIDTH; ++i)
{
for (int j = 0; j < MASK_WIDTH; ++j)
{
if (tmp_ext < tile[ty + i][tx + j])
tmp_ext = tile[ty + i][tx + j] * filterKernel[i * MASK_WIDTH + j]; /// filterKernel is a flat 1D array
}
}
outputImage[row * width + col] = tmp_ext;
}
}
For the current block's pixels, which get copied into tile[ty + cy][tx + cx], there is no issue, but whichever way I try to load the adjacent block's pixels into the border of the tile, many of those tile cells remain unassigned.
(In this part of the code:
if (ty < cy)
{
// Some code to fill the border pixels of the tile
}
I have tried many if-else combinations to fill those specific pixels of the tile, but the output image is corrupted in those border areas.)
To have a better view of the problem, take a look at these images:
The question here is: how do I properly fill those border pixels (the green pixels in the second image) with the pixels of the adjacent blocks to make the filtering faster?
Please note that I used the above image as a simple sketch; the actual block size I have in the kernel is 32x32 and the filter size is 5x5.
I am not even sure whether this method is reasonable, so please mention other solutions if there are better ones.
As mentioned above, I had some if-else branches like the one below to fill the tile's border:
if (ty < cy + 1 && ty > 0)
{
if (row - 2 * ty > -1) // For the first block row, which has no adjacent block above
tile[cy - ty][tx + cx] = inputImage[(row - 2 * ty) * width + col]; // mirrored! We could use inputImage[0 * width + col] for replicate instead.
else
tile[cy - ty][tx + cx] = inputImage[ty * width + col];
}
I have repeated similar if-else blocks to cover the top/bottom/left/right halos, but with no success.


If we look at the read-only neighbor data as a single chunk:
From the images you shared, it looks like a 12x12 tile. This requires 144 active threads during the initialization of the __shared__ neighbor data, or several repetitions with fewer threads. The reads are most efficient if each line is loaded with the minimal number of memory accesses; this means each line should be loaded by a multiple of 12 threads, and the mapping can be computed easily.
Here, globalIndexOfFirstElement() is just the index of the top-left corner of the current tile, so it requires the x and y coordinates to be converted (flattened) to 1D if it is a 1D kernel. The trick is to divide the image into tiles with either a flattened kernel or a 2D kernel, compute all of a tile's neighbors in a single loop, and stay away from overflowing the image dimensions. For example, tiles on the border of the image should not read from outside it, as that would fetch wrong elements for the intermediate parts and overflow at the corners. To overcome this, you can compute the border tiles separately with a second, slower kernel containing multiple if-else branches.
When two adjacent tiles are computed, their overlapping pixels are loaded from the L1/L2 cache, depending on their position and the state of the cache. The rest of the data still comes from shared memory during computation, so there shouldn't be any unnecessary global memory accesses.