Use shared memory for neighboring array elements?


I'd like to process an image with CUDA. Each pixel's new value is calculated from the two neighboring pixels in the same row. Would it make sense to use __shared__ memory for the pixel values, given that each value is used only twice? Wouldn't tiles also be the wrong approach, since they don't suit the problem structure? My current plan is to run one thread per pixel and have each thread load its neighboring pixel values directly.


BEST ANSWER

All currently supported CUDA architectures have caches.
From compute capability 3.5 onward these are particularly efficient for read-only data: read-write global data is cached only in L2, while the L1/texture path is reserved for read-only data. If you mark the pointer to the input data as const __restrict__, the compiler will most likely load it through this read-only (texture) cache. You can also force that explicitly with the __ldg() builtin.
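A minimal sketch of what this looks like in a kernel. The names (blur1d, width) and the averaging formula are illustrative assumptions, not from the question:

```cuda
// One thread per pixel; the read-only (texture) cache handles the reuse
// of neighboring values, so no explicit shared memory is needed.
__global__ void blur1d(const float* __restrict__ in,
                       float* __restrict__ out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x <= 0 || x >= width - 1) return;   // skip the border pixels

    // const __restrict__ already hints the compiler; __ldg() forces the
    // load through the read-only cache on compute capability >= 3.5.
    float left  = __ldg(&in[x - 1]);
    float right = __ldg(&in[x + 1]);
    out[x] = 0.5f * (left + right);
}
```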

While it is possible to explicitly manage the reuse of data from neighboring pixels via shared memory, you will probably find this to provide no benefit over just relying on the cache.

Of course, whether or not you use shared memory, you want to maximize the block size in x-direction and use a blockDim.y of 1 for optimal access locality.
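A possible launch configuration along those lines; the kernel name and variables are placeholders, and the block width of 256 is just one reasonable choice:

```cuda
// Wide blocks along x, blockDim.y == 1, one grid row per image row:
// consecutive threads then read consecutive pixels of a row, which is
// both coalesced and cache-friendly.
dim3 block(256, 1);
dim3 grid((width + block.x - 1) / block.x, height);
myKernel<<<grid, block>>>(d_in, d_out, width, height);
```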

SECOND ANSWER

Combine shared memory with coalesced memory accesses. All you need to do is ensure that the image is stored row-wise. Each block then processes a chunk of the linear array. Because of data reuse (every pixel except the first and last takes part in the computation three times), it is beneficial to copy the values of all pixels the block will process into shared memory at the beginning of the kernel.
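A sketch of this shared-memory variant, assuming a one-dimensional stencil over a row-wise image and a fixed block width BLOCK (both assumptions, not from the answer). Each block stages its pixels plus a one-pixel halo on each side:

```cuda
#define BLOCK 256   // assumed compile-time block width

__global__ void blur1dShared(const float* __restrict__ in,
                             float* out, int width)
{
    __shared__ float tile[BLOCK + 2];          // +2 for the left/right halo

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;                   // offset past the left halo

    if (x < width) tile[t] = in[x];            // coalesced main load
    if (threadIdx.x == 0 && x > 0)
        tile[0] = in[x - 1];                   // left halo element
    if (threadIdx.x == blockDim.x - 1 && x < width - 1)
        tile[BLOCK + 1] = in[x + 1];           // right halo element
    __syncthreads();                           // tile fully populated

    if (x > 0 && x < width - 1)
        out[x] = 0.5f * (tile[t - 1] + tile[t + 1]);
}
```

Whether this beats the cache-only version should be measured; on recent architectures the read-only cache often makes the explicit staging unnecessary, as the accepted answer notes.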