I need to find information about how the Unified Shader Array accesses GPU memory, so that I have an idea of how to use it effectively. The architecture diagram of my graphics card doesn't show it clearly.
I need to load a big image into GPU memory using C++ AMP and divide it into small pieces (like 4x4 pixels). Every piece should be computed by a different thread. I don't know how the threads share access to the image.
Is there any way of doing it in such a way that the threads aren't blocking each other while accessing the image? Maybe they have their own memory that can be accessed exclusively?
Or maybe access to the unified memory is so fast that I shouldn't care about it (however, I don't believe that)? It is really important, because I need to compute about 10k subsets for every image.
For C++ AMP you want to load the data that each thread within a tile uses into tile_static memory before starting your convolution calculation. Because each thread accesses pixels which are also read by other threads, this allows you to do a single read for each pixel from (slow) global memory and cache it in (fast) tile static memory so that all of the subsequent reads are faster.
You can see an example of tiling for convolution here. The DetectEdgeTiled method loads all the data that it requires and then calls idx.barrier.wait() to ensure all the threads have finished writing data into tile static memory. Then it executes the edge detection code, taking advantage of tile_static memory. There are many other examples of this pattern in the samples. Note that the loading code in DetectEdgeTiled is complex only because it must account for the additional pixels around the edge of the pixels that are being written in the current tile, and it is essentially an unrolled loop, hence its length.
I'm not sure you are thinking about the problem in quite the right way. There are two levels of partitioning here. To calculate the new value for each pixel, the thread doing this work reads the block of surrounding pixels. In addition, blocks (tiles) of threads load larger blocks of pixel data into tile_static memory. Each thread in the tile then calculates the result for one pixel within the block.
This code was taken from CodePlex and I stripped out a lot of the real implementation to make it clearer.
WRT @sharpneli's answer, you can use texture<> in C++ AMP to achieve the same result as OpenCL images. There is also an example of this on CodePlex.
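
To make the two levels of partitioning a bit more concrete, here is a minimal CPU-only sketch in plain C++. It is not C++ AMP code and it is not the CodePlex sample: every name in it (Image, box_sum_direct, box_sum_tiled, kTile, kHalo) is made up for illustration, and the 3x3 box sum is just a stand-in for a real convolution. The local cache array plays the role that tile_static memory would play on the GPU, and the point where the load loop finishes is where idx.barrier.wait() would go.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// All names here are hypothetical; this emulates the tiling pattern on the CPU.
constexpr int kTile = 4;                   // tile edge, like the 4x4 pieces in the question
constexpr int kHalo = 1;                   // border needed by a 3x3 filter
constexpr int kLocal = kTile + 2 * kHalo;  // cached block edge: tile plus halo

struct Image {
    int width;
    int height;
    std::vector<int> pixels;               // row-major; stands in for global memory

    // Clamp-to-edge read, so border pixels have a full 3x3 neighbourhood.
    int at(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Reference version: every output pixel reads its 3x3 block from "global" memory.
std::vector<int> box_sum_direct(const Image& img) {
    assert(img.width > 0 && img.height > 0);
    std::vector<int> out(img.pixels.size());
    for (int y = 0; y < img.height; ++y)
        for (int x = 0; x < img.width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += img.at(x + dx, y + dy);
            out[static_cast<std::size_t>(y) * img.width + x] = sum;
        }
    return out;
}

// Tiled version: each tile first copies its pixels plus halo into a small local
// cache (the tile_static stand-in), then every "thread" reads only the cache.
std::vector<int> box_sum_tiled(const Image& img) {
    assert(img.width > 0 && img.height > 0);
    std::vector<int> out(img.pixels.size());
    for (int ty = 0; ty < img.height; ty += kTile) {
        for (int tx = 0; tx < img.width; tx += kTile) {
            // Phase 1: one read per pixel from "global" memory into the cache.
            std::array<int, kLocal * kLocal> cache{};
            for (int ly = 0; ly < kLocal; ++ly)
                for (int lx = 0; lx < kLocal; ++lx)
                    cache[ly * kLocal + lx] =
                        img.at(tx + lx - kHalo, ty + ly - kHalo);

            // On the GPU a barrier would go here, so that no thread starts
            // reading the cache before all threads have finished filling it.

            // Phase 2: each "thread" (lx, ly) computes one output pixel.
            for (int ly = 0; ly < kTile; ++ly)
                for (int lx = 0; lx < kTile; ++lx) {
                    const int x = tx + lx;
                    const int y = ty + ly;
                    if (x >= img.width || y >= img.height) continue;  // partial tile
                    int sum = 0;
                    for (int dy = 0; dy < 3; ++dy)
                        for (int dx = 0; dx < 3; ++dx)
                            sum += cache[(ly + dy) * kLocal + (lx + dx)];
                    out[static_cast<std::size_t>(y) * img.width + x] = sum;
                }
        }
    }
    return out;
}
```

Both functions should produce identical output. The difference is only in where the reads go: in the tiled version each tile touches global memory once per cached pixel, and the nine reads per output pixel all hit the small local block. In a real C++ AMP kernel the Phase 1 loop would be split across the threads of the tile rather than run by one of them.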