I am trying to parallelize a convolution filter using C++ AMP. I would like to get the following function working, but I don't know how to do it properly:
float* pixel_color[] = new float [16];
concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array);
concurrency::array_view<float, 1> pixel(16, pixel_color); // I don't know which data structure to use here
parallel_for_each(
pixels.extent, [=](concurrency::index<2> idx) restrict(amp)
{
int row=idx[0];
int col=idx[1];
pixels(row, col) = taps(row, col) * pixels(row, col);
pixel[0] += pixels(row, col);
});
pixel_color.synchronize();
pixels_.at<Pixel>(j, i) = pixel_color
}
The main problem is that I don't know how to use the pixel structure properly (which concurrent data structure should I use here, given that I don't need all 16 elements?). I also don't know whether I can safely add the values this way. The code above doesn't work: it does not add the appropriate values to pixel[0]. I would also like to define
concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array);
outside the method (for example, in the header file) and initialize it in the constructor or another function, because copying the data between the CPU and GPU is a bottleneck and takes a lot of time. Does anybody know how to do this?
You're on the right track, but doing in-place manipulations of arrays on a GPU is tricky because you cannot guarantee the order in which different elements are updated.
Here's an example of something very similar. The ApplyColorSimplifierTiledHelper method contains an AMP-restricted parallel_for_each that calls SimplifyIndexTiled for each index in the 2D array. SimplifyIndexTiled calculates a new value for each pixel in destFrame based on the value of the pixels surrounding the corresponding pixel in srcFrame. This solves the race condition issue present in your code.
This code comes from the CodePlex site for the C++ AMP book. The Cartoonizer case study includes several examples of these sorts of image processing problems implemented in C++ AMP using arrays, textures, tiled/untiled, and multi-GPU. The C++ AMP book discusses the implementation in some detail.
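To make the srcFrame/destFrame idea concrete, here is a minimal CPU-only sketch in plain C++ (not C++ AMP, so it compiles without amp.h). It is not the Cartoonizer code: convolve_at and convolve are hypothetical names, and the centred, border-clamped kernel is an assumption. The point it illustrates is that each output pixel accumulates into a local variable and is written exactly once, while the source is only ever read, so there is no shared pixel[0] for work items to race on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes ONE output pixel from a read-only source frame.
// The k x k kernel is centred on (row, col); reads past the edge are clamped
// to the nearest border pixel (an assumption made for this sketch).
float convolve_at(const std::vector<float>& src, int width, int height,
                  const std::vector<float>& taps, int k, int row, int col)
{
    float sum = 0.0f;  // local accumulator: no other work item touches it
    for (int dy = 0; dy < k; ++dy) {
        for (int dx = 0; dx < k; ++dx) {
            const int y = std::max(0, std::min(row + dy - k / 2, height - 1));
            const int x = std::max(0, std::min(col + dx - k / 2, width - 1));
            sum += taps[static_cast<std::size_t>(dy * k + dx)] *
                   src[static_cast<std::size_t>(y * width + x)];
        }
    }
    return sum;
}

// Hypothetical driver: src is never modified, dest is written once per index.
// In C++ AMP the double loop would roughly correspond to a parallel_for_each
// over the destination extent, with the loop body as the restrict(amp) lambda.
std::vector<float> convolve(const std::vector<float>& src, int width, int height,
                            const std::vector<float>& taps, int k)
{
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(taps.size() == static_cast<std::size_t>(k) * k);
    std::vector<float> dest(src.size());
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            dest[static_cast<std::size_t>(row * width + col)] =
                convolve_at(src, width, height, taps, k, row, col);
        }
    }
    return dest;
}
```

Because the iterations are independent, the order in which the GPU schedules them no longer matters. In an AMP version the source would typically be captured as an array_view<const float, 2>, so the runtime knows it never has to be copied back.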
The code uses ArgbPackedPixel, which is simply a mechanism for packing 8-bit RGB values into an unsigned long, as C++ AMP does not support char. If your problem is small enough to fit into a texture, you may want to use one instead of an array: the pack/unpack is implemented in hardware on the GPU, so it is effectively "free", whereas here you have to pay for it with additional compute. There is also an example of this implementation on CodePlex.
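As a rough illustration of what such packing involves, here is a small sketch in plain C++. PackedPixel, Rgb, pack_rgb and unpack_rgb are hypothetical names, not the book's actual helpers, and the byte order (alpha in the top byte, then red, green, blue) is an assumption. It uses std::uint32_t for portability; unsigned long is 32 bits wide on the Microsoft compiler that C++ AMP targets.

```cpp
#include <cstdint>

// Hypothetical stand-in for ArgbPackedPixel: one 32-bit word per pixel.
using PackedPixel = std::uint32_t;

// Channels are held in 32-bit integers because 8-bit types such as char
// cannot be used inside restrict(amp) code.
struct Rgb {
    std::uint32_t r, g, b;
};

// Pack three 8-bit channels into one word; alpha is fixed to 0xFF (opaque).
inline PackedPixel pack_rgb(std::uint32_t r, std::uint32_t g, std::uint32_t b)
{
    return 0xFF000000u | ((r & 0xFFu) << 16) | ((g & 0xFFu) << 8) | (b & 0xFFu);
}

// Recover the three channels with shifts and masks.
inline Rgb unpack_rgb(PackedPixel p)
{
    return Rgb{(p >> 16) & 0xFFu, (p >> 8) & 0xFFu, p & 0xFFu};
}
```

These shifts and masks are the "additional compute" that an array-based kernel pays on every read and write, and that a texture performs in hardware.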