AMP C++ compute maximum value over an array

391 Views Asked by At

I want to compare the maximum absolute difference between two three-dimensional arrays with C++ AMP.

With OpenMP it is easy. Considering 2 arrays

float*** u, ***uOld;

The code is:

double residual = 0;
#pragma omp parallel for schedule(static) collapse(3) reduction(max : residual)
for (int i = 0; i < nbX; i++)
    for (int j = 0; j < nbY; j++)
        for (int k = 0; k < nbTheta; k++)
            residual = std::max(residual, fabs(u[i][j][k] - uOld[i][j][k]));

It would be easy to use max_element from AMP Algorithms, but it is not implemented. I think of something like this, but a reduction is needed at the outer loop level:

extent<1> extTheta(nbTheta);
parallel_for_each(extTheta, [=, &u_, &uOld_](index<1> iTheta) restrict(amp)
{
    type residual = 0;
    for (int iX = 0; iX < nbX; iX++)
    for (int iY = 0; iY < nbY; iY++)
    residual = fast_math::fmax(residual, fast_math::fabs(u_[iX][iY][iTheta] - uOld_[iX][iY][iTheta]));
})

The data is on the GPU, and I do not want it to transit on the GPU for efficiency reasons. How to do it efficiently?

1

There are 1 best solutions below

0
seb007 On

Here is a solution, inspired from msdn blog: https://blogs.msdn.microsoft.com/nativeconcurrency/2012/03/08/parallel-reduction-using-c-amp

parallel_for_each(extent<3>(nbTheta, nbX, nbY), [=, &u_, &uOld_](index<3> idx) restrict(amp)
{
    uOld_[idx[0]][idx[1]][idx[2]] = abs1(u_[idx[0]][idx[1]][idx[2]] - uOld_[idx[0]][idx[1]][idx[2]]);
});


array_view<float, 1> residualReduce_ =  uOld_.view_as<1>(extent<1>(nbTheta*nbX*nbY));
array_view<float, 1> residual_ = residualReduce_.section(index<1>(0), extent<1>(1));
for (unsigned shift = nbTheta*nbX*nbY / 2; shift > 0; shift /= 2)
{
    parallel_for_each(extent<1>(shift), [=](index<1> idx) restrict(amp)
    {
        residualReduce_[idx[0]] = fast_math::fmax(residualReduce_[idx[0]], residualReduce_[idx[0] + shift]);
        if (shift % 2){ //If odd, each thread includes a shifted entry. One will match the end of the queue
            residualReduce_[idx[0]] = fast_math::fmax(residualReduce_[idx[0]], residualReduce_[idx[0] + shift + 1]);
        }
    });
}
concurrency::copy(residual_, &residual);
parallel_for_each(extent<3>(nbTheta, nbX, nbY), [=, &u_, &uOld_](index<3> idx) restrict(amp)
{
    uOld_[idx[0]][idx[1]][idx[2]] = u_[idx[0]][idx[1]][idx[2]];
})

Unlike the snippet in the question, the solution includes the update of uOld to U.

The reduction is not the most efficient, but is still fast compared to the rest of the code.