C++ AMP: speed up volume calculation

  • Device : Tesla C2050
  • OS : Windows 7 Enterprise
  • IDE : VS 2012

Hello everyone. I'm using C++ AMP to do some volume calculations.

I have millions of tetrahedrons, each with one vertex at (0,0,0), so I can get the volume of each tetrahedron in a simple way:

sum += triangle.x1 * triangle.y2 * triangle.z3 +
       triangle.y1 * triangle.z2 * triangle.x3 +
       triangle.x2 * triangle.y3 * triangle.z1 -
       triangle.x3 * triangle.y2 * triangle.z1 -
       triangle.x2 * triangle.y1 * triangle.z3 -
       triangle.y3 * triangle.z2 * triangle.x1;
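
(For reference: for a tetrahedron with vertices (0,0,0), (x1,y1,z1), (x2,y2,z2), (x3,y3,z3), this expression is the 3x3 determinant of the three non-origin vertices, which is six times the signed volume, so a single division of the accumulated sum by 6 at the end gives the total volume.)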

So I want to speed up the calculation by using C++ AMP.

Here is the code.

typedef struct
{
    double x1;
    double y1;
    double z1;
    double x2;
    double y2;
    double z2;
    double x3;
    double y3;
    double z3;
} Triangle;

And the main function is:

accelerator my_accelerator(accelerator::default_accelerator);
accelerator_view acc_view = my_accelerator.get_default_view();

const int BLOCK_SIZE = 64;
int outputSize = int(numTriangles / BLOCK_SIZE);

int dimA = int(numTriangles / BLOCK_SIZE) * BLOCK_SIZE;
std::cout<<dimA<<std::endl;

//copy triangles from host to device
array<Triangle,1> triangle(numTriangles);
copy(vTriangle.begin(),vTriangle.end(), triangle);

//Volume
std::vector<double> volumeCPP;
for (int i=0; i < outputSize; i++)
{
    volumeCPP.push_back(double(0));
}
array_view<double,1> volume(outputSize,volumeCPP);
volume.discard_data();

clock_t start,finish;
start = clock();
parallel_for_each(
    volume.extent.tile<1>(),
    [=, &triangle](tiled_index<1> t_idx) restrict(amp)
    {
        double sum = 0.0f;
        tile_static Triangle tile_triangle[4];
        tile_triangle[t_idx.local[0]] = triangle[t_idx.global];
        if (t_idx.local[0] == 0)
        {
            for (int idx=0; idx < BLOCK_SIZE; idx++){
                sum += tile_triangle[idx].x1 * tile_triangle[idx].y2 * tile_triangle[idx].z3 + tile_triangle[idx].y1 * tile_triangle[idx].z2 * tile_triangle[idx].x3 + tile_triangle[idx].x2 * tile_triangle[idx].y3 * tile_triangle[idx].z1 - tile_triangle[idx].x3 * tile_triangle[idx].y2 * tile_triangle[idx].z1 - tile_triangle[idx].x2 * tile_triangle[idx].y1 * tile_triangle[idx].z3 - tile_triangle[idx].y3 * tile_triangle[idx].z2 * tile_triangle[idx].x1;
                //t_idx.barrier.wait();
            }
            //t_idx.barrier.wait();
        }
        volume[t_idx.global] = sum;
    }
);

acc_view.wait();
finish = clock();
copy(volume, volumeCPP.begin());

So, all the work gets done. But the interesting thing is that it costs more time than the single-core CPU code.

The single-core CPU C++ code takes 0.085 seconds to process 1024 * 1024 * 2 triangles, but the C++ AMP code takes 0.530 seconds, much more than the plain C++ code.

After searching on the internet, I found a tip: if we warm up the device first, we can measure the "real" time cost of the calculation.

So I first process 128 triangles to warm up the device (this takes about 0.2 seconds), and then compute the volumes for 1024 * 1024 * 2 triangles. That run became much faster (about 0.091 seconds), but it is still slower than the single-core CPU code.
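
By "warming up" I just mean launching one small kernel and waiting for it before the timed run, so the one-off JIT and device initialization cost is not counted. A throwaway kernel along these lines would do just as well as the 128-triangle run; the helper name below is my own:

void warm_up(concurrency::accelerator_view& acc_view)
{
    concurrency::array<int, 1> dummy(1, acc_view);
    concurrency::parallel_for_each(acc_view, dummy.extent,
        [&dummy](concurrency::index<1> idx) restrict(amp)
    {
        dummy[idx] = 0;    // trivial write, only to force device initialization
    });
    acc_view.wait();       // block until the warm-up kernel has finished
}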

I'd like to know why, and whether anybody can help me speed up the calculation.

Thanks a lot.


There are 2 answers below.


You should be able to speed it up a bit by factoring the expression.

Note that your formula for tetrahedron volume:

+ x1 * y2 * z3
+ y1 * z2 * x3
+ x2 * y3 * z1
- x3 * y2 * z1
- x2 * y1 * z3
- y3 * z2 * x1

is equivalent to:

+ x1 * (y2 * z3 - y3 * z2)
+ y1 * (z2 * x3 - x2 * z3)
+ z1 * (x2 * y3 - x3 * y2)

The original formula has 12 multiplications, while the equivalent one has 9 (25% fewer). It is hard to say how big the total improvement will be, but I would not be surprised if it gives you 20%.
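
If it helps, the factored form drops straight into a kernel as a small helper. This is only a sketch: the Triangle struct is the one from your question, and the function name is mine.

// Returns 6 * the signed volume of a tetrahedron whose fourth vertex is the
// origin, using the factored 9-multiplication form.
inline double tet_volume6(const Triangle& t) restrict(amp, cpu)
{
    return t.x1 * (t.y2 * t.z3 - t.y3 * t.z2)
         + t.y1 * (t.z2 * t.x3 - t.x2 * t.z3)
         + t.z1 * (t.x2 * t.y3 - t.x3 * t.y2);
}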


Firstly, below is what I think is a slightly better implementation, with some comments. Your code is doing some things that can be avoided.

However, what you are really doing here is a reduction. This is an algorithm that has been very heavily researched and optimized. There is a C++ AMP implementation on the AMP Algorithms CodePlex site, implemented as an STL-style algorithm. Before concluding that C++ AMP does not meet your needs, I would try this reduce implementation: it is trivial to do and may give you much better performance. I'd be interested to see how you get on.

The AMP Book CodePlex site contains a helper class for timing C++ AMP kernels. The accompanying book also discusses implementing reduction; it has an entire chapter on it.

#include <amp.h>
#include <ctime>
#include <iostream>
#include <numeric>
#include <vector>

using namespace concurrency;

void Foo()
{
    const int numTriangles = 128;
    std::vector<Triangle> vTriangle(numTriangles);  // fill with your real triangle data

    accelerator my_accelerator(accelerator::default_accelerator);
    accelerator_view acc_view = my_accelerator.get_default_view();

    const int BLOCK_SIZE = 64;
    int outputSize = int(numTriangles / BLOCK_SIZE);

    const int dimA = numTriangles;
    std::cout<<dimA<<std::endl;

    //copy triangles from host to device
    // Use an array_view to automatically sync your data. 
    // You can use acc_view.flush() to make sure that copy is complete 
    // when you are running your timing code. Make this const so that AMP does
    // not copy your input data back to the CPU.

    array_view<const Triangle, 1> triangle(vTriangle.size(), vTriangle.data());

    //Volume
    // Don't push_back this causes (re)allocation as the vector grows. 
    // Set size and fill at the same time.

    std::vector<double> volumeCPP(outputSize, 0.0);

    array_view<double, 1> volume(outputSize, volumeCPP);
    volume.discard_data();

    // I would use the timing code on CodePlex. 
    // It will be more accurate than this.
    clock_t start, finish;
    start = clock();
    parallel_for_each(
        // Not sure a tile size of 1 will be handled that 
        // well by the runtime in terms of perf. I see why you
        // are doing it to get tile_static. You might be better off having larger tiles.

        volume.extent.tile<1>(),
        [=](tiled_index<1> t_idx) restrict(amp)
        {
            double sum = 0.0f;
            for (int idx = 0; idx < BLOCK_SIZE; idx++)
            {
                // Loading the single triangle into tiled memory is a good idea because
                // elements are read more than once.
                tile_static Triangle tile_triangle;
                tile_triangle = triangle[t_idx.global * BLOCK_SIZE + idx];

                sum += tile_triangle.x1 * tile_triangle.y2 * tile_triangle.z3 + 
                    tile_triangle.y1 * tile_triangle.z2 * tile_triangle.x3 + 
                    tile_triangle.x2 * tile_triangle.y3 * tile_triangle.z1 - 
                    tile_triangle.x3 * tile_triangle.y2 * tile_triangle.z1 - 
                    tile_triangle.x2 * tile_triangle.y1 * tile_triangle.z3 - 
                    tile_triangle.y3 * tile_triangle.z2 * tile_triangle.x1;
            }
            volume[t_idx.global] = sum;
        }
    );
    // Force data copy back to CPU.
    volume.synchronize();
    double sum = std::accumulate(begin(volumeCPP), end(volumeCPP), 0.0);
}
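
If you don't want to pull in the CodePlex timing helper, a rough QueryPerformanceCounter-based sketch like the one below already beats clock() for this. The helper name and structure here are my own, not from any library; the important details are to launch the kernel inside the timed region, to wait on the accelerator_view before stopping the clock, and (as you found) to do one untimed warm-up run first.

#include <amp.h>
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency

// Times a callable that launches C++ AMP work on the given accelerator_view.
template <typename Launch>
double measure_seconds(concurrency::accelerator_view& av, Launch launch)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    launch();          // e.g. a lambda that calls parallel_for_each(...)
    av.wait();         // don't stop the clock until the GPU work has finished
    QueryPerformanceCounter(&t1);

    return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}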

Here's a further example that uses the AMP Algorithms Library to implement a solution to your problem using a map/reduce pattern.

std::vector<Triangle> triangles_cpu(1000);

array_view<const Triangle, 1> triangles_gpu(triangles_cpu.size(), triangles_cpu.data());
concurrency::array<double, 1> volumes_gpu(triangles_cpu.size());
array_view<double, 1> volumes_gpuvw(volumes_gpu);
amp_stl_algorithms::transform(begin(triangles_gpu), end(triangles_gpu), begin(volumes_gpuvw), 
    [=](const Triangle& t) restrict(amp)
{
    return t.x1 * (t.y2 * t.z3 - t.y3 * t.z2)
        + t.y1 * (t.z2 * t.x3 - t.x2 * t.z3)
        + t.z1 * (t.x2 * t.y3 - t.x3 * t.y2);
});
double sum = amp_stl_algorithms::reduce(begin(volumes_gpuvw), end(volumes_gpuvw), 0.0);