- Device : Tesla C2050
- OS : Windows 7 Enterprise
- IDE : VS 2012
Hello everyone. I'm using C++ AMP to do some volume calculations.
I have millions of tetrahedra, each with one vertex at (0,0,0), so I can get their volumes in a simple way:
sum += triangle.x1 * triangle.y2 * triangle.z3 + \
triangle.y1 * triangle.z2 * triangle.x3 + \
triangle.x2 * triangle.y3 * triangle.z1 - \
triangle.x3 * triangle.y2 * triangle.z1 - \
triangle.x2 * triangle.y1 * triangle.z3 - \
triangle.y3 * triangle.z2 * triangle.x1;
So I want to speed up my calculation by using C++ AMP.
Here is the code.
typedef struct
{
    double x1;
    double y1;
    double z1;
    double x2;
    double y2;
    double z2;
    double x3;
    double y3;
    double z3;
} Triangle;
And the main function is:
accelerator my_accelerator(accelerator::default_accelerator);
accelerator_view acc_view = my_accelerator.get_default_view();
const int BLOCK_SIZE = 64;
int outputSize = int(numTriangles / BLOCK_SIZE);
int dimA = int(numTriangles / BLOCK_SIZE) * BLOCK_SIZE;
std::cout<<dimA<<std::endl;
//copy triangles from host to device
array<Triangle,1> triangle(numTriangles);
copy(vTriangle.begin(),vTriangle.end(), triangle);
//Volume
std::vector<double> volumeCPP;
for (int i=0; i < outputSize; i++)
{
    volumeCPP.push_back(double(0));
}
array_view<double,1> volume(outputSize,volumeCPP);
volume.discard_data();
clock_t start,finish;
start = clock();
parallel_for_each(
    volume.extent.tile<1>(),
    [=, &triangle](tiled_index<1> t_idx) restrict(amp)
    {
        double sum = 0.0f;
        tile_static Triangle tile_triangle[4];
        tile_triangle[t_idx.local[0]] = triangle[t_idx.global];
        if (t_idx.local[0] == 0)
        {
            for (int idx=0; idx < BLOCK_SIZE; idx++){
                sum += tile_triangle[idx].x1 * tile_triangle[idx].y2 * tile_triangle[idx].z3
                     + tile_triangle[idx].y1 * tile_triangle[idx].z2 * tile_triangle[idx].x3
                     + tile_triangle[idx].x2 * tile_triangle[idx].y3 * tile_triangle[idx].z1
                     - tile_triangle[idx].x3 * tile_triangle[idx].y2 * tile_triangle[idx].z1
                     - tile_triangle[idx].x2 * tile_triangle[idx].y1 * tile_triangle[idx].z3
                     - tile_triangle[idx].y3 * tile_triangle[idx].z2 * tile_triangle[idx].x1;
                //t_idx.barrier.wait();
            }
            //t_idx.barrier.wait();
        }
        volume[t_idx.global] = sum;
    }
);
acc_view.wait();
finish = clock();
copy(volume, volumeCPP.begin());
So everything works. The interesting thing is that it takes longer than the single-core CPU code.
The single-core C++ code on the CPU takes 0.085 seconds to process 1024 * 1024 * 2 triangles, but the C++ AMP code takes 0.530 seconds, much longer than the plain C++ code.
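The CPU code itself is not shown in the post. For comparing results, a single-core baseline that produces one partial sum per block (the same layout as the `volume` output above) might look like the sketch below. The names `signed_det` and `block_sums_cpu` are hypothetical, and the sketch assumes `block_size` is greater than zero and that leftover triangles beyond the last full block are ignored, as `outputSize` implies.

```cpp
#include <cstddef>
#include <vector>

// Same field layout as the Triangle struct in the question.
struct Triangle {
    double x1, y1, z1;
    double x2, y2, z2;
    double x3, y3, z3;
};

// Per-triangle term from the question's formula (hypothetical helper name).
inline double signed_det(const Triangle& t) {
    return t.x1 * t.y2 * t.z3 + t.y1 * t.z2 * t.x3 + t.x2 * t.y3 * t.z1
         - t.x3 * t.y2 * t.z1 - t.x2 * t.y1 * t.z3 - t.y3 * t.z2 * t.x1;
}

// One partial sum per full block of block_size triangles.
// Assumes block_size > 0; triangles past the last full block are skipped.
inline std::vector<double> block_sums_cpu(const std::vector<Triangle>& tris,
                                          std::size_t block_size) {
    std::vector<double> sums(tris.size() / block_size, 0.0);
    for (std::size_t b = 0; b < sums.size(); ++b) {
        double sum = 0.0;
        for (std::size_t i = 0; i < block_size; ++i) {
            sum += signed_det(tris[b * block_size + i]);
        }
        sums[b] = sum;
    }
    return sums;
}
```

Comparing these per-block sums against the values copied back from the device is one way to check that the kernel computes what was intended before timing it.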
After searching on the internet, I found a tip: if the device is warmed up first, the measurement reflects the "real" cost of the calculation.
So I first calculate 128 triangles to warm up the device (about 0.2 seconds), then get the volume by calculating 1024 * 1024 * 2 triangles. That is much faster (about 0.091 seconds), but still slower than the single-core CPU code.
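The warm-up measurement described above can be sketched generically as follows. The helper name `seconds_after_warmup` is hypothetical, and the idea that the first call pays one-time costs (such as runtime initialization or kernel compilation) is an assumption here, not something measured in this thread.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once untimed (the warm-up), then times a second run.
// `work` should include any synchronization needed for the result to be
// complete, otherwise only the launch is measured.
template <typename F>
double seconds_after_warmup(F&& work) {
    work();  // warm-up run, deliberately not timed

    const auto start = std::chrono::steady_clock::now();
    work();  // measured run
    const auto finish = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(finish - start).count();
}
```

In the code above, the callable would wrap the `parallel_for_each` call together with `acc_view.wait()`. Whether the copy back to the host should be inside the timed region depends on what is being compared against the CPU number.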
I'd like to know why, and whether anybody can help me speed up the calculation.
Thanks a lot.
You should be able to speed it up a bit by factoring out common terms.
Note that your formula for tetrahedron volume:
x1*y2*z3 + y1*z2*x3 + x2*y3*z1 - x3*y2*z1 - x2*y1*z3 - y3*z2*x1
is equivalent to:
x1*(y2*z3 - y3*z2) + y1*(z2*x3 - x2*z3) + z1*(x2*y3 - x3*y2)
The original formula has 12 multiplications and the equivalent formula has 9 (25% fewer). It is hard to say how big the total improvement will be, but I would not be surprised if it gives you 20%.
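As a minimal sketch of that factoring, assuming the terms are grouped by x1, y1 and z1 (other groupings give the same multiplication count), the two forms can be written side by side and checked against each other. The function names `det_expanded` and `det_factored` are hypothetical.

```cpp
// Same field layout as the Triangle struct in the question.
struct Triangle {
    double x1, y1, z1;
    double x2, y2, z2;
    double x3, y3, z3;
};

// Expanded form from the question: 12 multiplications.
inline double det_expanded(const Triangle& t) {
    return t.x1 * t.y2 * t.z3 + t.y1 * t.z2 * t.x3 + t.x2 * t.y3 * t.z1
         - t.x3 * t.y2 * t.z1 - t.x2 * t.y1 * t.z3 - t.y3 * t.z2 * t.x1;
}

// Factored form: 9 multiplications, grouping by x1, y1 and z1.
inline double det_factored(const Triangle& t) {
    return t.x1 * (t.y2 * t.z3 - t.y3 * t.z2)
         + t.y1 * (t.z2 * t.x3 - t.x2 * t.z3)
         + t.z1 * (t.x2 * t.y3 - t.x3 * t.y2);
}
```

With general floating-point inputs the two forms can differ in the last few bits because the operations are rounded in a different order, so a comparison on real data should use a tolerance rather than exact equality.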