Array Sum Benchmark on GPU - Odd Results?


I am currently doing some benchmark tests using OpenCL on an AMD Radeon HD 7870.

The code that I have written in JOCL (the Java bindings for OpenCL) simply adds two 2D arrays (z = x + y), but it does so many times over (z = x + y + y + y + y + ...).

The size of the two arrays I am adding is 500 by 501, and I am looping over the number of iterations for which I want to add them together on the GPU. So first I add them once, then ten times, then one thousand times, and so on.
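To make the computation concrete, here is a minimal sketch of what such a kernel could look like in the JOCL host code (this is a simplified illustration rather than my actual code: it assumes the iteration loop runs inside the kernel, that the 2D arrays are flattened to 1D, and all names are placeholders):

    // Simplified sketch (not my actual code): the kernel source as a Java string
    // for JOCL. One work-item per element of the flattened 500x501 array; each
    // iteration of the loop performs one floating-point addition.
    private static final String KERNEL_SOURCE =
        "__kernel void repeatedAdd(__global const float *x,\n" +
        "                          __global const float *y,\n" +
        "                          __global float *z,\n" +
        "                          const int iterations)\n" +
        "{\n" +
        "    int gid = get_global_id(0);\n" +
        "    float acc = x[gid];\n" +
        "    for (int i = 0; i < iterations; i++) {\n" +
        "        acc += y[gid];   // one floating-point add per iteration\n" +
        "    }\n" +
        "    z[gid] = acc;\n" +
        "}";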

The maximum number of iterations that I loop to is 100,000,000. Below is what the log file looks like when I run my code (counter is the number of times my program executes in 5 seconds):

Number of Iterations: 1
Counter: 87
FLOPS Rate: 0.0043310947 GFLOPs/s

Number of Iterations: 10
Counter: 88
FLOPS Rate: 0.043691948 GFLOPs/s

Number of Iterations: 100
Counter: 84
FLOPS Rate: 0.41841218 GFLOPs/s 

Number of Iterations: 1000
Counter: 71
FLOPS Rate: 3.5104263 GFLOPs/s

Number of Iterations: 10000
Counter: 8
FLOPS Rate: 3.8689642 GFLOPs/s

Number of Iterations: 100000
Counter: 62
FLOPS Rate: 309.70895 GFLOPs/s

Number of Iterations: 1000000
Counter: 17
FLOPS Rate: 832.0814 GFLOPs/s

Number of Iterations: 10000000
Counter: 2
FLOPS Rate: 974.4635 GFLOPs/s

Number of Iterations: 100000000
Counter: 1
FLOPS Rate: 893.7945 GFLOPs/s

Do these numbers make sense? I feel that 0.97 TeraFLOPS is quite high and that I must be calculating the number of FLOPs incorrectly.

Also, I expected the FLOPS I am calculating to level out at some point as the number of iterations increases, but that is not evident here. It seems that if I keep increasing the number of iterations, the calculated FLOPS keeps increasing as well, which also leads me to believe that I am doing something wrong.

Just for reference, I am calculating the FLOPS in the following way:

FLOPS = (counter × 500 × 501 × iterations) / time_elapsed
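For example, taking the 10,000,000-iteration run above (counter = 2) and assuming the window is exactly 5 seconds, this gives 2 × 500 × 501 × 10,000,000 / 5 ≈ 1.0 × 10^12 FLOPS, i.e. roughly 1 TFLOPS, which is consistent with the 974 GFLOPs/s in the log (the actual window was presumably slightly longer than 5 seconds).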

Any help with this issue will be greatly appreciated.

Thank you

EDIT:

I have now done this same benchmark test looping over a range of iteration counts (the number of times I add y to x) as well as array sizes. I have generated a surface plot, which can be seen in this GitHub repository:

https://github.com/ke0m/Senior_Design/blob/master/JOCL/Graphing/GoodGPUPlot.PNG

I have asked others for their opinion of this plot, and they tell me that while the numbers I am calculating are feasible, they are artificially high. They say this is evident in the steep slope in the plot, which does not really make any physical sense. One suggested explanation for the steep slope is that the compiler converts the variable that controls the iterations (of type int) to a short, which forces this number to stay below roughly 32,000. That would mean I am doing less work on the GPU than I think I am, and therefore calculating an inflated GFLOPS value.

Can anyone confirm this idea or offer any other ideas as to why the plot looks the way it does?

Thank you again


There are 2 answers below.

Answer 1:

counter × 500 × 501 × iterations - if this is calculated with integers, the result is likely to be too large for an integer register (it overflows). If so, convert to floating point before calculating.
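For example, a sketch in Java (counter, iterations, and elapsedSeconds are assumed names taken from the description in the question):

    // Sketch only: 500 * 501 * iterations already exceeds the 32-bit int range once
    // iterations passes roughly 8,500, so promote to double (or long) before multiplying.
    double flops  = (double) counter * 500.0 * 501.0 * (double) iterations / elapsedSeconds;
    double gflops = flops / 1.0e9;
    System.out.printf("FLOPS Rate: %f GFLOPs/s%n", gflops);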

Answer 2:

I wrote a matrix-matrix multiplication kernel that uses a local memory optimization. On my HD 7870 at stock settings it does roughly 500 billion additions and 500 billion multiplications per second, which makes about 1 TFLOPS. That is quite close to your numbers if your card is at stock settings too.

Yes, your calculations make sense, since the GPU's peak is about 2.5 TFLOPS and you are doing the calculations in local memory / register space, which is what you need to get close to the card's peak values.

You are doing only additions, so you count just 1 flop per iteration (doing no multiplications leaves one pipeline per core empty, I assume, so you get nearly half of the peak).

1 flop per a = b + c

So you are right about the FLOPS values.

But when you don't give the GPU a "resonance condition" for the total item number, such as a multiple of 512 (a multiple of the maximum local item size), or 256, or 1280 (the number of cores), your GPU will not compute at full efficiency and performance will degrade for small arrays.

Also, if you don't launch enough total warps, the threads will not be able to hide main-memory latency, just as in the 1-, 10-, and 100-iteration cases. Hiding memory latency needs multiple warps per compute unit so that all of the ALU and ADDR units (I mean all pipelines) are occupied most of the time. Occupancy is very important here because there are so few arithmetic operations per memory operation. If you decrease the workgroup size from 256 to 64, this can increase occupancy and so give more latency hiding (see the sketch below).
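For example, a minimal JOCL launch sketch (it assumes a commandQueue and kernel already created in your host code, the names are placeholders, and the kernel then needs a guard such as if (gid < totalItems) for the padded-out work-items):

    // Sketch only: pad the global size to a multiple of the local size and try
    // 64 work-items per group instead of 256.
    long totalItems = 500L * 501L;
    long[] localWorkSize  = { 64 };
    long padded = ((totalItems + localWorkSize[0] - 1) / localWorkSize[0]) * localWorkSize[0];
    long[] globalWorkSize = { padded };

    org.jocl.CL.clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            globalWorkSize, localWorkSize, 0, null, null);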

Trial and error can get you to the optimum peak performance. Otherwise your kernel is bottlenecked by main-memory bandwidth and by thread start/stop latencies.

For reference:

HD 7870 SGEMM with a 9x16x16 pblocking algorithm: 1150 GFLOPS for square matrix size = 8208.

Additionally, divisions and special functions can be counted as 50 to 200 flops per item, depending on which version of them is used (for example a software rsqrt() versus a hardware rsqrt() approximation).

Try array sizes that are a multiple of 256, with a high iteration count such as 1M, and try 64 or 128 as the local items per compute unit. If you multiplied at the same time, you could reach a higher flops throughput. You can add a multiplication of y by 2 or 3 to use the multiplication pipelines too! This way you may reach a higher flops figure than before (see the kernel sketch after the expressions below).

x = y + z*2.0f + z*3.0f + z*4.0f + z*5.0f ----> 8 flops

or, to work against the compiler's auto-optimizations,

x = y + z*randomv + z*randomval2 + z*randomval3 + z*randomval4

instead of

x = y + z + z + z + z -----> 4 flops
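As a sketch (placeholder names, not your actual kernel), the inner loop could mix adds and multiplies like this:

    // Sketch only: each iteration now issues 2 multiplies and 2 adds (4 flops)
    // instead of a single add, so the multiplication pipelines are used as well.
    // As noted above, the compiler may fold the constants, so random values read
    // from a buffer could be used instead to defeat that.
    private static final String MIXED_KERNEL_SOURCE =
        "__kernel void repeatedMulAdd(__global const float *x,\n" +
        "                             __global const float *y,\n" +
        "                             __global float *z,\n" +
        "                             const int iterations)\n" +
        "{\n" +
        "    int gid = get_global_id(0);\n" +
        "    float acc = x[gid];\n" +
        "    float v   = y[gid];\n" +
        "    for (int i = 0; i < iterations; i++) {\n" +
        "        acc = acc + v * 2.0f + v * 3.0f;   // 2 mul + 2 add = 4 flops\n" +
        "    }\n" +
        "    z[gid] = acc;\n" +
        "}";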

Edit: I don't know if the HD 7870 uses different ALUs (an extra batch of them) for double-precision (64-bit fp) operations; if it does, then you can use them for mixed-precision operations and get about 10% more flops throughput, because the HD 7870 is capable of 64-bit at 1/8 of the 32-bit speed! You can make your card explode this way.