I need to calculate how many FLOPs per transferred value a code must perform so that running it on the GPU is worth the transfer cost and actually increases performance.
Here are the flop rates and assumptions:
1. The PCIe x16 v3.0 bus can transfer data from CPU to GPU at 15.75 GB/s.
2. The GPU can perform 8 single-precision TFLOP/s.
3. The CPU can perform 400 single-precision GFLOP/s.
4. A single-precision floating-point number is 4 bytes.
5. Calculation can overlap with data transfers.
6. The data originally resides in CPU memory.
How would a problem like this be solved step by step?
Interpreting assumption 5 to mean the CPU isn't slowed down in any way by transferring data to the GPU, there is obviously no reason not to use the GPU: you can only gain.
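This can be sanity-checked numerically. The sketch below is my own cost model, not from the question: the CPU keeps computing at full rate while d_gpu bytes are shipped to the GPU, which then computes on them; the variable names and example numbers are illustrative assumptions.

```python
# Sketch under assumption 5: the CPU computes continuously, so the hybrid
# time is the max of the CPU-side compute and the GPU-side transfer+compute.
B = 15.75e9      # PCIe transfer rate, bytes/s
P_CPU = 400e9    # CPU rate, FLOP/s
P_GPU = 8e12     # GPU rate, FLOP/s

def hybrid_time(d, ci, d_gpu):
    """Time when d_gpu bytes go to the GPU and the CPU works on the rest."""
    t_cpu_side = (d - d_gpu) * ci / P_CPU
    t_gpu_side = d_gpu / B + d_gpu * ci / P_GPU   # transfer, then compute
    return max(t_cpu_side, t_gpu_side)

def cpu_only_time(d, ci):
    return d * ci / P_CPU

d, ci = 1e9, 10.0   # example: 1 GB of data, 10 FLOP/byte
best = min(hybrid_time(d, ci, f * d) for f in [i / 100 for i in range(101)])
print(best <= cpu_only_time(d, ci))   # the hybrid can never lose
```

Because d_gpu = 0 recovers the CPU-only time exactly, the best split is never worse than not using the GPU at all, even for very low computational intensity.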
By not taking assumption 5 into account, the question gets more interesting. Assuming the CPU cannot compute while it transfers data to the GPU, I think you are looking for the computational intensity (=: ci, in FLOP/byte) at which it becomes beneficial to halt the CPU's calculation and transfer data so the GPU can participate.

Say you have d bytes of data to process with an algorithm of computational intensity ci. You split the data into d_cpu and d_gpu with

d_cpu + d_gpu = d

It takes

t_1 = d_gpu / (15.75 GB/s)

to transfer the data. Then you let both compute for t_2, choosing the split so that both finish at the same time:

t_2 = ci * d_gpu / (8 TFLOP/s) = ci * d_cpu / (400 GFLOP/s)

The total time is t_3 = t_1 + t_2.

If the CPU does it all alone, it needs

t_4 = ci * d / (400 GFLOP/s)

So the point where both options take the same time is at

t_3 = t_4

with

d_gpu / (15.75 GB/s) + ci * d_gpu / (8 TFLOP/s) = ci * (d_cpu + d_gpu) / (400 GFLOP/s)

resulting in (substitute ci * d_cpu / (400 GFLOP/s) = ci * d_gpu / (8 TFLOP/s) from the t_2 balance, then cancel d_gpu)

ci = (400 GFLOP/s) / (15.75 GB/s) ≈ 25.4 FLOP/byte

or about 102 FLOP per 4-byte single-precision value. Notably, the break-even point depends on neither the GPU's speed nor d: it is exactly the number of FLOPs the CPU forfeits for each byte's worth of transfer time.
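Plugging the given rates into this break-even condition gives concrete numbers. This is a short sketch; the variable names are mine, and the key fact it encodes is that halting the CPU for transfer costs it P_CPU / B FLOPs per byte shipped:

```python
# Break-even computational intensity when the CPU must stall during transfer.
B = 15.75e9           # PCIe x16 v3.0, bytes/s
P_CPU = 400e9         # CPU rate, FLOP/s
BYTES_PER_VALUE = 4   # single precision

ci_breakeven = P_CPU / B                          # FLOP per byte
flops_per_value = ci_breakeven * BYTES_PER_VALUE  # FLOP per transferred value

print(round(ci_breakeven, 1))     # → 25.4 FLOP/byte
print(round(flops_per_value, 1))  # → 101.6 FLOP per value
```

Above roughly 25 FLOP/byte (about 100 FLOP per value), offloading pays for the CPU stall; below it, the CPU is better off computing alone.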