I started experimenting with C++ AMP (Accelerated Massive Parallelism), a library by Microsoft.
I wrote code to compute C = A * B (or C = A * B^T).
My CPU intuition suggests that taking a transposed B matrix as input should give a significant speedup, since the inner loop would then read B row by row. The loops and accesses would be:
for(y) for(x) for(k) C[y][x] += A[y][k] * B[x][k];
When run on the GPU, this variant is roughly an order of magnitude slower than the one without transposition: accessing B[k][x] instead lowers the running time from 700 ms to 100 ms.
This makes no sense to me. Can anyone explain this behavior? More generally, what are the memory-layout rules for writing performant GPU algorithms?
Code producing above results: https://gist.github.com/Noxitu/d961889140691693072562eac08e50bc
(I think AMP is installed with Visual Studio by default.)
(Note that the code claims most of the time is spent in the "synchronize" call, but given how consistent my benchmarks are, I believe this means parallel_for_each is an asynchronous operation.)