Weird performance in matrix multiplication using AMP dependent on memory layout


I started experimenting with C++ AMP (Accelerated Massive Parallelism), Microsoft's data-parallel library.

I wrote code to compute C = A * B (or C = A*B^T).

My CPU intuition suggests that taking a transposed B matrix as input should give a significant speedup, since the inner loop then reads both operands sequentially. The loops and accesses become:
for(y) for(x) for(k) C[y][x] += A[y][k] * B[x][k];
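For concreteness, here is a plain CPU sketch of the two variants (my own illustrative code, not the linked gist; it assumes square row-major matrices stored in flat vectors):

```cpp
#include <cstddef>
#include <vector>

// Variant with transposed B: C[y][x] = sum_k A[y][k] * Bt[x][k].
// On a CPU, both A and Bt are read sequentially in the inner loop.
std::vector<float> mul_bt(const std::vector<float>& A,
                          const std::vector<float>& Bt, std::size_t n)
{
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t y = 0; y < n; ++y)
        for (std::size_t x = 0; x < n; ++x)
            for (std::size_t k = 0; k < n; ++k)
                C[y * n + x] += A[y * n + k] * Bt[x * n + k];
    return C;
}

// Variant without transposition: C[y][x] = sum_k A[y][k] * B[k][x].
// Here the inner loop strides through B one full row at a time.
std::vector<float> mul_b(const std::vector<float>& A,
                         const std::vector<float>& B, std::size_t n)
{
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t y = 0; y < n; ++y)
        for (std::size_t x = 0; x < n; ++x)
            for (std::size_t k = 0; k < n; ++k)
                C[y * n + x] += A[y * n + k] * B[k * n + x];
    return C;
}
```

Both produce the same C when Bt is the transpose of B; only the access pattern differs.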

When run on the GPU, however, this variant is several times slower than the one without transposition: indexing B[k][x] instead of B[x][k] lowers the running time from 700 ms to 100 ms.
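A plausible factor (my own guess, not established in the post) is memory coalescing: on a GPU, threads with adjacent x indices execute side by side, so what matters is whether the addresses read by *neighboring threads* in the same step are adjacent, not whether a single thread's own accesses are sequential. A small sketch with illustrative helper names:

```cpp
#include <cstddef>

// Row-major storage: element (r, c) of an n x n matrix lives at r * n + c.
std::size_t offset(std::size_t r, std::size_t c, std::size_t n)
{
    return r * n + c;
}

// Distance (in elements) between what thread (y, x) and its neighbor
// (y, x+1) read from B at the same inner-loop step k, per layout.
std::size_t stride_b(std::size_t k, std::size_t x, std::size_t n)   // B[k][x]
{
    return offset(k, x + 1, n) - offset(k, x, n);  // consecutive addresses
}

std::size_t stride_bt(std::size_t x, std::size_t k, std::size_t n)  // Bt[x][k]
{
    return offset(x + 1, k, n) - offset(x, k, n);  // n elements apart
}
```

With B[k][x], neighboring threads read consecutive floats, which the hardware can service in a few wide memory transactions; with Bt[x][k], the same threads read addresses n elements apart, so each thread needs its own transaction. This inverts the CPU intuition above.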

This makes no sense to me. Can anyone explain this behavior? More generally, what are the memory-layout rules for writing performant GPU algorithms?

Code producing above results: https://gist.github.com/Noxitu/d961889140691693072562eac08e50bc

(I think AMP is installed with Visual Studio by default.)

(Note that the code reports that most of the time is spent in the "synchronize" call, but given how consistent my benchmarks are, I believe parallel_for_each must be an asynchronous operation.)
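That launch-versus-wait pattern can be illustrated with a generic standard C++ sketch (using std::async as a stand-in for the accelerator; the names and the 200 ms "kernel" are illustrative, not from the gist):

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Returns {ms spent launching, ms spent waiting}. The launch returns
// almost immediately; the wait absorbs the 200 ms of "work", just as
// synchronize() would absorb the kernel time after an asynchronous
// parallel_for_each.
std::pair<long long, long long> time_async_work()
{
    using namespace std::chrono;

    auto t0 = steady_clock::now();
    auto job = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(milliseconds(200));  // stand-in for the GPU kernel
    });
    auto t1 = steady_clock::now();

    job.wait();  // the "synchronize" step: blocks until the work is done
    auto t2 = steady_clock::now();

    return { duration_cast<milliseconds>(t1 - t0).count(),
             duration_cast<milliseconds>(t2 - t1).count() };
}
```

So timing only the launch call makes the kernel look free; the real cost shows up at whichever point you finally wait for the result.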
