I started experimenting with C++ AMP (Accelerated Massive Parallelism), a library by Microsoft.
I wrote code to compute C = A * B (or C = A * B^T).
My CPU intuition suggests that taking a transposed B matrix as input should give a significant speedup, since the inner loop would then read B row by row. The loops and accesses would be:
for(y) for(x) for(k) C[y][x] += A[y][k] * B[x][k];
When run on the GPU, this variant is roughly an order of magnitude slower than the one without transposition: accessing B[k][x] instead lowers the running time from 700 ms to 100 ms.
This makes no sense to me. Can anyone explain this behavior? More generally, what are the memory-layout rules for writing performant GPU algorithms?
Code producing above results: https://gist.github.com/Noxitu/d961889140691693072562eac08e50bc
(I think AMP is installed with Visual Studio by default.)
(Note that the code claims most of the time is spent in the "synchronize" call, but given how consistent my benchmarks are, I believe this means parallel_for_each is an asynchronous operation.)