Simple loop tiling example for matrix multiplication

2.3k Views Asked by At

I'm trying to understand what really happens (step by step) when I use loop tiling or blocking to multiply two matrices. F.e. I understand what the code on http://en.wikipedia.org/wiki/Loop_tiling does. However, I can't picture what happens within the cache. Let's say I want to multiply two 4x4 matrices. AxB = C.

Now I want to create 4 submatrices (2x2) for each A and B. so A = [A1 A2 ; A3 A4] and B = [B1 B2 ; B3 B4]. All Elements in the memory for C are initialized to zero. f.e. using calloc.

1) Let's assume the matrices are stored in memory in row-major ordering: row1,row2,row3,row4 ...

2) let's assume I have two cachline with 4 matrix elements each. So when performing the naive matrix multiplication for the first element in c C[0,0] i will have a memory access for A[0,0] and copy a whole matrix row into the cacheline. Then i have a second memory access for B[0,0]. Then C[0,0] = A[0,0] * B[0,0] + C[0,0]. The next step would be C[0,0] = A[0,1] * B[1,0] + C[0,0]. Since A[0,1] is in the first cache line i will have a cache hit. However, B[1,0] is not in the second cache line and i will have a memory access.

Would Loop tiling be of any help in this example? Could anyone explain (step by step) what happens within the cache and why memory accesses are reduced? If this example is not suitable, Could anyone make up one where the benefits of blocking are visible?

Thanks in advance.

0

There are 0 best solutions below