I just ran a matrix-matrix multiplication once with LAPACK/BLAS and once with a custom loop optimization (tiling). I am a bit puzzled because the simple loop-tiling approach is approximately 43% faster than the BLAS routine. Basically, my question is whether I am making a mistake in applying the BLAS routine. Here is my code:
program test
  implicit none
  integer, parameter :: N = 1000, tile = 2
  real, dimension(N,N) :: a, b, c
  integer :: i, j, k, x, y, z
  double precision :: E, S
  real :: alpha = 1.0, beta = 0.0

  call random_seed()
  call random_number(a)
  call random_number(b)
  c = 0.0   ! zero the accumulator; the tiled loop below only adds to c

  call cpu_time(S)
  ! call sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
  do j = 1, N, tile
    do k = 1, N, tile
      do i = 1, N, tile
        do y = j, min(j+tile-1, N)
          do x = i, min(i+tile-1, N)
            do z = k, min(k+tile-1, N)
              c(x,y) = c(x,y) + a(x,z) * b(z,y)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
  call cpu_time(E)

  print *, (E - S)
end program test
I run this calculation on an Intel Core2 Duo machine with 4 GB of DRAM and 3096 KB of cache. For the loop version the program is compiled with:

$ gfortran -O3 test.f03 -o test

and prints a run time (in seconds) of:

0.9359

For the BLAS version (the tiled loop removed and the sgemm call uncommented) it is compiled with:

$ gfortran -O3 test.f03 -lblas -o test

and prints:

1.3399
So am I misunderstanding something about BLAS, or am I missing something else (a compiler optimization, or something I just don't know about)? I ran similar code in C++ with and without Eigen::Matrix and got a considerable gain from using the Eigen library for matrix-matrix multiplication, which is why my expectations of the BLAS library were similarly high.
The BLAS routine is used correctly. The only difference is that BLAS is performing

c := alpha*a*b + beta*c

while your loop performs

c := c + a*b

Since you pass alpha = 1.0 and beta = 0.0, the two compute the same product, provided c starts out zeroed.
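Spelled out against the call already shown in the question, a short annotated sketch of that difference:

! sgemm computes c := alpha*op(a)*op(b) + beta*c. With alpha = 1.0 and
! beta = 0.0 this reduces to c := a*b, so c is overwritten and needs no
! initialization beforehand.
call sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)

! The hand-written tiled loop only accumulates into c, so it must
! start from a zeroed matrix:
c = 0.0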
In your loop you are trying to make better use of the CPU's cache. There are optimized variants of BLAS that perform similar cache-aware blocking. I suggest you try the OpenBLAS, ATLAS, or MKL (Intel compiler) libraries; you will get great time improvements.
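For reference, a minimal self-contained sketch of how the sgemm benchmark might look against such a library (assuming OpenBLAS is installed and linkable with -lopenblas; the file name bench.f90 is just an example). It times with system_clock instead of cpu_time, because cpu_time adds up the CPU time of all threads and would overstate the cost of a multithreaded BLAS:

program bench
  implicit none
  integer, parameter :: N = 1000
  real, dimension(N,N) :: a, b, c
  real :: alpha = 1.0, beta = 0.0
  integer :: t0, t1, rate

  call random_seed()
  call random_number(a)
  call random_number(b)

  call system_clock(t0, rate)
  ! c := alpha*a*b + beta*c; with beta = 0.0, c is simply overwritten
  call sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
  call system_clock(t1)

  print *, 'wall time (s): ', real(t1 - t0) / real(rate)
  print *, c(1,1)   ! touch the result so the multiplication cannot be skipped
end program bench

Compiled, for example, with:

$ gfortran -O3 bench.f90 -lopenblas -o bench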