I've boiled a long-running function down to a "simple" series of matrix-vector multiplications. The matrix does not change, but there are a lot of vectors. I have put together a test program with the current state of the algorithm.
I've chased a few options for performance, but what is below is the best I have, and it seems to work pretty well.
module maths
contains

  subroutine lots_of_MVM(Y, P, t1, t2, startRow)
    implicit none
    ! args
    complex, intent(in),    contiguous :: Y(:,:), P(:,:)
    complex, intent(inout), contiguous :: t1(:,:), t2(:,:)
    integer, intent(in)                :: startRow
    ! locals
    integer :: ii, jj, zz, nrhs, n, pCol, tRow, yCol
    ! indexing
    nrhs = size(P,2)/2
    n    = size(Y,1)
    ! Do lots of maths
    !$OMP PARALLEL PRIVATE(jj,pCol,tRow,yCol,zz)
    !$OMP DO
    do jj = 1, nrhs
      pCol = jj*2 - 1
      tRow = startRow
      do yCol = 1, size(Y,2)
        ! This is faster than doing sum(P(:,pCol)*Y(:,yCol))
        do zz = 1, n
          t1(tRow,jj) = t1(tRow,jj) + P(zz,pCol  )*Y(zz,yCol)
          t2(tRow,jj) = t2(tRow,jj) + P(zz,pCol+1)*Y(zz,yCol)
        end do
        tRow = tRow + 1
      end do
    end do
    !$OMP END DO
    !$OMP END PARALLEL
  end subroutine

end module
program test
  use maths
  use omp_lib
  implicit none
  ! variables
  complex, allocatable, dimension(:,:) :: Y, P, t1, t2
  integer          :: n, nrhs, nY, yStart, yStop, mult
  double precision :: startTime
  ! setup (change mult to make the problem larger)
  ! real problem size: mult = 1000 to 2000
  mult = 300
  n    = 10*mult
  nY   = 30*mult
  nrhs = 20*mult
  yStart = 5
  yStop  = yStart + nrhs - 1
  ! allocate
  allocate(Y(n,nY), P(n,nrhs*2))
  allocate(t1(nrhs,nrhs), t2(nrhs,nrhs))
  ! make some data
  call random_number(Y%re)
  call random_number(Y%im)
  call random_number(P%re)
  call random_number(P%im)
  t1 = 0
  t2 = 0
  ! do maths
  startTime = omp_get_wtime()
  call lots_of_MVM(Y(:,yStart:yStop), P, t1, t2, 1)
  write(*,*) omp_get_wtime() - startTime
end program
Things I tried for performance (maybe incorrectly)
- Alignment of the data to 64-byte boundaries (the start of each matrix and of each column). I used the associated directive to tell the compiler this, and it seemed to make no difference. This was implemented by copying Y and P with extra padding; I would like to avoid that increase in memory anyway.
- MKL cgemv_batch_strided (see the sketch after this list). The OMP nested do loops win over MKL; MKL is probably not optimized for a stride of 0 on the A matrix.
- Swapping the 2nd and 3rd loops so that t1 and t2 fill full columns. This requires an in-place transpose at the end.
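For reference, a minimal sketch of how the batched call can be set up. The argument order is my reading of the oneMKL Fortran reference for cgemv_batch_strided, so treat it as an assumption and check it against the MKL include files; the point is that stridea = 0 reuses the same matrix (the odd or even columns of P, selected with lda = 2*n) for every vector, and the subroutine name is made up.

subroutine lots_of_MVM_batched(Y, P, t1, t2, startRow)
  ! Sketch only: one batched call per output matrix.  Assumed argument order:
  ! (trans, m, n, alpha, a, lda, stridea, x, incx, stridex, beta, y, incy,
  !  stridey, batch_size).  Verify against mkl.fi before relying on it.
  implicit none
  complex, intent(in),    contiguous :: Y(:,:), P(:,:)
  complex, intent(inout), contiguous :: t1(:,:), t2(:,:)
  integer, intent(in)                :: startRow
  integer  :: n, nrhs, nvec
  external :: cgemv_batch_strided    ! MKL BLAS-like extension

  n    = size(Y,1)
  nrhs = size(P,2)/2
  nvec = size(Y,2)                   ! batch size: one gemv per column of Y

  ! t1(startRow+i-1,:) += transpose(P(:,odd columns)) * Y(:,i),  i = 1..nvec
  call cgemv_batch_strided('T', n, nrhs, (1.0,0.0),       &
                           P(1,1), 2*n, 0,                & ! stride 0: same A each batch
                           Y(1,1), 1, n,                  & ! x_i = Y(:,i)
                           (1.0,0.0),                     &
                           t1(startRow,1), size(t1,1), 1, & ! y_i = row i of t1
                           nvec)
  ! t2: identical, but starting from the even columns of P
  call cgemv_batch_strided('T', n, nrhs, (1.0,0.0),       &
                           P(1,2), 2*n, 0,                &
                           Y(1,1), 1, n,                  &
                           (1.0,0.0),                     &
                           t2(startRow,1), size(t2,1), 1, &
                           nvec)
end subroutine lots_of_MVM_batched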
In addition to performance, I would like better accuracy. More accuracy at similar performance would be acceptable. I tried a few things for this, but ended up with significantly slower performance.
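One obvious variant in that direction is to accumulate each dot product in double precision and only convert back to single precision on the final store. A sketch of that idea (hypothetical name; note the extra precision roughly doubles the arithmetic in the inner loop):

subroutine lots_of_MVM_dacc(Y, P, t1, t2, startRow)
  ! Sketch: same loop nest, but the running sums are kept in double precision.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  complex, intent(in),    contiguous :: Y(:,:), P(:,:)
  complex, intent(inout), contiguous :: t1(:,:), t2(:,:)
  integer, intent(in)                :: startRow
  integer     :: jj, zz, nrhs, n, pCol, tRow, yCol
  complex(dp) :: acc1, acc2
  nrhs = size(P,2)/2
  n    = size(Y,1)
  !$OMP PARALLEL DO PRIVATE(jj,pCol,tRow,yCol,zz,acc1,acc2)
  do jj = 1, nrhs
    pCol = jj*2 - 1
    tRow = startRow
    do yCol = 1, size(Y,2)
      acc1 = (0.0_dp, 0.0_dp)
      acc2 = (0.0_dp, 0.0_dp)
      do zz = 1, n
        ! promote one factor so the product and the running sum are both done in double
        acc1 = acc1 + cmplx(P(zz,pCol  ), kind=dp)*Y(zz,yCol)
        acc2 = acc2 + cmplx(P(zz,pCol+1), kind=dp)*Y(zz,yCol)
      end do
      ! converted back to single precision on assignment
      t1(tRow,jj) = t1(tRow,jj) + acc1
      t2(tRow,jj) = t2(tRow,jj) + acc2
      tRow = tRow + 1
    end do
  end do
  !$OMP END PARALLEL DO
end subroutine lots_of_MVM_dacc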
Restrictions
- I can't just throw more cores at it with OMP; I will probably use only 2-4 cores
- Memory consumption should not increase drastically
Other info
- I'm using the Intel Fortran compiler on RHEL 8
- Compiler flags: -O3 -xHost
- Set the mult variable to 1000 or 2000 for a more representative problem size
- This code runs on a dual-socket Intel system at the moment, but could go to AMD.
- I want to run as many instances of this code as possible at a time. I'm currently able to run 24 concurrently on a dual 28-core processor system with 768 GB of RAM. I am basically RAM- and core-limited for that run (2 cores per run).
- This is part of a larger code. Much of the rest is single-threaded and it's not trivial to make it multi-threaded. I'm targeting this section because it is the most time-consuming portion.
- I have implemented this using the CUDA API's batched cgemv on a GPU (GV100). It is much faster there (6x), but the CPU wins on total throughput, as a single run on the GPU saturates compute or memory bandwidth.
Rewriting it using cgemm/zgemm increases the speed by a factor of about 4-10, using either ifort or gfortran with OpenBLAS. Here's the code I knocked together. Note I have used double precision throughout, as I know what to expect in terms of errors here.
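The whole loop nest is just two matrix products: t1 accumulates transpose(Y) matmul P(:,1::2) and t2 accumulates transpose(Y) matmul P(:,2::2). The odd and even columns of P can be handed to GEMM without copying by passing P(1,1) or P(1,2) with a leading dimension of 2*n. The sketch below shows that mapping (illustrative rather than my exact benchmark code, and the subroutine name is made up):

subroutine lots_of_MVM_gemm(Y, P, t1, t2, startRow)
  ! Sketch of the GEMM mapping, in double precision:
  !   t1(startRow:,:) = t1(startRow:,:) + transpose(Y) * P(:,1::2)
  !   t2(startRow:,:) = t2(startRow:,:) + transpose(Y) * P(:,2::2)
  ! Note 'T', not 'C': the original loops do not conjugate anything.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  complex(dp), intent(in),    contiguous :: Y(:,:), P(:,:)
  complex(dp), intent(inout), contiguous :: t1(:,:), t2(:,:)
  integer,     intent(in)                :: startRow
  integer  :: n, m, nrhs
  external :: zgemm                 ! double-precision complex GEMM from BLAS

  n    = size(Y,1)                  ! vector length
  m    = size(Y,2)                  ! number of vectors, i.e. rows written in t1/t2
  nrhs = size(P,2)/2                ! number of odd/even column pairs in P

  ! t1 <- t1 + Y**T * P(:,odd columns), odd columns selected via ldb = 2*n
  call zgemm('T', 'N', m, nrhs, n, (1.0_dp,0.0_dp), Y, n, &
             P(1,1), 2*n, (1.0_dp,0.0_dp), t1(startRow,1), size(t1,1))
  ! t2 <- t2 + Y**T * P(:,even columns)
  call zgemm('T', 'N', m, nrhs, n, (1.0_dp,0.0_dp), Y, n, &
             P(1,2), 2*n, (1.0_dp,0.0_dp), t2(startRow,1), size(t2,1))
end subroutine lots_of_MVM_gemm

With beta = 1 the two calls accumulate into t1 and t2, matching what the original loop nest does.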
Results for ifort on 2 threads:
Results for gfortran:
Those differences are about what I would expect for double precision.
If the arguments to zgemm are confusing, take a look at BLAS LDB using DGEMM. Running in single precision (and changing the call to cgemm) is a comparable story, obviously with bigger differences, around 10^-3 to 10^-4, at least for gfortran. As I don't use single precision in my own work I have less of a feel for what to expect here, but this doesn't seem unreasonable:
As for what precision you want and what you consider accurate, well, you don't say, so I can't really address that, save to say that the simplest way is to move to double precision if you can take the memory hit - the use of zgemm will easily outweigh any performance hit you would take. But for performance it's the same story for any code - if you can rewrite it in terms of a matrix multiply, you will win.