Repeated single precision complex matrix-vector multiplication (speed and accuracy improvement)

I've boiled a long-running function down to a "simple" series of matrix-vector multiplications. The matrix does not change, but there are a lot of vectors. I have put together a test program with the current state of the algorithm.

I've chased a few options for performance, but what is below is the best I have and it seems to work pretty well.

module maths

contains
subroutine lots_of_MVM(Y,P,t1,t2,startRow)
    implicit none
    ! args
    complex, intent(in), contiguous    :: Y(:,:),P(:,:)
    complex, intent(inout), contiguous :: t1(:,:),t2(:,:)
    integer, intent(in)                :: startRow
    
    ! locals
    integer :: ii,jj,zz,nrhs,n,pCol,tRow,yCol
    ! indexing
    nrhs = size(P,2)/2
    n    = size(Y,1)
    
    ! Do lots of maths
    !$OMP PARALLEL PRIVATE(jj,pCol,tRow,yCol,zz)
    !$OMP DO
    do jj=1,nrhs
        pCol = jj*2-1
        tRow = startRow
        do yCol=1,size(Y,2)
            ! This is faster than doing sum(P(:,pCol)*Y(:,yCol))
            do zz=1,n
                t1(tRow,jj) = t1(tRow,jj) + P(zz,pCol  )*Y(zz,yCol)
                t2(tRow,jj) = t2(tRow,jj) + P(zz,pCol+1)*Y(zz,yCol)
            end do
            tRow = tRow + 1
        end do
    end do
    !$OMP END DO
    !$OMP END PARALLEL
    
end subroutine

end module
    
program test
    use maths
    use omp_lib
    implicit none
    ! variables
    complex, allocatable,dimension(:,:) :: Y,P,t1,t2
    integer :: n,nrhs,nY,yStart,yStop,mult
    double precision startTime
    
    ! setup (change mult to make problem larger)
    ! real problem size mult = 1000 to 2000
    mult = 300
    n = 10*mult
    nY = 30*mult
    nrhs = 20*mult
    yStart = 5
    yStop  = yStart + nrhs - 1
    
    ! allocate
    allocate(Y(n,nY),P(n,nrhs*2))
    allocate(t1(nrhs,nrhs),t2(nrhs,nrhs))
    
    ! make some data
    call random_number(Y%re)
    call random_number(Y%im)
    call random_number(P%re) 
    call random_number(P%im)
    t1 = 0
    t2 = 0
    
    ! do maths
    startTime = omp_get_wtime()
    call lots_of_MVM(Y(:,yStart:yStop),P,t1,t2,1)
    write(*,*) omp_get_wtime()-startTime
end program
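
For anyone wanting to run the test: with the flags listed under "Other info" below, plus OpenMP enabled (for ifort that is -qopenmp), a typical build and run looks something like this (the file name test.f90 is just illustrative):

ifort -O3 -xHost -qopenmp test.f90
OMP_NUM_THREADS=2 ./a.out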

Things I tried for performance (maybe incorrectly)

  • Alignment of the data to 64-byte boundaries (start of the matrix and each column). I used the associated compiler directive to tell the compiler this, and it seemed to make no difference. This was implemented by copying Y and P with extra padding; I would like to avoid that increase in memory anyway.
  • MKL cgemv_batch_strided. The OMP nested do loops win over MKL; MKL is probably not optimized for a stride of 0 on the A matrix.
  • Swapping the 2nd and 3rd loops so t1 and t2 fill full columns. This requires an in-place transpose at the end.

In addition to performance, I would like better accuracy; more accuracy for similar performance would be acceptable. I tried a few things for this but ended up with significantly slower performance.
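
To make the accuracy side concrete, here is a minimal sketch of the kind of change I mean (illustrative only, not my actual attempt): keep Y, P, t1 and t2 in single precision, so memory use does not change, but carry each running sum in a double precision accumulator and round back to single precision once per element.

subroutine lots_of_MVM_dp_acc(Y,P,t1,t2,startRow)
    implicit none
    ! args (same interface as lots_of_MVM above)
    complex, intent(in), contiguous    :: Y(:,:),P(:,:)
    complex, intent(inout), contiguous :: t1(:,:),t2(:,:)
    integer, intent(in)                :: startRow

    ! locals
    integer, parameter :: dp = kind(1.0d0)
    complex(dp) :: acc1,acc2   ! double precision running sums
    integer :: jj,zz,nrhs,n,pCol,tRow,yCol
    ! indexing
    nrhs = size(P,2)/2
    n    = size(Y,1)

    !$OMP PARALLEL PRIVATE(jj,pCol,tRow,yCol,zz,acc1,acc2)
    !$OMP DO
    do jj=1,nrhs
        pCol = jj*2-1
        tRow = startRow
        do yCol=1,size(Y,2)
            acc1 = t1(tRow,jj)
            acc2 = t2(tRow,jj)
            do zz=1,n
                ! promote one factor so the product and the sum are double precision
                acc1 = acc1 + cmplx(P(zz,pCol  ),kind=dp)*Y(zz,yCol)
                acc2 = acc2 + cmplx(P(zz,pCol+1),kind=dp)*Y(zz,yCol)
            end do
            ! round back to single precision once per element
            t1(tRow,jj) = cmplx(acc1,kind=kind(t1))
            t2(tRow,jj) = cmplx(acc2,kind=kind(t2))
            tRow = tRow + 1
        end do
    end do
    !$OMP END DO
    !$OMP END PARALLEL

end subroutine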

Restrictions

  • I can't just throw more cores at it with OMP; I will probably use only 2-4 cores.
  • Memory consumption should not increase drastically

Other info

  • I'm using the Intel Fortran compiler on RHEL 8
  • Compiler flags: -O3 -xHost
  • Set the mult variable to 1000 or 2000 for a more representative problem size
  • This code runs on a dual-socket Intel system at the moment, but could go to AMD.
  • I want to run as many instances of this code as possible at a time. I'm currently able to run 24 concurrently on a dual 28-core processor system with 768 GB of RAM; I am basically RAM- and core-limited for that run (2 cores per run).
  • This is part of a larger code. Much of the rest is single-threaded, and it's not trivial to make it multi-threaded. I'm targeting this section because it is the most time-consuming portion.
  • I have implemented this using the CUDA API's batched cgemv on a GPU (GV100). It is much faster there (6x), but the CPU wins on total throughput, as a single run on the GPU saturates compute or memory bandwidth.

Answer (Ian Bush, accepted):

Rewriting it using cgemm/zgemm increases the speed by a factor of about 4-10 using either ifort or gfortran with OpenBLAS. Here's the code I knocked together:

Module Precision_module

  Use, Intrinsic :: iso_fortran_env, Only : wp => real64
!  Use, Intrinsic :: iso_fortran_env, Only : wp => real32

  Implicit None

  Private

  Public :: wp

End Module Precision_module


Module maths

!  Use, Intrinsic :: iso_fortran_env, Only : wp => real64

  Use Precision_module, Only : wp

Contains
  Subroutine lots_of_MVM(Y,P,t1,t2,startRow)
    Implicit None
    ! args
    Complex( wp ), Intent(in), Contiguous    :: Y(:,:),P(:,:)
    Complex( wp ), Intent(inout), Contiguous :: t1(:,:),t2(:,:)
    Integer, Intent(in)                :: startRow

    ! locals
    Integer :: jj,zz,nrhs,n,pCol,tRow,yCol
    ! indexing
    nrhs = Size(P,2)/2
    n    = Size(Y,1)

    ! Do lots of maths
    !$OMP PARALLEL PRIVATE(jj,pCol,tRow,yCol,zz)
    !$OMP DO
    Do jj=1,nrhs
       pCol = jj*2-1
       tRow = startRow
       Do yCol=1,Size(Y,2)
          ! This is faster than doing sum(P(:,pCol)*Y(:,yCol))
          Do zz=1,n
             t1(tRow,jj) = t1(tRow,jj) + P(zz,pCol  )*Y(zz,yCol)
             t2(tRow,jj) = t2(tRow,jj) + P(zz,pCol+1)*Y(zz,yCol)
          End Do
          tRow = tRow + 1
       End Do
    End Do
    !$OMP END DO
    !$OMP END PARALLEL

  End Subroutine lots_of_MVM

End Module maths

Program test

  Use, Intrinsic :: iso_fortran_env, Only : numeric_storage_size, real32

  Use Precision_module, Only : wp
  
  Use maths
  Use omp_lib, Only : omp_get_wtime, omp_get_max_threads
  Implicit None

  ! variables
  Complex( wp ), Allocatable,Dimension(:,:) :: Y,P,t1,t2
  Integer :: n,nrhs,nY,yStart,yStop,mult
  Real( wp ) :: startTime

  Complex( wp ), Allocatable, Dimension( :, : ) :: t3, t4
  Real( wp ) :: mem_reqd, mem_reqd_Gelements
  Real( wp ) :: tloop, tblas

  ! setup (change mult to make problem larger)
  ! real problem size mult = 1000 to 2000
  mult = 300
  !mult = 50 ! for debug
  n = 10*mult
  nY = 30*mult
  nrhs = 20*mult
  yStart = 5
  yStop  = yStart + nrhs - 1

  ! allocate
  Allocate(Y(n,nY),P(n,nrhs*2))
  Allocate(t1(nrhs,nrhs),t2(nrhs,nrhs))

  mem_reqd = Size( Y ) + Size( P ) + Size( t1 ) + Size( t2 )
  mem_reqd_Gelements = mem_reqd / ( 1024.0_wp * 1024.0_wp * 1024.0_wp )
  Write( *, * ) 'Mem reqd: ', mem_reqd_Gelements, ' Gelements'


  ! make some data
  Call random_Number(Y%re)
  Call random_Number(Y%im)
  Call random_Number(P%re) 
  Call random_Number(P%im)
  t1 = 0
  t2 = 0

  ! do maths
  Write( *, * ) 'Using ', omp_get_max_threads(), ' threads'
  Write( *, * ) 'Using ', Merge( 'single', 'double', Kind( y ) == real32 ), ' precision'
  startTime = Real( omp_get_wtime(), wp )
  Call lots_of_MVM(Y(:,yStart:yStop),P,t1,t2,1)
  tloop = Real( omp_get_wtime(), wp ) - startTime
  Write(*,*) 'TLoop: ', tloop

  Allocate( t3, mold = t1 )
  Allocate( t4, mold = t2 )

  t3 = 0.0_wp
  t4 = 0.0_wp

  startTime = Real( omp_get_wtime(), wp )
  Call zgemm( 'T', 'N', nrhs, nrhs, n, ( 1.0_wp, 0.0_wp ), Y ( 1, ystart ), n    , &
                                                          P ( 1, 1      ), 2 * n, &
                                      ( 1.0_wp, 0.0_wp ), t3             , nrhs )
  Call zgemm( 'T', 'N', nrhs, nrhs, n, ( 1.0_wp, 0.0_wp ), Y ( 1, ystart ), n    , &
                                                          P ( 1, 2      ), 2 * n, &
                                      ( 1.0_wp, 0.0_wp ), t4             , nrhs )
  tblas = Real( omp_get_wtime(), wp ) - startTime
  Write(*,*) 'TBlas: ', tblas
  Write( *, * ) 'Time ratio ', tloop / tblas, ' ( big means blas better )'
  Write( *, * ) "Max diff in t1 ", Maxval( Abs( t3 - t1 ) )
  Write( *, * ) "Max diff in t2 ", Maxval( Abs( t4 - t2 ) )

End Program test

Note I have used double precision throughout as I know what to expect in terms of errors here. Results for ifort on 2 threads:

ijb@ijb-Latitude-5410:~/work/stack$ ifort -O3 -qopenmp mm.f90 -lopenblas
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 Mem reqd:   0.125728547573090       Gelements
 Using            2  threads
 Using double precision
 TLoop:    71.4670290946960     
 TBlas:    17.8680319786072     
 Time ratio    3.99971464010481       ( big means blas better )
 Max diff in t1   1.296029998720414E-011
 Max diff in t2   1.273302296896508E-011

Results for gfortran:

ijb@ijb-Latitude-5410:~/work/stack$ gfortran-12 -fopenmp -O3 -Wall -Wextra -pedantic -Werror -std=f2018  mm.f90 -lopenblas
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 Mem reqd:   0.12572854757308960       Gelements
 Using            2  threads
 Using double precision
 TLoop:    185.08875890000490     
 TBlas:    16.093782140000258     
 Time ratio    11.500637779852656       ( big means blas better )
 Max diff in t1    1.2732928769591198E-011
 Max diff in t2    1.3642443193628513E-011

Those differences are about what I would expect for double precision.

If the arguments to zgemm are confusing, take a look at BLAS LDB using DGEMM.
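
In short: because LDB is passed as 2*n, zgemm treats its B argument as a matrix whose column jj starts (jj-1)*2*n elements after P(1,1), which is exactly the start of P(1,2*jj-1). So the first call computes t3(i,jj) = t3(i,jj) + sum over zz of Y(zz,yStart+i-1)*P(zz,2*jj-1), i.e. it uses only the odd columns of P and matches the t1 accumulation in the loop version, while starting the second call at P(1,2) picks up the even columns for t4.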

Running in single precision (and changing the call to cgemm) tells a comparable story, obviously with bigger differences, around 10^-3 to 10^-4, at least for gfortran. As I don't use single precision in my own work I have less of a feel for what to expect here, but this doesn't seem unreasonable:

 ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 Mem reqd:   0.125728548      Gelements
 Using            2  threads
 Using single precision
 TLoop:    147.786453    
 TBlas:    8.18331814    
 Time ratio    18.0594788      ( big means blas better )
 Max diff in t1    7.32427742E-03
 Max diff in t2    6.83614612E-03
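
For reference, the single precision version above is the same code with wp switched to real32 in Precision_module and the calls changed along these lines (sketch; only the routine name differs, since alpha, beta and the arrays already follow wp):

  Call cgemm( 'T', 'N', nrhs, nrhs, n, ( 1.0_wp, 0.0_wp ), Y ( 1, ystart ), n    , &
                                                            P ( 1, 1      ), 2 * n, &
                                        ( 1.0_wp, 0.0_wp ), t3             , nrhs )
  ! second call: identical except it starts at P ( 1, 2 ) and accumulates into t4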

As for what precision you want and what you consider accurate, well, you don't say, so I can't really address that, save to say that the simplest option is to move to double precision if you can take the memory hit - the speed-up from zgemm will easily outweigh any performance hit you would take. But for performance it's the same story as for any code: if you can rewrite it in terms of a matrix multiply, you will win.
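
To put the memory hit in numbers, using the sizes in the test program: at mult = 1000 the arrays Y, P, t1 and t2 hold roughly 0.3e9 + 0.4e9 + 0.4e9 + 0.4e9 = 1.5e9 complex elements, i.e. about 12 GB in single precision versus 24 GB in double; at mult = 2000 every dimension doubles, so each array is four times larger, giving roughly 48 GB versus 96 GB per run.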