Accelerating incomplete LDL^T preconditioner with OpenACC

132 Views Asked by At

Original question:

I am relatively new to OpenACC, but have so far managed to accelerate my suite of iterative Fortran solvers (based on CG) with relative success, and am getting speed-ups of around 7 on my Nvidia RTX GeForce card. All works fine if no preconditioning is used, or if diagonal preconditioning is used. But, problems start if I want to speed up slightly more sophisticated preconditioners - the inclomplete LDL^T being my favorite.

The piece of code which performs incomplete LDL^T factorization looks like this:

20   do i = 1, n                        ! browse through rows
21     sum1 = a % val(a % dia(i))       ! take diagonal entry
22     do j = a % row(i), a % dia(i)-1  ! browse only lower triangular part
23       k = a % col(j)                 ! fetch the column
24       sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
25     end do
26
27     ! Keep only the diagonal from LDL decomposition
28     f % val(f % dia(i)) = 1.0 / sum1
29   end do

This piece of code is inherently sequential. As I browse through rows, I use the factorization performed in previous rows, thus should not be easily parallelizable from the outset. The only way I managed to find to compile it through OpenACC directive, and having in mind its sequential nature, is:

20   !$acc  parallel loop seq                               &
21   !$acc& present(a, a % row, a % col, a % dia, a % val)  &
22   !$acc& present(f, f % row, f % col, f % dia, f % val)
23   do i = 1, n                        ! browse through rows
24     sum1 = a % val(a % dia(i))       ! take diagonal entry
25     !$acc loop vector reduction(+:sum1)
26     do j = a % row(i), a % dia(i)-1  ! browse only lower triangular part
27       k = a % col(j)                 ! fetch the column
28       sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
29     end do 
30
31     ! Keep only the diagonal from LDL decomposition
32     f % val(f % dia(i)) = 1.0 / sum1
33   end do

Although these OpenACC directives keeps calculations on GPUs and give the same results as non-GPU variant, those of you more experienced than me will already see that there is not a hell of lot of parallelism and speed-up here. The outer loop (i, through rows) is sequential, and the inner loop (j and k) browses through several elements only.

Overall, the GPU variant of incomplete LDL^T preconditioned CG solver is several times slower than the non-GPU version.

Does anyone have an idea how to work around this? I understand that it might be far from trivial, but maybe some work has already been done on this issue that I am unaware of, or maybe there is a better way to use OpenACC.

Update a couple of days later

I did my homework, read some articles on Nvidia site and beyond, and yeah, the factorization algorithm is indeed sequential and it seems to be an open question how to go about it on GPUs. In this recent blog: https://docs.nvidia.com/cuda/incomplete-lu-cholesky/index.html the author is using cuBLAS and cuSPARSE for CG, but still does factorization on CPU. The average speed-up he reports is around two, which is slightly underwhelming I dare to say.

So, I decided for a workaround by coloring matrix rows and perform re-factorization color by color, like this:

21   n_colors = 8
22
23   do color = 1, n_colors
24
25     c_low = (color-1) * n / n_colors + 1  ! color's upper bound
26     c_upp =  color    * n / n_colors      ! color's lower bound
27
28     do i = c_low, c_upp                ! browse your color band
29       sum1 = a % val(a % dia(i))       ! take diagonal entry
30       do j = a % row(i), a % dia(i)-1  ! only lower traingular
31         k = a % col(j)
32         if(k >= c_low) then            ! limit entries to your color
33           sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
34         end if
35       end do
36
37       ! This is only the diagonal from LDL decomposition
38       f % val(f % dia(i)) = 1.0 / sum1
39     end do
40   end do    ! colors

With line 32 I am limiting matrix entries only to the color belonging to its loop. Clearly, since the incompleteness of the factorized matrix is more pronounced (more entries are neglected), convergence properties are poorer, but still better than with simple diagonal preconditoner.

I did the following with OpenACC:

21   n_colors = 8
22
23   !$acc  parallel loop num_gangs(8) tile(8)  ! do each tile in its own color
24   !$acc& present(a, a % row, a % col, a % dia, a % val)  &
25   !$acc& present(f, f % row, f % col, f % dia, f % val)
26   do color = 1, n_colors
27
28     c_low = (color-1) * n / n_colors + 1  ! color's upper bound
29     c_upp =  color    * n / n_colors      ! color's lower bound
30
31     !$acc loop seq                     ! inherently sequential
32     do i = c_low, c_upp                ! browse your color band
33       sum1 = a % val(a % dia(i))       ! take diagonal entry
34       !$acc loop vector reduction(+:sum1)
35       do j = a % row(i), a % dia(i)-1  ! only lower traingular
36         k = a % col(j)
37         if(k >= c_low) then            ! limit entries to your color
38           sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
39         end if
40       end do
41
42       ! This is only the diagonal from LDL decomposition
43       f % val(f % dia(i)) = 1.0 / sum1
44     end do 
45   end do    ! colors

The results I obtain from OpenACC variant are identical to those from CPU only, it is just that performance is still much slower on GPUs. Yet, it seems that tile directive works as I expected, each gang seems to be working on its own color since results I get from GPUs are identical.

Any advice on how to improve performance? Am I doing something utterly silly in the above code? (Profiler shows that the computation is GPU-bound since the entire system is on GPU, but performance is really poor.)

Best regards

1

There are 1 best solutions below

0
On

Not sure this will help, but this paper describes a method for solving CG on GPUs. It uses cuSPARSE and CUDA for the implementations, but you might be able to get ideas that could be applied to your code.

https://www.dcs.warwick.ac.uk/pmbs/pmbs14/PMBS14/Workshop_Schedule_files/8-CUDAHPCG.pdf