Original question:
I am relatively new to OpenACC, but have so far managed to accelerate my suite of iterative Fortran solvers (based on CG) with relative success, and am getting speed-ups of around 7 on my Nvidia RTX GeForce card. All works fine if no preconditioning is used, or if diagonal preconditioning is used. But, problems start if I want to speed up slightly more sophisticated preconditioners - the inclomplete LDL^T being my favorite.
The piece of code which performs incomplete LDL^T factorization looks like this:
20 do i = 1, n ! browse through rows
21 sum1 = a % val(a % dia(i)) ! take diagonal entry
22 do j = a % row(i), a % dia(i)-1 ! browse only lower triangular part
23 k = a % col(j) ! fetch the column
24 sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
25 end do
26
27 ! Keep only the diagonal from LDL decomposition
28 f % val(f % dia(i)) = 1.0 / sum1
29 end do
This piece of code is inherently sequential. As I browse through rows, I use the factorization performed in previous rows, thus should not be easily parallelizable from the outset. The only way I managed to find to compile it through OpenACC directive, and having in mind its sequential nature, is:
20 !$acc parallel loop seq &
21 !$acc& present(a, a % row, a % col, a % dia, a % val) &
22 !$acc& present(f, f % row, f % col, f % dia, f % val)
23 do i = 1, n ! browse through rows
24 sum1 = a % val(a % dia(i)) ! take diagonal entry
25 !$acc loop vector reduction(+:sum1)
26 do j = a % row(i), a % dia(i)-1 ! browse only lower triangular part
27 k = a % col(j) ! fetch the column
28 sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
29 end do
30
31 ! Keep only the diagonal from LDL decomposition
32 f % val(f % dia(i)) = 1.0 / sum1
33 end do
Although these OpenACC directives keeps calculations on GPUs and give the same results as non-GPU variant, those of you more experienced than me will already see that there is not a hell of lot of parallelism and speed-up here. The outer loop (i
, through rows) is sequential, and the inner loop (j
and k
) browses through several elements only.
Overall, the GPU variant of incomplete LDL^T preconditioned CG solver is several times slower than the non-GPU version.
Does anyone have an idea how to work around this? I understand that it might be far from trivial, but maybe some work has already been done on this issue that I am unaware of, or maybe there is a better way to use OpenACC.
Update a couple of days later
I did my homework, read some articles on Nvidia site and beyond, and yeah, the factorization algorithm is indeed sequential and it seems to be an open question how to go about it on GPUs. In this recent blog: https://docs.nvidia.com/cuda/incomplete-lu-cholesky/index.html the author is using cuBLAS and cuSPARSE for CG, but still does factorization on CPU. The average speed-up he reports is around two, which is slightly underwhelming I dare to say.
So, I decided for a workaround by coloring matrix rows and perform re-factorization color by color, like this:
21 n_colors = 8
22
23 do color = 1, n_colors
24
25 c_low = (color-1) * n / n_colors + 1 ! color's upper bound
26 c_upp = color * n / n_colors ! color's lower bound
27
28 do i = c_low, c_upp ! browse your color band
29 sum1 = a % val(a % dia(i)) ! take diagonal entry
30 do j = a % row(i), a % dia(i)-1 ! only lower traingular
31 k = a % col(j)
32 if(k >= c_low) then ! limit entries to your color
33 sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
34 end if
35 end do
36
37 ! This is only the diagonal from LDL decomposition
38 f % val(f % dia(i)) = 1.0 / sum1
39 end do
40 end do ! colors
With line 32 I am limiting matrix entries only to the color belonging to its loop. Clearly, since the incompleteness of the factorized matrix is more pronounced (more entries are neglected), convergence properties are poorer, but still better than with simple diagonal preconditoner.
I did the following with OpenACC:
21 n_colors = 8
22
23 !$acc parallel loop num_gangs(8) tile(8) ! do each tile in its own color
24 !$acc& present(a, a % row, a % col, a % dia, a % val) &
25 !$acc& present(f, f % row, f % col, f % dia, f % val)
26 do color = 1, n_colors
27
28 c_low = (color-1) * n / n_colors + 1 ! color's upper bound
29 c_upp = color * n / n_colors ! color's lower bound
30
31 !$acc loop seq ! inherently sequential
32 do i = c_low, c_upp ! browse your color band
33 sum1 = a % val(a % dia(i)) ! take diagonal entry
34 !$acc loop vector reduction(+:sum1)
35 do j = a % row(i), a % dia(i)-1 ! only lower traingular
36 k = a % col(j)
37 if(k >= c_low) then ! limit entries to your color
38 sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
39 end if
40 end do
41
42 ! This is only the diagonal from LDL decomposition
43 f % val(f % dia(i)) = 1.0 / sum1
44 end do
45 end do ! colors
The results I obtain from OpenACC variant are identical to those from CPU only, it is just that performance is still much slower on GPUs. Yet, it seems that tile
directive works as I expected, each gang seems to be working on its own color since results I get from GPUs are identical.
Any advice on how to improve performance? Am I doing something utterly silly in the above code? (Profiler shows that the computation is GPU-bound since the entire system is on GPU, but performance is really poor.)
Best regards
Not sure this will help, but this paper describes a method for solving CG on GPUs. It uses cuSPARSE and CUDA for the implementations, but you might be able to get ideas that could be applied to your code.
https://www.dcs.warwick.ac.uk/pmbs/pmbs14/PMBS14/Workshop_Schedule_files/8-CUDAHPCG.pdf