Optimizing Memory-Bound Loop with Indirect Prefetching


I'm currently working on optimizing a kernel, and one of the most time-consuming loops, despite optimization efforts, still accounts for 80% of the benchmark's execution time. The loop's performance bottleneck stems from its memory-bound nature, primarily due to an indirect memory access pattern that hampers prefetching. I'm seeking guidance on how to hint to the compiler to perform indirect prefetching without compromising other optimizations.

Here's the relevant snippet of the source code:

    for (j = 0; j < lastrow - firstrow + 1; j++) {
        sum = 0.0;
        for (k = rowstr[j]; k < rowstr[j+1]; k++) {
            sum = sum + a[k] * p[colidx[k]];
        }
        q[j] = sum;
    }

The bottleneck arises from the indirect memory access pattern, specifically the access to the p array at position colidx[k], which the hardware prefetcher cannot predict.
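One direction I've been sketching is to issue an explicit software prefetch for the p element needed a few iterations ahead. A minimal sketch using GCC/Clang's __builtin_prefetch (the names spmv_prefetch and PREFETCH_DIST are mine, and the distance value is a guess that would need tuning):

```c
/* Hypothetical prefetch distance; needs empirical tuning. */
#define PREFETCH_DIST 16

/* CSR sparse matrix-vector product q = A*p, with a software
   prefetch of the indirectly addressed p element that will be
   needed PREFETCH_DIST iterations from now. */
void spmv_prefetch(int nrows, const int *rowstr, const int *colidx,
                   const double *a, const double *p, double *q)
{
    for (int j = 0; j < nrows; j++) {
        double sum = 0.0;
        int kend = rowstr[j + 1];
        for (int k = rowstr[j]; k < kend; k++) {
            /* The guard keeps the colidx read-ahead inside this row,
               so no out-of-bounds index is dereferenced. */
            if (k + PREFETCH_DIST < kend)
                __builtin_prefetch(&p[colidx[k + PREFETCH_DIST]], 0, 3);
            sum += a[k] * p[colidx[k]];
        }
        q[j] = sum;
    }
}
```

The worry, of course, is that the extra branch and the prefetch call in the loop body may stop the compiler from vectorizing the reduction.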

After compiling with the Intel compiler using the flags -O3 -xhost -mcmodel=medium -qopt-prefetch=5 -xCOMMON-AVX512, the compiler unrolled the loop by a factor of 8, vectorized it, and inserted some prefetch instructions thanks to the high -qopt-prefetch level. The generated inner loop looks like this:

Block 1:
vmovupd    (%r11,%r10,4), %ymm15
kxnorw     %k0, %k0, %k1
vpxord     %zmm16, %zmm16, %zmm16
vgatherdpd (%r9,%ymm15,8), %k1, %zmm16
vfmadd231pd (%rbx,%r10,8), %zmm16, %zmm14
prefetcht0 0x900(%rbx,%r10,8)
prefetcht0 0x480(%r11,%r10,4)
add $0x8, %r10
cmp %rsi, %r10
jbe 0x403370 <Block 1>

I've experimented with compiler hints and even tried the volatile keyword, but it seems I'm forced to choose between vectorization and software-prefetching p[colidx[k+PREFETCH_FACTOR]]. I believe it should be possible to get both.
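For reference, the other direction I was considering (I haven't confirmed it preserves vectorization) is the Intel compiler's prefetch pragma, which takes a variable name and optional hint/distance arguments; other compilers should just warn about and ignore the unknown pragma. A sketch, with the loop otherwise untouched so the compiler is still free to vectorize:

```c
/* Sketch: ask icc to generate prefetches for p in the inner loop.
   The pragma is Intel-specific; GCC/Clang ignore it with a warning. */
void spmv_pragma(int nrows, const int *rowstr, const int *colidx,
                 const double *a, const double *p, double *q)
{
    for (int j = 0; j < nrows; j++) {
        double sum = 0.0;
#pragma prefetch p
        for (int k = rowstr[j]; k < rowstr[j + 1]; k++)
            sum += a[k] * p[colidx[k]];
        q[j] = sum;
    }
}
```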

Would appreciate any insights or suggestions on how to achieve this balance. Thank you!
