PGI Compiler Parallelization +=


I am working on getting a vector and matrix class parallelized and have run into an issue. Any time I have a loop in the form of

for (int i = 0; i < n; i++) b[i] += a[i] ;

the compiler reports a data dependency and will not parallelize it. The Intel compiler is smart enough to handle this without any pragmas. I would like to avoid the pragma that disables the dependency check, both because of the vast number of loops similar to this one and because the real cases are more complicated, so I want the compiler to keep checking in case a dependency does exist.

Does anyone know of a compiler flag for the PGI compiler that would allow this?

Thank you,

Justin

edit: Error in the for loop. Wasn't copy pasting an actual loop

There is 1 best solution below

I think the problem is you're not using the restrict keyword in these routines, so the C compiler has to worry about pointer aliasing.

Compiling this program:

#include <stdlib.h>
#include <stdio.h>

void dbpa(double *b, double *a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i] ;

    return;
}

void dbpa_restrict(double *restrict b, double *restrict a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i] ;

    return;
}

int main(int argc, char **argv) {
    const int n=10000;
    double *a = malloc(n*sizeof(double));
    double *b = malloc(n*sizeof(double));

    for (int i=0; i<n; i++) {
        a[i] = 1;
        b[i] = 2;
    }

    dbpa(b, a, n);
    double error = 0.;
    for (int i=0; i<n; i++)
        error += (3 - b[i]);

    if (error < 0.1)
        printf("Success\n");

    dbpa_restrict(b, a, n);
    error = 0.;
    for (int i=0; i<n; i++)
        error += (4 - b[i]);

    if (error < 0.1)
        printf("Success\n");

    free(b);
    free(a);
    return 0;
}

with the PGI compiler:

$ pgcc  -o tryautop tryautop.c -Mconcur -Mvect -Minfo
dbpa:
      5, Loop not vectorized: data dependency
dbpa_restrict:
     11, Parallel code generated with block distribution for inner loop if trip count is greater than or equal to 100
main:
     21, Loop not vectorized: data dependency
     28, Loop not parallelized: may not be beneficial
     36, Loop not parallelized: may not be beneficial

shows that the dbpa() routine, which lacks the restrict keyword, was not parallelized, while the dbpa_restrict() routine was.

Really, though, for this sort of thing you're better off just using OpenMP (or TBB or ABB or...) rather than trying to convince the compiler to autoparallelize for you. Probably better still is to use an existing linear algebra package, either dense or sparse, depending on what you're doing.