How to use the Vector Class Library for AVX vectorization together with OpenMP's #pragma omp parallel for reduction?


I'm using OpenMP to parallelize a loop that internally uses AVX-512 via Agner Fog's VCL (Vector Class Library).

Here is the code:

#include "vectorclass.h"  // Agner Fog's VCL; provides Vec8d

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);  // divisors 1..8
  Vec8d sumV(0.0);                                      // running partial sums
  const Vec8d addV(8.0);                                // advance divisors by 8 each iteration
  const Vec8d oneV(1.0);

  #pragma omp parallel for reduction(+:sumV,divV)
  for(i=0; i<N; ++i) {
    sumV += oneV / divV;
    divV += addV;
  }
  return horizontal_add(sumV);
}
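
For reference, a plain scalar version of the same computation (each vector iteration handles 8 lanes, so HarmonicSeries(N) should match the scalar sum of the first 8*N terms, up to floating-point rounding; HarmonicSeriesScalar is just an illustrative name):

// Scalar reference: sum of 1/k for k = 1 .. terms.
// HarmonicSeries(N) above should match HarmonicSeriesScalar(8 * N)
// up to floating-point rounding differences from the summation order.
double HarmonicSeriesScalar(const unsigned long long int terms) {
  double sum = 0.0;
  for (unsigned long long int k = 1; k <= terms; ++k) {
    sum += 1.0 / static_cast<double>(k);
  }
  return sum;
}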

When trying to compile the vectorized code, I'm getting

g++ -Wall -Wextra -O3 -g -I include -fopenmp -m64 -mavx2 -mfma -std=c++17 -o harmonic_series harmonic_series.cpp
harmonic_series.cpp:87:40: error: user defined reduction not found for ‘sumV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)
      |                                        ^~~~
harmonic_series.cpp:87:45: error: user defined reduction not found for ‘divV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)

Any hints on how to solve this and provide the user-defined reduction for the Vec8d class? The reduction is simply the plus operator, which the VCL class already defines, but I cannot find any example of how to declare it for OpenMP.

Thanks a lot for any help!

There are 2 answers below.

Answer by Jirka (accepted):

Here is the final solution. It uses the declared (automatic) reduction and avoids a u64->fp conversion in every iteration by computing divV = startdivV + i * addV only in the first iteration of each thread's chunk and then using divV += addV for all remaining iterations. Runtime to compute the sum of the first 9.6e10 elements is {real 9s, user 1m46s} with 12 threads on an Intel Core i7-10850H CPU - the same as for the manual reduction.

#include "vectorclass.h"  // Agner Fog's VCL; provides Vec8d

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  const Vec8d startdivV = divV;
  bool first_loop = true;
  // Teach OpenMP how to reduce Vec8d with operator+. The initializer clones
  // the original value; that is correct here because sumV is all zeros when
  // the parallel region starts.
  #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
  // It's important to mark "first_loop" as firstprivate so that each private copy gets initialized.
  #pragma omp parallel for firstprivate(first_loop) lastprivate(divV) reduction(+:sumV)
  for(i=0; i<N; ++i) {
    if (first_loop) {
      // First iteration of this thread's chunk: derive divV from the global
      // iteration index (one u64->double conversion per thread).
      divV = startdivV + i * addV;
      first_loop = false;
    } else {
      divV += addV;  // all later iterations just increment
    }
    sumV += oneV / divV;
  }
  return horizontal_add(sumV);
}
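
For anyone who wants to try it, a minimal driver (the value of N reproduces the 9.6e10-element timing above; the output format and the timing calls are just one way to do it), compiled with the g++ line from the question:

#include <cstdio>
#include <omp.h>

int main() {
  // 1.2e10 iterations * 8 lanes = 9.6e10 harmonic terms, as in the timing above.
  const unsigned long long int N = 12000000000ULL;
  const double t0 = omp_get_wtime();
  const double sum = HarmonicSeries(N);
  const double t1 = omp_get_wtime();
  std::printf("sum of first %llu terms = %.15f (%.1f s, %d threads)\n",
              8 * N, sum, t1 - t0, omp_get_max_threads());
  return 0;
}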
Answer by Jirka:

I have found two solutions:

  1. Using #pragma omp declare reduction
#include "vectorclass.h"  // Agner Fog's VCL; provides Vec8d

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  const Vec8d startdivV = divV;

  // User-defined reduction: combine partial sums with operator+ and
  // initialize each private copy from the original (all-zero) sumV.
  #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
  #pragma omp parallel for private(divV) reduction(+:sumV)
  for(i=0; i<N; ++i) {
    divV = startdivV + i * addV;  // recomputed from i in every iteration
    sumV += oneV / divV;
  }
  return horizontal_add(sumV);
}

Drawback - there is some performance penalty, since the divisor has to be recomputed in every iteration as

divV = startdivV + i * addV;

(a u64->double conversion plus a multiply) instead of being incremented at each step by

divV += addV;

(multiplication is slower than addition).

Question: can I use omp parallel for with private, but set the divisor once at the start of each thread's chunk and then only increment it? (This is what the accepted answer above ends up doing with firstprivate and a first_loop flag.)

  divV = startdivV + start_value_of_i * addV;
  for(i=0; i<N; ++i) {
    sumV += oneV / divV;
    divV += addV;
  }
  2. Use omp parallel and omp critical to implement the reduction manually. It gives me more control, but the code is longer and more error-prone. Note that the chunking below actually covers all N iterations even when N is not divisible by the number of threads; the chunks just differ slightly in size.
#include "vectorclass.h"  // Agner Fog's VCL; provides Vec8d
#include <omp.h>          // omp_get_thread_num, omp_get_num_threads

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);

/*
Manual reduction.
Each thread processes its own contiguous chunk [start, end); the chunks
cover all N iterations even when N is not divisible by the thread count.
https://stackoverflow.com/questions/18746282/openmp-schedulestatic-with-no-chunk-size-specified-chunk-size-and-order-of-as
*/
  unsigned long long int start, end;
  int thread_num, num_threads;
  Vec8d localsumV;
  Vec8d localdivV;

  #pragma omp parallel private( i, thread_num, num_threads, start, end, localsumV, localdivV )
  {
    thread_num = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    start = thread_num * N / num_threads;
    end = (thread_num + 1) * N / num_threads;
    localsumV = sumV;                 // starts at zero
    localdivV = start * addV + divV;  // one u64->double conversion per thread
    for(i=start; i<end; ++i) {
      localsumV += oneV / localdivV;
      localdivV += addV;
    }
    #pragma omp critical
    {
      sumV += localsumV;              // serialize the final combine
    }
  }  // end of parallel region
  return horizontal_add(sumV);
}

I like the first version more.

Any hints on how to improve the first version so the divisor is incremented in each iteration instead of being recomputed as divV = startdivV + i * addV;? One possible approach is sketched below.
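
A sketch of such a variant (untested; HarmonicSeriesOmpFor is just an illustrative name): #pragma omp for schedule(static) with no chunk size hands each thread at most one contiguous block of iterations, so the divisor can be set once per thread and only incremented afterwards, while keeping the manual critical-section reduction. It works for any N:

#include "vectorclass.h"  // Agner Fog's VCL; provides Vec8d

double HarmonicSeriesOmpFor(const unsigned long long int N) {
  const Vec8d startdivV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  Vec8d sumV(0.0);

  #pragma omp parallel
  {
    Vec8d localsumV(0.0);
    Vec8d localdivV;   // set in the first iteration of this thread's chunk
    bool first_loop = true;
    // schedule(static) without a chunk size guarantees each thread gets at
    // most one contiguous block of iterations, so the increment trick below
    // is valid.
    #pragma omp for schedule(static)
    for(unsigned long long int i = 0; i < N; ++i) {
      if (first_loop) {
        localdivV = startdivV + i * addV;  // one u64->double conversion per thread
        first_loop = false;
      } else {
        localdivV += addV;
      }
      localsumV += oneV / localdivV;
    }
    #pragma omp critical
    sumV += localsumV;
  }
  return horizontal_add(sumV);
}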