16 cores, yet performance plateaus when computing inner product with >= 4 threads. What's happening?


I'd greatly appreciate anyone's insight on this question. My machine has 16 cores (Intel i9-12900HX) and I want to better understand how to directly leverage them.

To do this, I ran a simple experiment that computes the inner product of two 1-D vectors of doubles, where each vector has 10^9 elements.

First, I compute the inner product with a simple serial loop, and then I compute it in parallel using threads: first with 1 thread, then 2 threads, and so on up to and including 8 threads.
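(As a quick sanity check on the value itself: with a[i] = i/N and b[i] = 1 - i/N, the sum is approximately N times the integral of x(1-x) from 0 to 1, i.e. N/6; for N = 10^9 this is about 166,666,666.67, exactly the figure that appears in the tables below.)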

Here are the results I obtained for one execution instance of the calculations. The numbers are similar from one execution instance to the next (the self-contained code is below):

N Threads    Plel DotProd    Plel Time(ms)    Series DotProd    Series Time(ms)

    1        166666666.67        714.55        166666666.67          694.82
    2        166666666.67        416.97
    3        166666666.67        387.20
    4        166666666.67        316.38
    5        166666666.67        315.09
    6        166666666.67        293.89
    7        166666666.67        286.43
    8        166666666.67        277.68

As a check, notice that all the dot product results agree, and the series time (top row, far right) is in line with the "parallel" time when using only 1 thread (top row, middle). For the parallel calculations in general, though, the speedup seems reasonable going from 1 to 2 threads, is worse than I expected going from 2 to 3 or from 3 to 4, and there is no material benefit beyond 4 threads.
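For concreteness, the measured speedups relative to the 1-thread parallel time are roughly 714.55/416.97 ≈ 1.71x with 2 threads, 714.55/316.38 ≈ 2.26x with 4 threads, and 714.55/277.68 ≈ 2.57x with 8 threads, far short of the ideal 2x, 4x, and 8x.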

Are all these threads running on only 3 (maybe 4) of the 16 cores? Is it possible to know how many cores are actually being used, or to guarantee how many cores a calculation will use? Does the problem lie elsewhere, or have I not accounted for something? Thanks very much in advance for anyone's thoughts. Here's the code (it takes about 6 seconds to run on my ThinkPad); a small core-checking sketch follows the listing:

#include <vector>
#include <thread>
#include <chrono>
#include <iostream>
#include <iomanip>

using std::chrono::high_resolution_clock;
using std::chrono::duration;

//The functor being parallelized
void innerProduct(const int  threadID,
                  const long start,
                  const long end,
                  const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& innerProductResult //ith element holds result from ith thread
                 )
{
    innerProductResult[threadID] = 0.0;
    for (long i = start; i < end; i++)
        innerProductResult[threadID] += a[i] * b[i];
}


int main(void)
{
    //Create and populate
    const int DIMENSION = 1'000'000'000;
    std::vector<double> a(DIMENSION);
    std::vector<double> b(DIMENSION);

    //Use values that won't explode the result or invoke any multiplication optimizations
    for (int i = 0; i < DIMENSION; i++)
    {
        double arg = i / (double)DIMENSION;
        a[i] = arg;
        b[i] = 1.0 - arg;
    }

    //Compute inner product in SERIES
    double seriesInnerProduct = 0.0;
    auto t1 = high_resolution_clock::now();

    for (int i = 0; i < DIMENSION; i++)
        seriesInnerProduct += a[i] * b[i];

    auto t2 = high_resolution_clock::now();
    duration<double, std::milli> millisecs_duration = t2 - t1;   //milliseconds as double
    const double seriesTime = millisecs_duration.count();

    //Compute inner product in PARALLEL
    const int MAX_NUMBER_OF_THREADS = 8;
    std::vector<double> parallelTimings(MAX_NUMBER_OF_THREADS);
    std::vector<double> parallelInnerProducts(MAX_NUMBER_OF_THREADS, 0.0); //initialize elements to zero

    //Compute the inner product using different numbers of threads
    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        //Create batches for subjobs
        const long batch_size = DIMENSION / numThreadsUsed;   //but there may be a remainder, so...
        const long sizeOfLastBatch = batch_size + DIMENSION % numThreadsUsed; //include any remainder

        std::vector<double> innerProductByThread(numThreadsUsed);
        std::vector< std::thread > threads(numThreadsUsed);

        t1 = high_resolution_clock::now();

        //Spread the work over the number of threads being used
        for (int i = 0; i < numThreadsUsed; i++)
        {
            long start = i * batch_size;
            long end = start + (i < numThreadsUsed - 1 ? batch_size : sizeOfLastBatch);

            threads[i] = std::thread(innerProduct, i, start, end, std::ref(a), std::ref(b), std::ref(innerProductByThread));
        }

        // Wait for everyone to finish   
        for (int i = 0; i < numThreadsUsed; i++)
            threads[i].join();

        //Aggregate the (sub) inner product results from the threads used
        for (int i = 0; i < numThreadsUsed; i++)
            parallelInnerProducts[numThreadsUsed - 1] += innerProductByThread[i];

        t2 = high_resolution_clock::now();
        millisecs_duration = t2 - t1;

        parallelTimings[numThreadsUsed - 1] = millisecs_duration.count();
    }

    //Output to screen
    std::cout << "   Num      Plel        Plel     Series     Series " << std::endl;
    std::cout << " Threads  DotProd     Time(ms)   DotProd   Time(ms)" << std::endl << std::endl;

    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        std::cout << std::setw(5) << numThreadsUsed
                  << std::fixed << std::setw(14) << std::setprecision(2) << parallelInnerProducts[numThreadsUsed - 1]
                  << std::fixed << std::setw(9) << std::setprecision(2) << parallelTimings[numThreadsUsed - 1];

        if (numThreadsUsed == 1)
            std::cout << std::fixed << std::setw(13) << std::setprecision(2) << seriesInnerProduct
                      << std::setw(8) << std::setprecision(2) << seriesTime;
        
        std::cout << std::endl;

            
    }
  
  
    std::cout << std::endl << "Press Enter to exit...";
    std::cin.get();
    return 1;
}
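For the question about which cores are in use: below is a minimal sketch of how one might check this on Windows (an illustration I'm adding, not part of the timed experiment). std::thread::hardware_concurrency() reports the number of logical processors the OS exposes (which should report 24 on a 16-core i9-12900HX, since its 8 P-cores each expose two hardware threads), GetCurrentProcessorNumber() reports where a thread is currently running, and SetThreadAffinityMask() can pin a thread to a specific logical processor. The pinning shown here is Win32-specific and optional.

#include <thread>
#include <vector>
#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

int main()
{
    //Logical processors visible to the OS (cores x hardware threads), not physical cores.
    std::cout << "hardware_concurrency: " << std::thread::hardware_concurrency() << "\n";

    const int n = 4;
    std::vector<unsigned> cpuOfThread(n, 0);
    std::vector<std::thread> threads;

    for (int i = 0; i < n; i++)
    {
        threads.emplace_back([i, &cpuOfThread]
        {
#ifdef _WIN32
            //Optionally pin this thread to logical processor i (Win32-only),
            //then record which logical processor it is actually running on.
            SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << i);
            cpuOfThread[i] = GetCurrentProcessorNumber();
#endif
        });
    }

    for (auto& t : threads) t.join();
    for (int i = 0; i < n; i++)
        std::cout << "thread " << i << " ran on logical CPU " << cpuOfThread[i] << "\n";
    return 0;
}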

There are 2 best solutions below


Thanks (@Stephen Newell) for the link to the video for background on cache lines (great video!), and to everyone else for guidance on thrashing of cache lines, OpenMP, and comments on the timing.

After further investigations using these suggestions, I believe the scaling is being hampered by some other constraint that I have yet to identify. Let me first describe my next set of findings, followed by tables and code examples below.

(1) Prompted by the comments, I modified the parallelized function to accumulate the inner product in a local variable, so the nearby elements of the shared result vector are no longer updated on every iteration (the presumed cause of cache lines frequently becoming dirty and thrashing). I saw no change in the timing results (see results and code below).

(2) In my original code example above, the vectors in the inner product were passed to the parallelized function as std::vector objects. Their elements are only read, so they should not cause any cache-line thrashing. Nevertheless, I changed them to double* arrays allocated on the heap, just in case the vector objects carried some unnecessary overhead. As suspected, this made no difference to the timings.

(3) I switched to OpenMP, which avoids having to write a functor to parallelize over at all. I still see the same timings.

Conclusion: From (1), (2) and (3) I see no major change in time scaling from the original code provided (above). The dot product results themselves all remain correct and consistent.

In Table 2 below are the timings and results (still using std::thread, code given below) using the local accumulator in the parallelized functor (and double* instead of vector for the dot product components):

TABLE 2 (Direct use of std::thread and local accumulator):

N Threads    Plel DotProd    Plel Time(ms)    Series DotProd    Series Time(ms)

    1        166666666.67        692.03        166666666.67          695.12
    2        166666666.67        430.89
    3        166666666.67        352.62
    4        166666666.67        320.15
    5        166666666.67        301.24
    6        166666666.67        293.24
    7        166666666.67        286.17
    8        166666666.67        281.84

Note that the scaling pattern is essentially the same as that produced by the original version of the code (above).

And here are the timing results when using OpenMP (code provided below):

TABLE 3 (Uses OpenMP):

N Threads    Plel DotProd    Plel Time(ms)

    1        166666666.67        712.20
    2        166666666.67        422.51
    3        166666666.67        348.47
    4        166666666.67        323.28
    5        166666666.67        300.59
    6        166666666.67        288.93
    7        166666666.67        282.65
    8        166666666.67        279.42

Again, the time scaling is more or less unchanged.

Whatever is going wrong is likely the same in both approaches.

And below are the two respectively self-contained sets of code that produce Tables 2 and 3 above:

CODE FOR TABLE 2:

#include <vector>
#include <thread>
#include <chrono>
#include <iostream>
#include <iomanip>

using std::chrono::steady_clock;
using std::chrono::duration;

//The functor being parallelized
void innerProduct2(const int  threadID,
                   const long start,
                   const long end,
                   double* a,  
                   double* b,
                   std::vector<double>& innerProductResult
                   )
{
    double res = 0.0;
    for (long i = start; i < end; i++)
        res += a[i] * b[i];
    innerProductResult[threadID] = res;
}


int main(void)
{
    //Create and populate
    const int DIMENSION = 1'000'000'000;
    double* a = NULL;
    double* b = NULL;

    try
    {
        a = new double[DIMENSION];
        b = new double[DIMENSION];
    }
    catch (...)
    {
        delete[] a;
        delete[] b;

        return 0;
    }

    //Use values that won't explode the result or invoke any multiplication optimizations
    for (int i = 0; i < DIMENSION; i++)
    {
        double arg = i / (double)DIMENSION;
        a[i] = arg;
        b[i] = 1.0 - arg;
    }

    //Compute inner product in SERIES
    double seriesInnerProduct = 0.0;
    auto t1 = steady_clock::now();

    for (int i = 0; i < DIMENSION; i++)
        seriesInnerProduct += a[i] * b[i];

    auto t2 = steady_clock::now();
    duration<double, std::milli> millisecs_duration = t2 - t1;   //milliseconds as double
    const double seriesTime = millisecs_duration.count();

    //Compute inner product in PARALLEL
    const int MAX_NUMBER_OF_THREADS = 8;
    std::vector<double> parallelTimings(MAX_NUMBER_OF_THREADS);
    std::vector<double> parallelInnerProducts(MAX_NUMBER_OF_THREADS, 0.0); //initialize elements to zero

    //Compute the inner product using different numbers of threads
    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        //Create batches for subjobs
        const long batch_size = DIMENSION / numThreadsUsed;   //but there may be a remainder, so...
        const long sizeOfLastBatch = batch_size + DIMENSION % numThreadsUsed; //include any remainder

        std::vector<double> innerProductByThread(numThreadsUsed);
        std::vector< std::thread > threads(numThreadsUsed);

        t1 = steady_clock::now(); 

        //Spread the work over the number of threads being used
        for (int i = 0; i < numThreadsUsed; i++)
        {
            long start = i * batch_size;
            long end = start + (i < numThreadsUsed - 1 ? batch_size : sizeOfLastBatch);

            threads[i] = std::thread(innerProduct2, i, start, end, a, b, std::ref(innerProductByThread));
        }

        // Wait for everyone to finish   
        for (int i = 0; i < numThreadsUsed; i++)
            threads[i].join();

        //Aggregate the (sub) inner product results from the threads used
        for (int i = 0; i < numThreadsUsed; i++)
            parallelInnerProducts[numThreadsUsed - 1] += innerProductByThread[i];

        t2 = steady_clock::now(); 
        millisecs_duration = t2 - t1;

        parallelTimings[numThreadsUsed - 1] = millisecs_duration.count();
    }

    //Output to screen
    std::cout << "   Num      Plel        Plel     Series     Series " << std::endl;
    std::cout << " Threads  DotProd     Time(ms)   DotProd   Time(ms)" << std::endl << std::endl;

    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        std::cout << std::setw(5) << numThreadsUsed
            << std::fixed << std::setw(14) << std::setprecision(2) << parallelInnerProducts[numThreadsUsed - 1]
            << std::fixed << std::setw(9) << std::setprecision(2) << parallelTimings[numThreadsUsed - 1];

        if (numThreadsUsed == 1)
            std::cout << std::fixed << std::setw(13) << std::setprecision(2) << seriesInnerProduct
            << std::setw(8) << std::setprecision(2) << seriesTime;

        std::cout << std::endl;


    }

    delete[] a;
    delete[] b;

    std::cout << std::endl << "Press Enter to exit...";
    std::cin.get();
    return 1;
}

CODE FOR TABLE 3 (uses OpenMP):

#include <vector>
#include <omp.h>
#include <chrono>
#include <iostream>
#include <iomanip>

using std::chrono::steady_clock;
using std::chrono::duration;


int main(void)
{
    //Create arrays and populate
    const int DIMENSION = 1'000'000'000;
    double* a = NULL;
    double* b = NULL;

    try
    {
        a = new double[DIMENSION];
        b = new double[DIMENSION];
    }
    catch (...)
    {
        delete[] a;
        delete[] b;

        return 0;
    }

    //Use values that won't explode the result or invoke any multiplication optimizations
    for (int i = 0; i < DIMENSION; i++)
    {
        double arg = i / (double)DIMENSION;
        a[i] = arg;
        b[i] = 1.0 - arg;
    }

    const int MAX_NUMBER_OF_THREADS = 8;
    std::vector<double> parallelTimings(MAX_NUMBER_OF_THREADS);
    std::vector<double> parallelInnerProducts(MAX_NUMBER_OF_THREADS, 0.0); //initialize elements to zero

    //Compute the inner product using different numbers of threads
    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {

        auto t1 = steady_clock::now();   
    
        omp_set_dynamic(0);
        omp_set_num_threads(numThreadsUsed);

        double result = 0;

        #pragma omp parallel
        {          
            #pragma omp for reduction(+:result)

            for (int i = 0; i < DIMENSION; i++) {
                result += a[i] * b[i];
            }

            //#pragma omp critical
            //{
            //    //Print the actual number of threads in a thread-safe manner
            //    std::cout << "OpenMP number of threads used: " << omp_get_num_threads()
            //              << ",  ThreadID= " << omp_get_thread_num() << std::endl;
            //}
        }

        auto t2 = steady_clock::now();
        duration<double, std::milli> millisecs_duration = t2 - t1;
        parallelTimings[numThreadsUsed - 1] = millisecs_duration.count();
        parallelInnerProducts[numThreadsUsed - 1] = result;
    }

    //Output to screen
    std::cout << "   Num      Plel        Plel " << std::endl;
    std::cout << " Threads  DotProd     Time(ms)   " << std::endl << std::endl;

    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        std::cout << std::setw(5) << numThreadsUsed
            << std::fixed << std::setw(14) << std::setprecision(2) << parallelInnerProducts[numThreadsUsed - 1]
            << std::fixed << std::setw(9) << std::setprecision(2) << parallelTimings[numThreadsUsed - 1];

        std::cout << std::endl;
    }

    delete[] a;
    delete[] b;

    std::cout << std::endl << "Press Enter to exit...";
    std::cin.get();
    return 1;
}

Lastly, using the OpenMP code above but reducing the length of the vectors from 10^9 to 10^6 gives the following timings:

N Threads    Plel DotProd    Plel Time(ms)

    1          166666.67          0.89
    2          166666.67          0.72
    3          166666.67          0.43
    4          166666.67          0.49
    5          166666.67          0.25
    6          166666.67          0.43
    7          166666.67          0.51
    8          166666.67          0.58

The confusing thing here is that there is little within one's control beyond invoking OpenMP on a simple loop, yet as the thread count increases the timings first decrease (too slowly) and then increase again!

I'm not sure what OpenMP is doing, and the only mistake I can imagine making is somehow mishandling the data itself (unless I've missed something), but what to change currently escapes me. I did use omp_get_num_threads() and omp_get_thread_num() inside the parallel region to verify that the number of threads used in each loop is correct, but that code is commented out in the example above so as not to bias the timing.
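One thing that may help at this scale (a general benchmarking sketch, not a fix for the scaling itself): since each run is well under a millisecond, repeating the timed region several times and keeping the minimum filters out scheduling noise and one-off thread-team startup costs. The helper name and repetition count below are my own choices for illustration:

#include <chrono>
#include <algorithm>
#include <limits>

//Time a callable several times and return the best (minimum) duration in milliseconds.
//The repetition count of 20 is arbitrary; anything spanning a few scheduler quanta will do.
template <typename F>
double bestOfN(F&& work, int reps = 20)
{
    using std::chrono::steady_clock;
    using std::chrono::duration;

    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < reps; r++)
    {
        const auto t1 = steady_clock::now();
        work();  //e.g. the "#pragma omp parallel ... for reduction" block
        const auto t2 = steady_clock::now();
        best = std::min(best, duration<double, std::milli>(t2 - t1).count());
    }
    return best;
}

In the timing loop above it could be used as, e.g., parallelTimings[numThreadsUsed - 1] = bestOfN([&] { /* the parallel region */ });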

I will begin profiling, but in the meantime any additional thoughts are more than welcome!


Below are results and timings that explain what is happening in the problems described in the prior posts in this thread. I present them in the order I investigated them, and I hope folks find them helpful.

What follows are timings of the dot product of two vectors, each of length 10^7. Both std::thread() and OpenMP were used to run the tests. For both libraries, each thread's task was repeated an additional number of times using an outer FOR loop, in order to investigate behavior that depends on how long each thread runs, as follows:

The parallelized functor for std::thread() is modified relative to the prior post so that it now contains the extra FOR loop, and it looks like this (with the iteration count hardwired to 1,000 here, although that number varies across tests):

void innerProduct2(const int  threadID,
                   const long start,
                   const long end,
                   double* a,
                   double* b,
                   std::vector<double>& innerProductResult
                   )
{
    double res = 0.0;
              
    for (int j = 0; j < 1000; j++) //the (new) outer for  
        for (long i = start; i < end; i++) 
            res += a[i] * b[i];
              
    innerProductResult[threadID] = res;
}    

Note that the local "res" variable in the above slice is used to avoid thrashing (false sharing) of cache lines in the L1 cache.
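For completeness, an alternative to the local accumulator (a sketch of the same idea, not code I benchmarked) is to pad each thread's result slot so that no two slots can share a cache line; direct writes to the result vector then cannot cause false sharing. The 64-byte line size is an assumption that holds on current Intel parts:

#include <vector>

//One padded accumulator per thread: alignas(64) forces each slot onto its own cache line.
//(64 bytes assumed; std::hardware_destructive_interference_size could be used where available.)
struct alignas(64) PaddedDouble
{
    double value = 0.0;
};

void innerProductPadded(const int  threadID,
                        const long start,
                        const long end,
                        const double* a,
                        const double* b,
                        std::vector<PaddedDouble>& partial)
{
    //Neighbouring threads now write to different cache lines, so there is no false sharing
    //even though each partial result is updated directly inside the loop.
    for (long i = start; i < end; i++)
        partial[threadID].value += a[i] * b[i];
}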

And when using OpenMP, the modified slice looks like this:

omp_set_dynamic(0);
omp_set_num_threads(numThreadsUsed);
                    
double result = 0;
#pragma omp parallel
{
    #pragma omp for reduction(+:result)
              
    for (int j = 0; j < 1000; j++) //the (new) for loop 
        for (int i = 0; i < DIMENSION; i++) 
            result += a[i] * b[i];                  
    
}    

In the graph below of Timing Benchmarks, some of the curves use std::thread() denoted by "THD" in the legend, and others use OpenMP, indicated by "OMP" in the legend. The number of times the outer FOR loop was executed is also indicated in the legend: "x10^2 OMP" means OpenMP was used and the dot product (for each thread) was run a total of 100 times before returning its sub-result.

FIGURE 1:

(image: benchmark parallel timings of the dot product)

The curves are normalized as follows: the points on any given THD curve are normalized by that curve's own single-thread time, so each THD curve (and the theoretical 1/N curve) starts at the point (1,1).

To compare OMP (OpenMP) curves to THD curves, each OMP curve is normalized by the single thread value of the corresponding THD curve. So the time measurements on the "x1 OMP" curve are normalized by the single thread time of the "x1 THD" curve, and the "x10 OMP" points are normalized by the single thread time of the "x10 THD" curve, etc. Some curves therefore don't begin exactly at (1,1). This keeps the plateaus between corresponding THD and OMP curves comparable to one another and avoids the bias that would otherwise arise from normalizing with the slightly different respective single thread times they have. (It turns out not to make much of a difference.)
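In symbols, writing t_THD(N) and t_OMP(N) for the raw times at N threads for a given repetition factor (x1, x10, ...), the plotted values are t_THD(N) / t_THD(1) for the THD curves and t_OMP(N) / t_THD(1) for the OMP curves, always dividing by the single-thread THD time of the matching repetition factor.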

The top curve is OpenMP with no extra iterations (x1 OMP). It shows no scaling and matches the behavior I noted in the post above when operating on vectors of length 10^9. For this case (x1 OMP), overhead swamps the scaling because the calculation on each thread is too short. Only after the runtime is artificially inflated with the outer FOR loop does the (approximate) 1/N scaling start to emerge; those OMP curves sit just above the theoretical 1/N curve. This puts my earlier observation in context and makes better sense. OpenMP ultimately produces a small shelf, or plateau, above the 1/N curve as N increases.
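To get a feel for the size of such fixed per-launch costs, one can time the launch and join of threads that do nothing (a rough sketch using std::thread, where this overhead is easy to isolate; OpenMP's team startup is an analogous cost, though the runtime typically keeps its thread team alive between parallel regions). The absolute numbers depend entirely on the OS and machine:

#include <thread>
#include <vector>
#include <chrono>
#include <iostream>

int main()
{
    using std::chrono::steady_clock;
    using std::chrono::duration;

    const int numThreads = 8;
    const auto t1 = steady_clock::now();

    //Launch and immediately join threads with empty bodies: what remains is pure
    //creation/scheduling/join overhead, which is paid on every timed iteration.
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; i++)
        threads.emplace_back([] {});
    for (auto& t : threads)
        t.join();

    const auto t2 = steady_clock::now();
    std::cout << "launch+join of " << numThreads << " empty threads: "
              << duration<double, std::milli>(t2 - t1).count() << " ms\n";
    return 0;
}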

The THD curves are still a bit annoying, though, because of their elevated plateau, which I also saw in the preliminary tests in my prior post. As with the OMP curves, the shapes of the THD curves smooth out and stabilize as the number of outer FOR iterations increases, but their plateau remains almost triple that of the OMP curves. What can be said, however, is that the plateau does not arise from a nonlocal variable in the functor inducing cache-line thrashing in L1.

However, additional insight is gained if the outer FOR loop in the OpenMP code slice is moved so that the slice now looks like this:

omp_set_dynamic(0);
omp_set_num_threads(numThreadsUsed);
                    
double result = 0;

for (int j = 0; j < 1000; j++) //MOVED for loop
{
    #pragma omp parallel
    {
        #pragma omp for reduction(+:result)
              
        //for (int j = 0; j < 1000; j++) //the (new) for loop 
        for (int i = 0; i < DIMENSION; i++) 
            result += a[i] * b[i];                  
     } 
}    

The above slice runs 1,000 parallelized dot product calculations in a row, rather than parallelizing 1,000 repeated dot products. Here are the results:

FIGURE 2:

(image not reproduced here)

The lowest curve is the OpenMP calculation before the FOR loop was moved, and it corresponds to the lower plateau in FIGURE 1. After the FOR loop is moved, the OpenMP calculation exhibits the same plateau as the curve generated with std::thread().

This suggests that the extra overhead introduced by moving the FOR loop is of the same type as the overhead present in the std::thread() approach. Even though no non-local variable causes excessive L1 cache-line thrashing in the std::thread() functor, there may still be a good deal of cache refreshing: each pass of the outer FOR loop walks the thread's entire slice of the vectors, so the data cached during one pass has been evicted before the next pass revisits it. Could that be enough overhead to explain the std::thread() plateau?

To find out, interchange the order of the FOR loops in the functor. This shows to what degree the extra refreshing matters, because with the loops swapped the artificial repetitions are performed on one pair of elements before the cache must be updated for the next vector index, greatly reducing the number of cache updates. The modified functor with the interchanged FOR loops looks like this:

void innerProduct2(const int  threadID,
                   const long start,
                   const long end,
                   double* a,
                   double* b,
                   std::vector<double>& innerProductResult
                   )
{
    double res = 0.0;

    for (long i = start; i < end; i++)  //the index loop is now on the OUTSIDE
        for (int j = 0; j < 1000; j++)  //the artificial repetition loop is now on the INSIDE
            res += a[i] * b[i];

    innerProductResult[threadID] = res;
}

And FIGURE 3 includes the output of this modified functor slice along with the same information in FIGURE 2 for comparison:

FIGURE 3:

(image not reproduced here)

That did the trick. The timings for the updated functor appear as the lowest curve in the graph. The plateau is gone.

Conclusions:

  1. The original plateaus I observed (in the prior posts) were caused by excessively short runtimes on each thread, both with OpenMP and with std::thread(). They were not (in this case) caused by cache-line thrashing in L1.

  2. To investigate the plateaus, the execution time on each thread was lengthened for both libraries with outer FOR loops. OpenMP handled this well, but std::thread(), despite using a local accumulator variable, did not: the outer FOR loop in the functor repeated the expected cache updates each time the vector indices were traversed, and these repeated updates were still significant enough to produce a noticeable plateau (although this did not become apparent until (3) below).

  3. Switching the FOR loops in the OpenMP version made execution less efficient and reproduced the plateau generated by std::thread(), which hinted that the remaining cache updates in the std::thread() version might be the source of the plateau. That prompted switching the FOR loops in the std::thread() functor as well, and doing so removed the plateau.

  4. It's interesting that OpenMP handled the outer FOR loop very well without my having to interchange the FORs, as I needed to do in the functor slice. Any insight on this from anyone?

Understanding what is going on with the L1 cache is essential to successful parallelization. Throwing threads at a problem is not the way forward. By the way, I did not refer in this post to all the side streets I investigated regarding CPU cache management that everyone's suggestions pointed me to, so thank you very much! In the end, I hope that some found this investigation useful and not too basic.