I'd greatly appreciate anyone's insight on this question. My machine has 16 cores (Intel i9-12900HX) and I want to better understand how to directly leverage them.
To do this, I ran a simple experiment that computes the inner product of two 1-D vectors of doubles, where each vector has 10^9 elements.
First, I compute the inner product with a simple serial loop, and then I compute it in parallel using threads: first with 1 thread, then with 2 threads, and so on up to and including 8 threads.
Here are the results from one run of the calculations; the numbers are similar from run to run (the self-contained code is below):
N Threads   Plel DotProd   Plel Time(ms)   Series DotProd   Series Time(ms)
    1       166666666.67      714.55        166666666.67        694.82
    2       166666666.67      416.97
    3       166666666.67      387.20
    4       166666666.67      316.38
    5       166666666.67      315.09
    6       166666666.67      293.89
    7       166666666.67      286.43
    8       166666666.67      277.68
As a check, notice that all the dot product results agree, and the duration of the series calculation (top row, far right) is in line with the duration of the "parallel" calculation using only 1 thread (top row, middle). But for the parallel calculations in general, the speedup looks reasonable going from 1 to 2 threads, is weaker than I expected going from 2 to 3 or from 3 to 4, and there is no material benefit beyond 4 threads.
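In fact, the value itself can be verified analytically: with a[i] = i/N and b[i] = 1 - i/N, the sum is a Riemann approximation of an integral,

sum_{i=0}^{N-1} (i/N)(1 - i/N) ≈ N * ∫₀¹ x(1 - x) dx = N/6 ≈ 166,666,666.67 for N = 10^9,

which matches every entry in the table.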
Are all these threads running on only 3 (maybe 4) of the 16 cores? Is it possible to know how many cores are actually being used, or to guarantee how many cores will be used in a calculation? Does the problem lie elsewhere? Have I not accounted for something? Thanks very much in advance for anyone's thoughts. Here's the code (it takes about 6 seconds to run on my ThinkPad):
#include <vector>
#include <thread>
#include <chrono>
#include <iostream>
#include <iomanip>

using std::chrono::high_resolution_clock;
using std::chrono::duration;

// The function being parallelized
void innerProduct(const int threadID,
                  const long start,
                  const long end,
                  const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& innerProductResult // ith element holds result from ith thread
                 )
{
    innerProductResult[threadID] = 0.0;
    for (long i = start; i < end; i++)
        innerProductResult[threadID] += a[i] * b[i];
}

int main(void)
{
    // Create and populate
    const int DIMENSION = 1'000'000'000;
    std::vector<double> a(DIMENSION);
    std::vector<double> b(DIMENSION);

    // Use values that won't explode the result or invoke any multiplication optimizations
    for (int i = 0; i < DIMENSION; i++)
    {
        double arg = i / (double)DIMENSION;
        a[i] = arg;
        b[i] = 1.0 - arg;
    }

    // Compute inner product in SERIES
    double seriesInnerProduct = 0.0;
    auto t1 = high_resolution_clock::now();
    for (int i = 0; i < DIMENSION; i++)
        seriesInnerProduct += a[i] * b[i];
    auto t2 = high_resolution_clock::now();
    duration<double, std::milli> millisecs_duration = t2 - t1; // milliseconds as double
    const double seriesTime = millisecs_duration.count();

    // Compute inner product in PARALLEL
    const int MAX_NUMBER_OF_THREADS = 8;
    std::vector<double> parallelTimings(MAX_NUMBER_OF_THREADS);
    std::vector<double> parallelInnerProducts(MAX_NUMBER_OF_THREADS, 0.0); // initialize elements to zero

    // Compute the inner product using different numbers of threads
    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        // Create batches for subjobs
        const long batch_size = DIMENSION / numThreadsUsed;                   // but there may be a remainder, so...
        const long sizeOfLastBatch = batch_size + DIMENSION % numThreadsUsed; // include any remainder
        std::vector<double> innerProductByThread(numThreadsUsed);
        std::vector<std::thread> threads(numThreadsUsed);

        t1 = high_resolution_clock::now();
        // Spread the work over the number of threads being used
        for (int i = 0; i < numThreadsUsed; i++)
        {
            long start = i * batch_size;
            long end = start + (i < numThreadsUsed - 1 ? batch_size : sizeOfLastBatch);
            threads[i] = std::thread(innerProduct, i, start, end, std::ref(a), std::ref(b), std::ref(innerProductByThread));
        }
        // Wait for everyone to finish
        for (int i = 0; i < numThreadsUsed; i++)
            threads[i].join();
        // Aggregate the (sub) inner product results from the threads used
        for (int i = 0; i < numThreadsUsed; i++)
            parallelInnerProducts[numThreadsUsed - 1] += innerProductByThread[i];
        t2 = high_resolution_clock::now();
        millisecs_duration = t2 - t1;
        parallelTimings[numThreadsUsed - 1] = millisecs_duration.count();
    }

    // Output to screen
    std::cout << "  Num        Plel       Plel      Series    Series " << std::endl;
    std::cout << "Threads     DotProd   Time(ms)   DotProd   Time(ms)" << std::endl << std::endl;
    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        std::cout << std::setw(5) << numThreadsUsed
                  << std::fixed << std::setw(14) << std::setprecision(2) << parallelInnerProducts[numThreadsUsed - 1]
                  << std::fixed << std::setw(9) << std::setprecision(2) << parallelTimings[numThreadsUsed - 1];
        if (numThreadsUsed == 1)
            std::cout << std::fixed << std::setw(13) << std::setprecision(2) << seriesInnerProduct
                      << std::setw(8) << std::setprecision(2) << seriesTime;
        std::cout << std::endl;
    }

    std::cout << std::endl << "Press Enter to exit...";
    std::cin.get();
    return 0;
}
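For what it's worth, the standard library can at least report how many concurrent threads the implementation supports (an upper bound, not the number of cores a given run actually uses). A minimal query looks like this:

#include <iostream>
#include <thread>

int main()
{
    // Number of concurrent threads the implementation supports;
    // returns 0 if the value is not computable.
    std::cout << "hardware_concurrency() = "
              << std::thread::hardware_concurrency() << std::endl;
    return 0;
}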
Thanks (@Stephen Newell) for the link to the video for background on cache lines (great video!), and to everyone else for guidance on thrashing of cache lines, OpenMP, and comments on the timing.
After further investigation using these suggestions, I believe the scaling is being hampered by some other constraint I have yet to identify. Let me first describe my next set of findings, followed by tables and code examples below.
(1) Prompted by the comments, I modified the parallelized function to accumulate the inner product in a local variable, to prevent cache lines from repeatedly becoming dirty and thrashing, which presumably was happening because each thread was updating adjacent elements of a shared result vector directly. But I saw no change in the timing results (see results and code below).
(2) In my original code example above, the vectors in the inner product were passed to the parallelized function as std::vector objects. Their elements are only read from, so they should not cause any cache-line thrashing. Nevertheless, I changed them to double* arrays allocated on the heap, just to see whether the vector objects carried unnecessary overhead for some reason. As suspected, this made no difference in the timings.
(3) I switched to OpenMP, which avoids having to write a separate worker function to parallelize over. I still see the same timings.
Conclusion: From (1), (2) and (3), I see no major change in the time scaling relative to the original code provided above. The dot product results themselves all remain correct and consistent.
Table 2 below gives the timings and results (still using std::thread; code given below) with the local accumulator in the parallelized function (and double* instead of std::vector for the dot product operands):
TABLE 2 (Direct use of std::thread and local accumulator):
N Threads   Plel DotProd   Plel Time(ms)   Series DotProd   Series Time(ms)
Note that the scaling pattern is very similar to that of the original version of the code (above).
And here are the timing results when using OpenMP (code provided below):
TABLE 3 (Uses OpenMP):
N Threads   Plel DotProd   Plel Time(ms)
Again, the time scaling is more or less unchanged.
Whatever is going wrong is likely the same in both approaches.
And below are the two respective self-contained code listings that produce Tables 2 and 3 above:
CODE FOR TABLE 2:
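(The listing below is a minimal sketch consistent with the changes described in (1) and (2), not the verbatim original; names like innerProductLocal and partials are illustrative, and the thread-count loop from the original is reduced to a fixed count for brevity.)

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Worker per point (1): accumulate into a LOCAL variable and write the
// partial result once at the end, so neighboring threads do not repeatedly
// dirty the same cache line in the shared results vector.
void innerProductLocal(int threadID, long start, long end,
                       const double* a, const double* b,
                       std::vector<double>& partials)
{
    double localSum = 0.0;
    for (long i = start; i < end; i++)
        localSum += a[i] * b[i];
    partials[threadID] = localSum; // single write per thread
}

int main()
{
    const long N = 1'000'000'000L;
    double* a = new double[N]; // raw heap arrays per point (2)
    double* b = new double[N];
    for (long i = 0; i < N; i++)
    {
        double arg = i / (double)N;
        a[i] = arg;
        b[i] = 1.0 - arg;
    }

    const int numThreads = 8; // the full version loops this from 1 to 8
    std::vector<double> partials(numThreads, 0.0);
    std::vector<std::thread> threads;
    const long batch = N / numThreads;

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < numThreads; t++)
    {
        long start = t * batch;
        long end = (t == numThreads - 1) ? N : start + batch;
        threads.emplace_back(innerProductLocal, t, start, end, a, b,
                             std::ref(partials));
    }
    for (auto& th : threads)
        th.join();

    double dot = 0.0;
    for (double p : partials)
        dot += p;
    auto t2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> ms = t2 - t1;

    std::cout << dot << "  " << ms.count() << " ms" << std::endl;
    delete[] a;
    delete[] b;
    return 0;
}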
CODE FOR TABLE 3 (uses OpenMP):
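(Again a minimal sketch rather than the verbatim listing, assuming a plain parallel-for reduction; compile with OpenMP enabled, e.g. -fopenmp or /openmp.)

#include <iostream>
#include <omp.h>
#include <vector>

int main()
{
    const long N = 1'000'000'000L;
    std::vector<double> a(N), b(N);
    for (long i = 0; i < N; i++)
    {
        double arg = i / (double)N;
        a[i] = arg;
        b[i] = 1.0 - arg;
    }

    for (int numThreads = 1; numThreads <= 8; numThreads++)
    {
        double dot = 0.0;
        double t1 = omp_get_wtime();
        // OpenMP splits the iterations across numThreads threads and
        // combines the per-thread partial sums via the reduction clause.
        #pragma omp parallel for num_threads(numThreads) reduction(+:dot)
        for (long i = 0; i < N; i++)
            dot += a[i] * b[i];
        double t2 = omp_get_wtime();

        std::cout << numThreads << "  " << dot << "  "
                  << (t2 - t1) * 1000.0 << " ms" << std::endl;
    }
    return 0;
}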
Lastly, using the OpenMP code above but reducing the length of the vectors from 10^9 to 10^6 gives the following timings:
N Threads   Plel DotProd   Plel Time(ms)
The confusing thing here is that there is not much within one's control beyond invoking OpenMP on a simple loop, yet as the thread count increases, the timings first decrease (more slowly than expected) and then increase!
I'm not sure what OpenMP is doing under the hood, and the only thing I can imagine doing incorrectly is somehow mishandling the data itself (unless I've missed something?), but what to change currently escapes me. I used omp_get_thread_num() inside the outer pragma to verify that the number of threads used in each loop is correct, but I commented that call out in the example above so as not to bias the timing.
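(That check was along the lines of the following sketch; the num_threads value is illustrative:)

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel num_threads(4)
    {
        // Each thread reports its id and the team size, confirming
        // how many threads actually ran.
        std::printf("thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}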
I will begin profiling, but in the meantime any additional thoughts are more than welcome!