#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;
#define PAD 8            /* 8 doubles = 64 bytes = one cache line per thread's slot */
#define NUM_THREADS 6

int main(void)
{
    int i, nthreads;
    double pi = 0.0, sum[NUM_THREADS][PAD] = {{0}};

    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    // Start timer
    double time_start = omp_get_wtime();

    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x);
        }
    }

    // Combine the per-thread partial sums
    for (i = 0; i < nthreads; i++) pi += sum[i][0] * step;

    // Stop timer
    double time_end = omp_get_wtime();
    double timepass = time_end - time_start;

    // Number of threads used in this run
    printf("Integration Program runs with %d threads\n", nthreads);
    // Result of the integral
    printf("Integration Result: %lf\n", pi);
    // Wall-clock time
    printf("%lf Time passed for Integration...\n", timepass);
    // Effective total time (wall clock x threads)
    printf("Effective Total Time: %lf\n\n", timepass * nthreads);

    return 0;
}
This snippet of code is taken from an OpenMP tutorial by Tim Mattson. It integrates the function 4.0/(1+x*x), holding each partial result in a 2D array named sum so that the per-thread accumulators are padded apart. I use a Linux machine and have checked that I have the standard 64-byte cache lines on L1, L2, and L3. I compiled with gcc with no optimizations and expected the runtime to decrease as I add threads. This is what I got for the runtime:
1 thread: 0.356362
2 threads: 0.541903
3 threads: 0.416097
4 threads: 0.346139
5 threads: 0.286879
6 threads: 0.315139
It seems that false sharing still occurs even with the padding, and I am confused why. I have increased the padding to larger sizes and the scalability is similarly poor. The only thing that seems to fix the poor scalability is turning on compiler optimizations; even -O1 makes the code scale well. I am not sure why that is the case.
TL;DR: compiler optimizations and hyper-threading play a huge role in the observed effect. Frequency scaling can impact the scalability too. In fact, the provided results are not sufficient evidence to claim that false sharing is the main issue.
Compiler optimizations
First of all, optimizations have a huge impact on this benchmark since they prevent any false sharing effect from appearing. Indeed, with -O1, GCC 12 is able to keep many variables in registers (but not sum). With -O2 and -O3, GCC 12 is able to keep the sum array entirely in registers, so no false sharing effect can be observed at all. This is why optimizations must be disabled so as not to bias the benchmark. Alternatively, one can use the volatile keyword to prevent the compiler from optimizing the memory accesses away (so that optimizations can stay enabled).
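As an illustration of that alternative (not taken from the original code), only the changed declaration is shown below; the rest of the program from the question is assumed to stay the same:

    /* volatile: every read/write of sum must actually be performed in memory,
     * so the compiler cannot keep sum[id][0] in a register even at -O2/-O3,
     * and the false sharing experiment stays meaningful with optimizations on. */
    volatile double sum[NUM_THREADS][PAD] = {{0}};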
Comparing the assembly of the hot loop generated by GCC 12.1 at -O0, -O1 and -O2 shows the difference clearly: at -O2, no load/store instructions are left in the hot computing loop. This can also be checked on Godbolt. Results may change from one version of GCC to another.
Hyper-threading
Regarding the effect of the number of threads on performance, I am not able to reproduce the problem on my i5-9600KF processor: I see no significant effect of false sharing. More precisely, the value of timepass is about 5.5x~5.6x smaller with 6 threads on 6 cores, which is very good (more on this later). This processor has the same architecture as your i7-8750H: both are Intel Coffee Lake processors (though mine is a "Refresh"). Thus, the behaviour of the cores should be exactly the same on this benchmark. The layout of the cores might differ, but the two processors have the same number of cores (6) and AFAIK there is no change in the core layout between the two (at least based on the information provided by Intel). The major difference is that i7 processors have Hyper-Threading while i5 processors do not. This is certainly why the results are so different on your machine. In fact, your results are very unstable even with the same number of threads and the same PAD value, which means the execution context plays a huge role in the performance results. I think two threads are sometimes bound to the same physical core, resulting in a much slower execution: up to 2 times slower in the worst case, since threads on the same core share part of its resources.
To check this hypothesis, you need to force each thread to be bound to a different core. This can be done with the OMP_PROC_BIND and OMP_PLACES environment variables. The hwloc-ls and hwloc-ps tools can be used to check the layout of the logical cores and the binding of the application threads to logical/physical cores, and hwloc-calc can be used to script the binding. In practice, a small Bash script that exports OMP_PLACES=cores and OMP_PROC_BIND=true before launching the program gives a better thread binding.
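A quick way to check the binding from inside the program is to print which logical CPU each OpenMP thread runs on. Here is a minimal sketch (this helper is not part of the original code; sched_getcpu() is Linux/glibc-specific and requires _GNU_SOURCE):

#define _GNU_SOURCE          /* needed for sched_getcpu() */
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Run with e.g. OMP_PLACES=cores OMP_PROC_BIND=true and compare the
     * reported logical CPUs with the topology shown by hwloc-ls to see
     * whether two threads ended up on the same physical core. */
    #pragma omp parallel
    {
        printf("thread %d runs on logical CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}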
Frequency scaling
Note that Intel processors use frequency scaling to adapt the frequency to the number of working threads and to what they actually do (e.g. wide SIMD instructions like AVX cause a lower frequency to be used). Intel does that so the overall processor package does not exceed its power budget (to limit power and thermal issues). For example, on my processor, 1 core operates at 4.5 GHz while 6 cores operate at 4.3 GHz in practice on your benchmark. This slightly impacts the scalability of your code since using more cores makes each of them run a bit slower. AFAIK, this is especially true on energy-efficient processors like yours: the H class means "high-performance optimized for mobile", and such processors have tighter thermal limits than high-performance desktop processors like mine. Additionally, mine is a "Refresh" Coffee Lake, which also affects the thermal throttling of the processor (Refresh processors handle it better than non-Refresh ones like yours). Still, I expect the effect of thermal throttling to be relatively small and not the main issue, though it plays a role in the resulting scalability.
Better benchmarking with performance counters
Since the timing can be affected by effects other than false sharing, it is wise to take a more scientific approach than simply analysing the execution time and guessing the probable cause. More specifically, if false sharing is responsible for the biggest part of the time, the cache should be impacted: a cache-line bouncing effect should be visible. x86-64 processors have hardware performance counters to monitor such an effect. Interpreting them requires a good understanding of a cache coherence protocol like MESI or MOESI. I expect the number of Request For Ownership (RFO) operations between cores to increase sharply if some false sharing is happening. This metric can be observed with perf on Linux (or with Intel VTune). I think the hardware counter l2_rqsts.all_rfo should be the right one to check the effect on your processor. On my machine, I confirm that this metric is more than 10 times bigger when there are false sharing issues (e.g. when PAD is small and the program scales poorly).
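For example, assuming the program was compiled into a binary named pi (a placeholder name), the counter can be collected with perf stat and compared between a run with a small PAD and a run with PAD set to 8:

perf stat -e l2_rqsts.all_rfo ./pi

A sharp jump in the RFO count for the small-PAD run, together with the poor scaling, would be much stronger evidence of false sharing than the timings alone.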