Array Padding does not mitigate false sharing? C, OpenMP

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;
#define PAD 8
#define NUM_THREADS 6

int main(void){
    int i, nthreads;
    double pi = 0, sum[NUM_THREADS][PAD] = {{0}};
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    // Start timer
    double time_start = omp_get_wtime();

    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum[id][0] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 0; i < nthreads; i++) pi += sum[i][0]*step;

    // End timer
    double time_end = omp_get_wtime();
    double timepass = time_end - time_start;

    // Report how many threads were used
    printf("Integration Program runs with %d threads\n", nthreads);

    // Print the result of the integral
    printf("Integration Result: %lf\n", pi);

    // Print the elapsed wall-clock time
    printf("%lf Time passed for Integration...\n", timepass);

    // Print the effective total time (wall time x threads)
    printf("Effective Total Time: %lf\n\n", timepass*nthreads);

    return 0;
}

This snippet of code is taken from an OpenMP tutorial by Tim Mattson. It integrates the function 4.0/(1+x*x) and holds each thread's partial result in a 2D array named sum, padded so that each thread's accumulator sits on its own cache line. I use a Linux machine and have checked that L1, L2, and L3 all use the standard 64-byte cache lines. I compiled with gcc and no optimizations, expecting the runtime to decrease as I added threads. This is what I got for the runtime:

1 threads: 0.356362

2 threads: 0.541903

3 threads: 0.416097

4 threads: 0.346139

5 threads: 0.286879

6 threads: 0.315139

It seems that false sharing still occurs even with the padding, and I am confused as to why. I have increased the padding to larger sizes and the scalability is similarly poor. The only thing that fixes the poor scalability is turning on compiler optimizations; even -O1 makes the code scale well. I am not sure why that is, though.

There are two answers below. The first:

TL;DR: compiler optimizations and hyper-threading play a huge role in the observed effect. Frequency scaling can impact the scalability too. In fact, the provided results are not sufficient evidence to claim that false sharing is the main issue.


Compiler optimizations

First of all, optimizations have a huge impact on this benchmark since they can remove the false sharing effect entirely. Indeed, with -O1, GCC 12 is able to keep many variables in registers (but not sum). With -O2 and -O3, GCC 12 keeps the sum accumulator in a register for the whole loop, so no false sharing effect can be observed at all. This is why optimizations must be disabled so as not to bias this benchmark. Alternatively, one can use the volatile keyword to prevent the compiler from optimizing the memory accesses away (and thus keep optimizations enabled).
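
For example, here is a minimal sketch of that volatile variant (my own illustration, not code from the question or the tutorial): the volatile qualifier forces every partial-sum update to go through memory, so the cache-line behaviour stays observable even when compiling with -O2.

#include <stdio.h>
#include <omp.h>

#define PAD 8
#define NUM_THREADS 6
static long num_steps = 100000000;

int main(void){
    /* volatile keeps the per-thread accumulators in memory even at -O2,
       so the effect of the padding (or lack of it) stays measurable. */
    static volatile double sum[NUM_THREADS][PAD];
    double step = 1.0/(double) num_steps, pi = 0;

    omp_set_num_threads(NUM_THREADS);
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        for (long i = id; i < num_steps; i += nthrds) {
            double x = (i + 0.5)*step;
            sum[id][0] += 4.0/(1.0 + x*x);   /* load + store every iteration */
        }
    }
    for (int i = 0; i < NUM_THREADS; i++) pi += sum[i][0]*step;
    printf("pi = %f, time = %f s\n", pi, omp_get_wtime() - t0);
    return 0;
}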

Here is the assembly code of the hot loop in -O0 with GCC 12.1:

.L8:
        mov     eax, DWORD PTR [rbp-4]
        movsx   rdx, eax
        mov     rax, QWORD PTR num_steps[rip]
        cmp     rdx, rax
        jge     .L11
        pxor    xmm1, xmm1
        cvtsi2sd        xmm1, DWORD PTR [rbp-4]
        movsd   xmm0, QWORD PTR .LC6[rip]
        addsd   xmm1, xmm0
        movsd   xmm0, QWORD PTR step[rip]
        mulsd   xmm0, xmm1
        movsd   QWORD PTR [rbp-24], xmm0
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax]
        mov     edx, DWORD PTR [rbp-8]
        movsx   rdx, edx
        sal     rdx, 6
        add     rax, rdx
        movsd   xmm1, QWORD PTR [rax]
        movsd   xmm0, QWORD PTR [rbp-24]
        movapd  xmm2, xmm0
        mulsd   xmm2, xmm0
        movsd   xmm0, QWORD PTR .LC1[rip]
        addsd   xmm2, xmm0
        movsd   xmm0, QWORD PTR .LC7[rip]
        divsd   xmm0, xmm2
        addsd   xmm0, xmm1
        mov     rax, QWORD PTR [rbp-40]
        mov     rax, QWORD PTR [rax]
        mov     edx, DWORD PTR [rbp-8]
        movsx   rdx, edx
        sal     rdx, 6
        add     rax, rdx
        movsd   QWORD PTR [rax], xmm0
        mov     eax, DWORD PTR [rbp-12]
        add     DWORD PTR [rbp-4], eax
        jmp     .L8

Here is the same code and the same compiler with -O1:

.L4:
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, edx
        addsd   xmm0, xmm4
        mulsd   xmm0, QWORD PTR step[rip]
        mulsd   xmm0, xmm0
        addsd   xmm0, xmm3
        movapd  xmm1, xmm2
        divsd   xmm1, xmm0
        addsd   xmm1, QWORD PTR [rcx]
        movsd   QWORD PTR [rcx], xmm1
        add     edx, eax
        cmp     edx, 99999999
        jle     .L4

Here is the same code and the same compiler with -O2:

.L4:
        pxor    xmm0, xmm0
        movapd  xmm2, xmm3
        cvtsi2sd        xmm0, edx
        add     edx, eax
        addsd   xmm0, xmm5
        mulsd   xmm0, xmm6
        mulsd   xmm0, xmm0
        addsd   xmm0, xmm4
        divsd   xmm2, xmm0
        addsd   xmm1, xmm2
        cmp     edx, 99999999
        jle     .L4

One can see that no load/store operations remain in the hot loop at -O2 with GCC 12: the partial sum is accumulated in the xmm1 register inside the loop. This can also be seen on Godbolt. Results may change from one GCC version to another.


Hyper-threading

Regarding the effect of the thread count on performance, I am not able to reproduce the problem on my i5-9600KF processor: I see no significant effect of false sharing. More precisely, the value of timepass is about 5.5x~5.6x smaller with 6 threads (on 6 cores), which is very good -- see later. This processor has the same architecture as your i7-8750H: both are Intel Coffee Lake processors (though mine is a "Refresh"). Thus, the behaviour of the cores should be essentially the same on this benchmark. The layout of the cores might differ, but the two processors have the same number of cores (6) and AFAIK there is no change in the core layout between the two (at least based on the information provided by Intel). The major difference is that i7 processors have Hyper-Threading while i5 processors do not. This is almost certainly why the results are so different on your processor. In fact, your results are very unstable even with the same number of threads and the same PAD value, which means the execution context plays a huge role in the performance results. I think two threads are sometimes bound to the same physical core, resulting in a much slower execution: up to 2x slower in the worst case, since two hyper-threads on the same core have to share that core's resources.

To check this hypothesis, you need to force each thread to be bound to a different physical core. This can be done using the OMP_PROC_BIND and OMP_PLACES environment variables. You can use the hwloc-ls and hwloc-ps tools to check the layout of the logical cores and the binding of the application's threads to logical/physical cores. hwloc-calc can be used to script the binding.

In practice, you can use the following Bash script to run your program with a better thread binding:

# Bind each thread to the logical core 0 of each physical core
export OMP_PROC_BIND=TRUE
export OMP_PLACES={$(hwloc-calc --li --po -I PU CORE:all.PU:0 --sep "},{")}
export OMP_DISPLAY_ENV=TRUE
./your_program
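
To double-check the placement from inside the program itself, a small sketch like the one below (my own addition, assuming glibc's sched_getcpu on Linux) prints which logical CPU each OpenMP thread lands on; comparing the printed numbers with the hwloc-ls output shows whether two threads ended up as hyper-thread siblings of the same physical core.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu (glibc extension) */
#include <omp.h>

int main(void){
    #pragma omp parallel
    {
        /* Each thread reports the logical CPU it is currently running on. */
        #pragma omp critical
        printf("thread %d runs on logical CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}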

Frequency scaling

Note that Intel processors use frequency scaling to adapt the clock frequency to the number of working threads and to what they actually execute (e.g. wide SIMD instructions like AVX ones cause a lower frequency to be used). Intel does that so the overall processor package does not exceed its power budget (to limit power and thermal issues). For example, on my processor, 1 core operates at 4.5 GHz while 6 cores operate at 4.3 GHz in practice on your benchmark. This impacts the scalability of your code a bit, since using more cores makes each of them run slightly slower. AFAIK, this is especially true on energy-efficient processors like yours. Indeed, the H class means "high-performance optimized for mobile", and such processors have tighter thermal limits than high-performance desktop processors like mine. Additionally, I have a "Refresh" Coffee Lake part, which also affects thermal throttling (they behave better than non-"Refresh" processors like yours). To quote Wikipedia:

On October 8, 2018, Intel announced what it branded its ninth generation of Core processors, the Coffee Lake Refresh family. To avoid running into thermal problems at high clock speeds, Intel soldered the integrated heat spreader (IHS) to the CPU die instead of using thermal paste as on the Coffee Lake processors.

Still, I expect the effect of thermal throttling to be relatively small and not the main issue, though it does play a role in the resulting scalability.
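
If you want to see whether frequency scaling kicks in during the run, a minimal sketch like the one below (my own addition, assuming the standard Linux cpufreq sysfs interface is available on your machine) reads the current frequency of CPU 0; calling the helper before, during and after the parallel region, or simply watching a tool like turbostat, shows how much the clock drops under multi-threaded load.

#include <stdio.h>

/* Reads /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq (in kHz).
   Returns -1 if the cpufreq interface is not available. */
static long read_cpu0_khz(void){
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    if (!f) return -1;
    long khz = -1;
    if (fscanf(f, "%ld", &khz) != 1) khz = -1;
    fclose(f);
    return khz;
}

int main(void){
    printf("cpu0 frequency: %ld kHz\n", read_cpu0_khz());
    return 0;
}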


Better benchmarking with performance counters

Since the timing can be affected by effects other than false sharing, it is wise to take a more scientific approach than simply analysing the execution time and guessing the probable cause. More specifically, if false sharing were responsible for the largest part of the time, the caches would be affected: a cache-line bouncing effect should be visible. x86-64 processors have hardware performance counters to monitor such an effect. Interpreting them requires a good understanding of cache coherence protocols like MESI or MOESI. I expect the number of Request For Ownership (RFO) operations between cores to increase sharply if some false sharing is happening. This metric can be observed with perf on Linux (or Intel VTune); I think the hardware counter l2_rqsts.all_rfo is the right one to check on your processor. On my machine, I confirm this metric is >10 times bigger when there is a false sharing issue (e.g. when PAD is small and the program scales poorly).

The second answer:

I wonder if the story about false sharing needs to be revisited. I've adapted the code to

#ifndef PAD
#define PAD 8
#endif

#ifndef NTHREADS
#define NTHREADS 6
#endif

int main(){
  int i, nthreads; double pi=0, sum[NTHREADS][PAD]={0};
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NTHREADS);

also:

  printf("Integration Program runs with %d threads, padding=%d\n", nthreads,PAD);

so that I can run a quick shell loop:

for p in 1 2 3 4 5 6 7 8 ; do
    ## compile with -DPAD=$p -DNTHREADS=whatever, then run
done

and this is what I get:

Integration Program runs with 56 threads, padding=1
Integration Result: 3.141593
0.006488 Time passed for Integration...
Effective Total Time: 0.363319

Integration Program runs with 56 threads, padding=2
Integration Result: 3.141593
0.006484 Time passed for Integration...
Effective Total Time: 0.363106

Integration Program runs with 56 threads, padding=3
Integration Result: 3.141593
0.006213 Time passed for Integration...
Effective Total Time: 0.347925

Integration Program runs with 56 threads, padding=4
Integration Result: 3.141593
0.006125 Time passed for Integration...
Effective Total Time: 0.342999

Integration Program runs with 56 threads, padding=5
Integration Result: 3.141593
0.006641 Time passed for Integration...
Effective Total Time: 0.371904

Integration Program runs with 56 threads, padding=6
Integration Result: 3.141593
0.006988 Time passed for Integration...
Effective Total Time: 0.391317

Integration Program runs with 56 threads, padding=7
Integration Result: 3.141593
0.006617 Time passed for Integration...
Effective Total Time: 0.370556

Integration Program runs with 56 threads, padding=8
Integration Result: 3.141593
0.006138 Time passed for Integration...
Effective Total Time: 0.343719

In other words: with modern processors false sharing is no longer a problem. The processor keeps a separate accumulator on each core and does not write to the falsely shared locations until it's absolutely necessary.

EDIT: since there was a suggestion that this only works because of the static bounds, I've made a version of the code with

#include <stdlib.h>   // for malloc

#define TPINDEX(t,p) ((t)*PAD+(p))

int main(){
  //  int i, nthreads;
  omp_set_num_threads(NTHREADS);
  double pi=0,
    *sum = (double*) malloc( NTHREADS*PAD*sizeof(double) );
#pragma omp parallel for
  for (int t=0; t<NTHREADS; t++)
    for (int p=0; p<PAD; p++)
      sum[ TPINDEX(t,p) ] = 0;

and

  int nthreads;
#pragma omp parallel
  {
    int id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if(id==0) nthreads = nthrds;
    for(int i=id;i<num_steps;i=i+nthrds){
      x = (i+0.5)*step;
      sum[ TPINDEX(id,0) ] += 4.0/(1.0+x*x);
    }
  }
  for (int i=0; i<nthreads; i++)
    pi += sum[ TPINDEX(i,0) ]*step;

and I get basically the same behaviour, with no dependence on the padding:

[c202-001 c:7] make run_mattmal ECHO=0
Integration Program runs with 56 threads, padding=1
Integration Result: 3.141593
0.001773 Time passed for Integration...
Effective Total Time: 0.099295

Integration Program runs with 56 threads, padding=2
Integration Result: 3.141593
0.001569 Time passed for Integration...
Effective Total Time: 0.087866

Integration Program runs with 56 threads, padding=3
Integration Result: 3.141593
0.002002 Time passed for Integration...
Effective Total Time: 0.112112

Integration Program runs with 56 threads, padding=4
Integration Result: 3.141593
0.001569 Time passed for Integration...
Effective Total Time: 0.087852

Integration Program runs with 56 threads, padding=5
Integration Result: 3.141593
0.001550 Time passed for Integration...
Effective Total Time: 0.086798

Integration Program runs with 56 threads, padding=6
Integration Result: 3.141593
0.001598 Time passed for Integration...
Effective Total Time: 0.089481

Integration Program runs with 56 threads, padding=7
Integration Result: 3.141593
0.001582 Time passed for Integration...
Effective Total Time: 0.088587

Integration Program runs with 56 threads, padding=8
Integration Result: 3.141593
0.001573 Time passed for Integration...
Effective Total Time: 0.088093