#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;
#define PAD 8            /* 8 doubles = 64 bytes = one cache line */
#define NUM_THREADS 6

int main(void)
{
    int i, nthreads;
    double pi = 0.0, sum[NUM_THREADS][PAD] = {0};

    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    /* Start timer */
    double time_start = omp_get_wtime();

    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;

        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;

        /* Cyclic distribution of the integration steps across threads */
        for (i = id; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x);
        }
    }

    /* Combine the per-thread partial sums */
    for (i = 0; i < nthreads; i++)
        pi += sum[i][0] * step;

    /* Stop timer */
    double time_end = omp_get_wtime();
    double timepass = time_end - time_start;

    printf("Integration program runs with %d threads\n", nthreads);
    printf("Integration result: %lf\n", pi);
    printf("%lf seconds elapsed for integration\n", timepass);
    printf("Effective total time: %lf\n\n", timepass * nthreads);

    return 0;
}
This snippet of code is taken from an OpenMP tutorial by Tim Mattson. It integrates the function 4.0/(1+x*x), holding each thread's partial result in a 2D array named sum so that the per-thread accumulators sit on different cache lines. I am on a Linux machine and have checked that I have the standard 64-byte cache lines on L1, L2, and L3. I compiled with gcc with no optimizations and expected the runtime to decrease as threads were added. This is what I got for the runtime:
1 threads: 0.356362
2 threads: 0.541903
3 threads: 0.416097
4 threads: 0.346139
5 threads: 0.286879
6 threads: 0.315139
It seems that false sharing still occurs even with the padding, and I am confused why. I have increased the padding to larger sizes and the scalability is similarly poor. The only thing that fixes it is turning on compiler optimizations; even -O1 makes the code scale well. I am not sure why that is.
I wonder if the story about false sharing needs to be revisited. I've adapted the code to
also:
so that I can run a quick shell loop:
and this is what I get:
In other words: with modern processors false sharing is no longer a problem. The processor keeps a separate accumulator on each core and does not write to the falsely shared locations until it's absolutely necessary.
EDIT: since there was a suggestion that this only works because of the static loop bounds, I've made a version of the code with
and
and I get basically the same: