In "Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU", I determined that the runtime of my programme varies between 11s and 13s (almost always above 12s, and sometimes as slow as 13.4s) at around 500% CPU when using the default `#pragma omp parallel for`, and that the OpenMP speed-up is only 2.5x at 5x CPU with `g++-4.7 -O3 -fopenmp` on a 4-core, 8-thread Xeon.
I tried using `schedule(static) num_threads(4)`, and noticed that my programme always completes in 11.5s to 11.7s (always below 12s) at about 320% CPU, i.e., it runs more consistently and uses fewer resources (even if the best run is half a second slower than the rare outlier with hyperthreading).
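For reference, a minimal sketch of how those two clauses attach to a loop. The function `sum_of_squares` and its workload are hypothetical stand-ins, not the programme measured above; the thread count of 4 is simply the physical core count of the machine in question.

```cpp
#include <vector>

// Hypothetical workload (not the original programme): sum of squares.
double sum_of_squares(const std::vector<double>& v) {
    double total = 0.0;
    const long n = static_cast<long>(v.size());
    // schedule(static): iterations are split into roughly equal contiguous
    // chunks, at most one per thread, decided up front.
    // num_threads(4): request a team of 4 threads for this region only.
    // reduction(+:total): each thread sums privately, then results are combined.
    #pragma omp parallel for schedule(static) num_threads(4) reduction(+:total)
    for (long i = 0; i < n; ++i)
        total += v[i] * v[i];
    return total;
}
```

Compiled without `-fopenmp`, the pragma is ignored and the loop runs serially, so the function still returns the same result.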
Is there any simple OpenMP way to detect hyperthreading, and reduce `num_threads()` to the actual number of CPU cores?
(There is a similar question, "Poor performance due to hyper-threading with OpenMP: how to bind threads to cores", but in my testing, I found that a mere reduction from 8 to 4 threads somehow already does that job with g++-4.7 on Debian 7 wheezy and a Xeon E3-1240v3, so this question is merely about reducing `num_threads()` to the number of cores.)
If you were running under Linux [also assuming an x86 arch], you could look at `/proc/cpuinfo`. There are two fields, `cpu cores` and `siblings`. The first is the number of [real] cores and the latter is the number of hardware threads, i.e. logical CPUs including hyperthreads (e.g. on my four-core hyperthreaded machine they are 4 and 8 respectively).
Because Linux can detect this [and from the link in Zulan's comment], the information is also available from the x86 `cpuid` instruction.
Either way, there is also an environment variable for this, `OMP_NUM_THREADS`, which may be easier to use in conjunction with a launcher/wrapper script.
One thing you may wish to consider is that beyond a certain number of threads, you can saturate the memory bus, and no increase in threads [or cores] will improve performance; it may, in fact, reduce it.
In this question, "Atomically increment two integers with CAS", there is a link to a two-part video talk from CppCon 2015: https://www.youtube.com/watch?v=lVBvHbJsg5Y and https://www.youtube.com/watch?v=1obZeHnAwz4
They're about 1.5 hours each, but, IMO, well worth it.
In the talk, the speaker [who has done a lot of multithread/multicore optimization] says that, in his experience, the memory bus/system tends to get saturated after about four threads.