After an inappropriate amount of time, I just found out that even with nested OpenMP disabled, the inner parallel region in the following sample still runs in parallel:
#pragma omp parallel num_threads(1)
{
    printf("BBB thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
    #pragma omp parallel
    {
        printf("CCC thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
    }
}
Yes, I did set num_threads to 1, but it is still a parallel region. Why does it not behave as one in terms of nested OpenMP? Why is (an equivalent of) omp_get_active_level() used to determine the nesting, instead of omp_get_level()? It just does not make sense to me. Why does num_threads(3) behave analogously to num_threads(2), while num_threads(1) behaves differently?
Is this behavior expected? I tested with g++ and icpx compilers and both work in the same way.
if(false) has the same effect as num_threads(1), but that is to be expected, since with it you actually specify that you don't want to launch a parallel region. But it still affects omp_get_level(), which seems weird.
I did read this algorithm, so this is more a question of why it is designed this way.
By the way, this is the output I get with OMP_NUM_THREADS=4 (AAA is completely outside any parallel region):
AAA thread 0/1 level 0 0
num_threads(3):
BBB thread 1/3 level 1 1
CCC thread 0/1 level 1 2
BBB thread 0/3 level 1 1
CCC thread 0/1 level 1 2
BBB thread 2/3 level 1 1
CCC thread 0/1 level 1 2
num_threads(2):
BBB thread 1/2 level 1 1
CCC thread 0/1 level 1 2
BBB thread 0/2 level 1 1
CCC thread 0/1 level 1 2
num_threads(1):
BBB thread 0/1 level 0 1
CCC thread 0/4 level 1 2
CCC thread 1/4 level 1 2
CCC thread 2/4 level 1 2
CCC thread 3/4 level 1 2
and the full program:
#include <cstdio>
#include <omp.h>

int main()
{
    printf("AAA thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());

    printf("\nnum_threads(3):\n");
    #pragma omp parallel num_threads(3)
    {
        printf("BBB thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
        #pragma omp parallel
        {
            printf("CCC thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
        }
    }

    printf("\nnum_threads(2):\n");
    #pragma omp parallel num_threads(2)
    {
        printf("BBB thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
        #pragma omp parallel
        {
            printf("CCC thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
        }
    }

    printf("\nnum_threads(1):\n");
    #pragma omp parallel num_threads(1)
    {
        printf("BBB thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
        #pragma omp parallel
        {
            printf("CCC thread %d/%d level %d %d\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_active_level(), omp_get_level());
        }
    }

    return 0;
}
Disabling nesting is equivalent to setting max-active-levels-var to 1, either using the environment variable (OMP_MAX_ACTIVE_LEVELS=1) or using the runtime function (omp_set_max_active_levels(1)). A parallel region executing with a single thread is defined as an inactive parallel region. Therefore, such a parallel region does not count towards the max-active-levels limit.
As other comments suggested, the num_threads clause should only be used when really necessary. A more flexible way is to export OMP_NUM_THREADS=1,4 to get the output of your last experiment.
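To make the active/inactive distinction concrete, here is a small toy model of the rule described above. It is not the runtime's actual code and not the full algorithm from the specification: it ignores dyn-var, thread-limit-var, the if clause and everything else, and the names Region and enter_parallel are made up for illustration. It only captures two things: a region gets more than one thread only while the active level is below max-active-levels-var, and only a region with more than one thread raises the active level.

```cpp
// Toy model only: Region and enter_parallel are hypothetical names,
// not part of the OpenMP API.
struct Region {
    int level;         // what omp_get_level() would report
    int active_level;  // what omp_get_active_level() would report
    int team_size;     // what omp_get_num_threads() would report
};

// Models entering a parallel region that asks for `requested` threads.
Region enter_parallel(const Region& parent, int requested, int max_active_levels)
{
    // Once the active-level budget is used up, the region runs with one thread.
    int team = (parent.active_level >= max_active_levels) ? 1 : requested;

    Region r;
    r.level = parent.level + 1;  // every parallel region counts here
    // Only a team of more than one thread makes the region "active".
    r.active_level = parent.active_level + (team > 1 ? 1 : 0);
    r.team_size = team;
    return r;
}
```

Starting from Region{0, 0, 1} with max_active_levels set to 1, an outer request of 3 or 2 threads uses up the single active level, so an inner request of 4 is cut down to 1. An outer request of 1 leaves the active level at 0, so the inner region still gets its 4 threads while level goes to 2. That is the same pattern as the output in the question.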