I have a parallel section of code that uses a THREADPRIVATE ALLOCATABLE array of a derived type which, in turn, contains other ALLOCATABLE variables:
    MODULE MYMOD
      TYPE OBJ
        REAL, DIMENSION(:), ALLOCATABLE :: foo1
        REAL, DIMENSION(:), ALLOCATABLE :: foo2
      END TYPE
      TYPE(OBJ), DIMENSION(:), ALLOCATABLE :: priv
      TYPE(OBJ), DIMENSION(:), ALLOCATABLE :: shared
      !$OMP THREADPRIVATE(priv)
    END MODULE
The variable "priv" is used by each thread as a buffer for heavy calculations and is then copied to a shared variable.
    MODULE MOD2
    IMPLICIT NONE
    CONTAINS
      SUBROUTINE DOSTUFF()
        INTEGER :: n, dim
        !$OMP PARALLEL PRIVATE(n,dim)
        ! n and dim are assumed to be set by each thread before this call (omitted here)
        CALL ALLOCATESTUFF(n,dim)
        CALL HEAVYSTUFF()
        CALL COPYSTUFFONSHARED()
        !$OMP END PARALLEL
      END SUBROUTINE DOSTUFF
      SUBROUTINE ALLOCATESTUFF(n,dim)
        USE MYMOD, ONLY : priv
        INTEGER, INTENT(IN) :: n, dim
        INTEGER :: i
        ! Each thread allocates its own THREADPRIVATE copy of priv
        ALLOCATE(priv(n))
        DO i=1,n
          ALLOCATE(priv(i)%foo1(dim))
          ALLOCATE(priv(i)%foo2(dim))
        ENDDO
      END SUBROUTINE ALLOCATESTUFF
      SUBROUTINE COPYSTUFFONSHARED()
        USE MYMOD
        ...
      END SUBROUTINE COPYSTUFFONSHARED
      SUBROUTINE HEAVYSTUFF()
        USE MYMOD, ONLY : priv
        ...
      END SUBROUTINE HEAVYSTUFF
    END MODULE MOD2
I'm running this code on a machine with two CPUs, each with 10 cores, and I see a sharp loss of performance beyond 10 threads: the code scales linearly up to 10 threads, and past that point the slope of the speed-up curve drops markedly. I get very similar behavior on a machine with 8 CPUs, each with 4 cores, but there the drop occurs at around 5 or 6 threads.
In terms of orders of magnitude, "n" of priv is small (less than 10), whereas "dim" for each "foo" is of the order of a few million.
My guess from this behavior is that there is some kind of bottleneck in memory access caused by the interconnect between the CPUs. The strange part is that if I measure the time spent in HEAVYSTUFF and COPYSTUFFONSHARED separately, it is HEAVYSTUFF that slows down, whereas COPYSTUFFONSHARED shows an almost linear speed-up.
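To make the measurement concrete, here is a minimal, self-contained sketch of how the two phases can be timed per thread with OMP_GET_WTIME. It is a simplified stand-in, not the real code: the routine bodies, the module name TIMING_MOD, the arrays buf and shared_buf, the array size, and the timer variables t0, t_heavy and t_copy are all placeholders chosen for illustration.

```fortran
MODULE TIMING_MOD
  USE OMP_LIB
  IMPLICIT NONE
  REAL, DIMENSION(:), ALLOCATABLE   :: buf         ! stand-in for priv(i)%foo1
  REAL, DIMENSION(:,:), ALLOCATABLE :: shared_buf  ! stand-in for shared, one column per thread
  !$OMP THREADPRIVATE(buf)
CONTAINS
  SUBROUTINE HEAVYSTUFF()
    ! Placeholder workload: repeated sweeps over the thread's own buffer
    INTEGER :: i, k
    DO k = 1, 20
      DO i = 1, SIZE(buf)
        buf(i) = 0.5*buf(i) + 1.0
      END DO
    END DO
  END SUBROUTINE HEAVYSTUFF
  SUBROUTINE COPYSTUFFONSHARED()
    ! Placeholder copy: each thread writes only its own column, so no race
    shared_buf(:, OMP_GET_THREAD_NUM() + 1) = buf
  END SUBROUTINE COPYSTUFFONSHARED
END MODULE TIMING_MOD

PROGRAM TIMING_SKETCH
  USE OMP_LIB
  USE TIMING_MOD
  IMPLICIT NONE
  INTEGER, PARAMETER :: dim = 1000000   ! arbitrary size for the sketch
  DOUBLE PRECISION :: t0, t_heavy, t_copy

  ALLOCATE(shared_buf(dim, OMP_GET_MAX_THREADS()))

  !$OMP PARALLEL PRIVATE(t0, t_heavy, t_copy)
  ALLOCATE(buf(dim))
  buf = 0.0

  t0 = OMP_GET_WTIME()
  CALL HEAVYSTUFF()
  t_heavy = OMP_GET_WTIME() - t0      ! wall time of the compute phase

  t0 = OMP_GET_WTIME()
  CALL COPYSTUFFONSHARED()
  t_copy = OMP_GET_WTIME() - t0       ! wall time of the copy phase

  ! Serialize the output so lines from different threads do not interleave
  !$OMP CRITICAL
  PRINT *, 'thread', OMP_GET_THREAD_NUM(), ' heavy:', t_heavy, ' copy:', t_copy
  !$OMP END CRITICAL

  DEALLOCATE(buf)
  !$OMP END PARALLEL
END PROGRAM TIMING_SKETCH
```

The sketch assumes the compiler's OpenMP flag is enabled (for example -fopenmp with gfortran); running it with increasing OMP_NUM_THREADS gives per-thread timings for each phase that can be compared across thread counts.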
The question is: am I guaranteed that the memory of a THREADPRIVATE derived type is actually allocated locally to the CPU on which the owning thread runs? If so, what else could explain this behavior? If not, how can I force data locality?
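One thing I could try is to have each thread write to its buffers immediately after allocating them, as sketched below. This rests on two assumptions I have not verified for my setup: that the OS places memory pages on the NUMA node of the thread that first writes to them (first-touch, which as far as I know is the default on Linux), and that threads stay pinned to their cores. The names FIRSTTOUCH_MOD and ALLOCATE_AND_TOUCH and the sizes in the main program are made up for this sketch.

```fortran
MODULE FIRSTTOUCH_MOD
  IMPLICIT NONE
  TYPE OBJ
    REAL, DIMENSION(:), ALLOCATABLE :: foo1
    REAL, DIMENSION(:), ALLOCATABLE :: foo2
  END TYPE
  TYPE(OBJ), DIMENSION(:), ALLOCATABLE :: priv
  !$OMP THREADPRIVATE(priv)
CONTAINS
  SUBROUTINE ALLOCATE_AND_TOUCH(n, dim)
    ! Hypothetical variant of ALLOCATESTUFF: allocate, then write right away
    INTEGER, INTENT(IN) :: n, dim
    INTEGER :: i
    ALLOCATE(priv(n))
    DO i = 1, n
      ALLOCATE(priv(i)%foo1(dim))
      ALLOCATE(priv(i)%foo2(dim))
      ! First write is done by the owning thread; under a first-touch
      ! policy (assumed) this is what decides where the pages are placed
      priv(i)%foo1 = 0.0
      priv(i)%foo2 = 0.0
    END DO
  END SUBROUTINE ALLOCATE_AND_TOUCH
END MODULE FIRSTTOUCH_MOD

PROGRAM FIRSTTOUCH_SKETCH
  USE FIRSTTOUCH_MOD
  IMPLICIT NONE
  !$OMP PARALLEL
  ! Arbitrary sizes for the sketch: n = 4 objects, dim = 1000000 elements
  CALL ALLOCATE_AND_TOUCH(4, 1000000)
  !$OMP END PARALLEL
END PROGRAM FIRSTTOUCH_SKETCH
```

For the pinning I would set the standard OpenMP environment variables, for example OMP_PROC_BIND=true and OMP_PLACES=cores, before running. I do not know whether this is sufficient or even necessary here, which is part of what I am asking.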
Thank you