I've implemented the code below, which generates vectors of random numbers using the MKL VSL library:
! ifort -mkl test1.f90 -cpp -openmp
include "mkl_vsl.f90"
#define ITERATION 1000000
#define LENGH 10000
program test
use mkl_vsl_type
use mkl_vsl
use mkl_service
use omp_lib
implicit none
integer i,brng, method, seed, dm,n,errcode
real(kind=8) r(LENGH) , s
real(kind=8) a, b, start,endd
TYPE (VSL_STREAM_STATE) :: stream
integer(4) :: nt
! *****
brng = VSL_BRNG_SOBOL
method = VSL_RNG_METHOD_UNIFORM_STD
seed = 777
a = 0.0
b = 1.0
s = 0.0
!call omp_set_num_threads(4)
call omp_set_dynamic(0)
nt = omp_get_max_threads()
! *****
print *,'max OMP threads number',nt
! omp_get_dynamic() returns a LOGICAL in Fortran, so test it directly
if (omp_get_dynamic()) then
print '(" Intel OMP may use less than ",I0," threads for a large problem")', nt
else
print '(" Intel OMP should use ",I0," threads for a large problem")', nt
end if
if (1 == omp_get_max_threads()) print *, "Intel OMP does not employ threading"
!call mkl_set_num_threads(4)
call mkl_set_dynamic(0)
nt = mkl_get_max_threads()
print *,'max MKL threads number',nt
if (1 == mkl_get_dynamic()) then
print '(" Intel MKL may use less than ",I0," threads for a large problem")', nt
else
print '(" Intel MKL should use ",I0," threads for a large problem")', nt
end if
if (1 == mkl_get_max_threads()) print *, "Intel MKL does not employ threading"
! ***** Initialize *****
errcode=vslnewstream( stream, brng, seed )
! ***** Call RNG *****
start=omp_get_wtime()
do i=1,ITERATION
errcode=vdrnguniform( method, stream, LENGH, r, a, b )
s = s + sum(r)/LENGH
end do
endd=omp_get_wtime()
! ***** Deleting the stream *****
errcode=vsldeletestream(stream)
! *****
print *, s/ITERATION, endd-start
end program test
I don't see any speedup when using, for instance, 4 or 32 threads.
I use the Intel compiler version 13.1.3 and compile with
ifort -mkl test1.f90 -cpp -openmp
It looks as if the random numbers are not being generated in parallel.
Any hints here?
Thank you,
Éric.
Your code doesn't contain any OpenMP directives to actually parallelise the work; when it executes, it runs on only one thread. It is not sufficient to use omp_lib and to scatter a few calls to functions such as omp_get_wtime around; you actually have to insert some worksharing directives.
If I run your code, as is, my performance monitor shows that only one thread is active, and your code reports
If I simply wrap the loop in an OpenMP worksharing directive, like this
then the performance monitor on my dual-quad-core-with-hyperthreading-PC shows that 16 threads are active and your program reports
I guess the hint I would offer is: study your favourite OpenMP tutorial, in particular the sections covering the parallel and do directives. I offer no warranty that the simple modification I have made does not break your program; in particular I don't guarantee that I haven't introduced a race condition.
I leave you the exercise of determining whether the speed-up on going from 1 to 16 (hyper-)threads is acceptable and any analysis of why it appears to be so modest.
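For what it's worth, here is a minimal sketch of one way to avoid the race condition mentioned above. It is an assumption on my part, not a tested drop-in replacement. Each thread gets its own stream and its own private buffer r, and the partial sums are combined with a reduction on s. Note that I have swapped VSL_BRNG_SOBOL for the VSL_BRNG_MT2203 family, because adding the thread number to VSL_BRNG_MT2203 selects a distinct, independent generator per thread; this changes the sequence from quasi-random to pseudo-random. The program name test_par is made up for illustration.

```fortran
! Sketch only: one VSL stream per thread, private buffer, reduction on s.
! Assumed build line, same as in the question:
!   ifort -mkl test_par.f90 -cpp -openmp
include "mkl_vsl.f90"
#define ITERATION 1000000
#define LENGH 10000
program test_par
  use mkl_vsl_type
  use mkl_vsl
  use omp_lib
  implicit none
  integer :: i, tid, brng, method, seed, errcode
  real(kind=8) :: r(LENGH)
  real(kind=8) :: a, b, s, start, endd
  type (VSL_STREAM_STATE) :: stream

  method = VSL_RNG_METHOD_UNIFORM_STD
  seed = 777
  a = 0.0d0
  b = 1.0d0
  s = 0.0d0

  start = omp_get_wtime()
  ! stream and r are private, so no two threads share generator state
  ! or write into the same buffer; s is combined across threads at the end.
  !$omp parallel private(i, tid, brng, errcode, stream, r) reduction(+:s)
  tid = omp_get_thread_num()
  ! Assumption: MT2203 replaces SOBOL here; each thread picks its own
  ! member of the MT2203 family (the family has a limited number of members).
  brng = VSL_BRNG_MT2203 + tid
  errcode = vslnewstream(stream, brng, seed)
  !$omp do
  do i = 1, ITERATION
    errcode = vdrnguniform(method, stream, LENGH, r, a, b)
    s = s + sum(r)/LENGH
  end do
  !$omp end do
  errcode = vsldeletestream(stream)
  !$omp end parallel
  endd = omp_get_wtime()

  print *, s/ITERATION, endd - start
end program test_par
```

The printed mean should still come out close to 0.5, but the individual numbers will differ from the serial SOBOL run, and they will depend on how many threads are used. If you need to keep SOBOL, the VSL skip-ahead and leapfrog service routines are the place to look, although I have not tried them here.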