I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:
double doModelFit(int model, ...) {
    ...
    while (!done) {
        cblas_dgemm(...);
        cblas_dgemm(...);
        ...
        dgesv(...);
        ...
    }
    return result;
}
int main(int argc, char **argv) {
    ...
    c_start = 1; c_stop = nmodel;
    for (int c = c_start; c <= c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
    }
}
Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):
int main(int argc, char **argv) {
    ...
    int numthreads = omp_get_max_threads();
    #pragma omp parallel for
    for (int t = 0; t < numthreads; t++) {
        // assuming nmodel is divisible by numthreads...
        int c_start = t*nmodel/numthreads + 1;   // per-thread local bounds
        int c_stop  = (t+1)*nmodel/numthreads;
        for (int c = c_start; c <= c_stop; c++) {
            ...
            result = doModelFit(c, ...);
            ...
        }
    }
}
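The manual chunking above can also be written more simply by letting OpenMP partition the model loop itself; a minimal sketch, with the same nmodel and doModelFit() as above:

#pragma omp parallel for schedule(static)
for (int c = 1; c <= nmodel; c++) {
    result = doModelFit(c, ...);   // each thread fits its own share of the models
}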
When I run version 1 on the host machine, it takes ~11 seconds and VTune reports poor parallelization, with most of the time spent idle. Version 2 on the host machine takes ~5 seconds and VTune reports great parallelization (nearly 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run from the command line on mic0. When I use VTune to profile them:
- Version 1 takes roughly the same 30 seconds, and the hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield. Out of 7710 s of CPU time, 5804 s are spent in Spin Time.
- Version 2 takes fooooorrrreevvvver... I killed it after it had run for a couple of minutes in VTune. The hotspot analysis shows that of 25254 s of CPU time, 21585 s are spent in [vmlinux].
Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm certain I'm making rookie mistakes.
Thanks, Andrew
The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel regions inside the MKL implementations of cblas_dgemm() and dgesv(). See this example, for instance. This diagnosis is supported and explained by Jim Dempsey at the Intel forum.
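If over-subscription is the cause, a common remedy (not spelled out in the answer above, but standard practice with MKL) is to keep the MKL calls sequential inside the outer OpenMP loop: link against the sequential MKL library (e.g. -mkl=sequential with the Intel compiler), set MKL_NUM_THREADS=1 in the environment, or restrict MKL's threading at run time. A minimal sketch of the run-time approach, reusing the nmodel and doModelFit() names from the question (the value of nmodel here is only illustrative):

#include <mkl.h>
#include <omp.h>

int main(int argc, char **argv) {
    int nmodel = 64;          /* assumed model count, for illustration only */

    /* The outer OpenMP loop supplies the parallelism, so force the MKL calls
       made inside doModelFit() to run on a single thread each. */
    mkl_set_dynamic(0);
    mkl_set_num_threads(1);

    #pragma omp parallel for schedule(static)
    for (int c = 1; c <= nmodel; c++) {
        /* result = doModelFit(c, ...);   // cblas_dgemm()/dgesv() now run sequentially */
    }
    return 0;
}

With matrices as small as those described in the question, one MKL thread per model fit is usually the right trade-off; the parallelism across independent models is what should keep the cores busy.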