I've been playing around with OpenMP and am trying to see if I can get a speedup in a particular bit of C++ code:
#pragma omp parallel for
for (Index j = alignedSize; j < size; ++j)
{
    res[j] = cj.pmadd(lhs0(j), pfirst(ptmp0), res[j]);
    res[j] = cj.pmadd(lhs1(j), pfirst(ptmp1), res[j]);
    res[j] = cj.pmadd(lhs2(j), pfirst(ptmp2), res[j]);
    res[j] = cj.pmadd(lhs3(j), pfirst(ptmp3), res[j]);
}
I'm a complete newbie with OpenMP so be gentle with me, but could someone shed some light on why this code ends up doubling the execution time rather than speeding it up?
I'm running with 4 cores, just in case that matters.
What is the size of a res entry? If it's less than the size of a cache line, then it's likely false sharing.
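To make the cache-line point concrete, here is a minimal sketch, not the code from the question. It assumes a 64-byte cache line (typical on x86-64, but check your hardware), float entries, and a simple multiply-add loop; the names kCacheLineBytes, entriesPerCacheLine, kMinParallelTrip and scaledAdd are made up for illustration, and the threshold value is an untuned guess. With schedule(static) and no chunk size, each thread gets one contiguous block of iterations, so only the cache lines at block boundaries can be shared between threads. The if clause keeps short loops serial, where thread start-up cost would probably outweigh any gain.

```cpp
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines. Typical on x86-64, not guaranteed.
constexpr std::size_t kCacheLineBytes = 64;

// Number of entries of type T that fit in one (assumed) cache line.
template <typename T>
constexpr std::size_t entriesPerCacheLine() {
    return sizeof(T) >= kCacheLineBytes ? 1 : kCacheLineBytes / sizeof(T);
}

// Hypothetical threshold below which the loop stays serial (untuned guess).
constexpr long kMinParallelTrip = 10000;

// Hypothetical stand-in for the loop in the question: res[j] += a[j] * s.
// Expects a.size() >= res.size().
void scaledAdd(std::vector<float>& res, const std::vector<float>& a, float s) {
    const long n = static_cast<long>(res.size());
    // schedule(static) without a chunk size: one contiguous block per thread.
    // if(...): run in parallel only when the trip count is large enough.
    #pragma omp parallel for schedule(static) if(n > kMinParallelTrip)
    for (long j = 0; j < n; ++j) {
        res[j] += a[j] * s;
    }
}
```

If the compiler is not given an OpenMP flag (for example -fopenmp with GCC or Clang), the pragma is ignored and the loop simply runs serially, which is a handy baseline for timing comparisons.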