I have this function:
Mat HessianDetector::hessianResponse(const Mat &inputImage, float norm)
{
//...
const float *in = inputImage.ptr<float>(1);
Mat outputImage(rows, cols, CV_32FC1);
float *out = outputImage.ptr<float>(1) + 1;
//...
for (int r = 1; r < rows - 1; ++r)
{
float v11, v12, v21, v22, v31, v32;
v11 = in[-stride]; v12 = in[1 - stride];
v21 = in[ 0]; v22 = in[1 ];
v31 = in[+stride]; v32 = in[1 + stride];
in += 2;
for (int c = 1; c < cols - 1; ++c, in++, out++)
{
/* fetch remaining values (last column) */
const float v13 = in[-stride];
const float v23 = *in;
const float v33 = in[+stride];
// compute the 2x2 Hessian entries from the 3x3 window via symmetric differences
float Lxx = (v21 - 2*v22 + v23);
float Lyy = (v12 - 2*v22 + v32);
float Lxy = (v13 - v11 + v31 - v33)/4.0f;
/* normalize and write out */
*out = (Lxx * Lyy - Lxy * Lxy)*norm2;
/* move window */
v11=v12; v12=v13;
v21=v22; v22=v23;
v31=v32; v32=v33;
/* move input/output pointers */
}
out += 2;
}
return outputImage;
}
Which is called with:
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
In particular, Intel Advisor says that the inner loop is time consuming and should be vectorized:
for (int c = 1; c < cols - 1; ++c, in++, out++)
However, it also says that there is a read-after-write dependency at these two lines:
Read:
float Lyy = (v12 - 2*v22 + v32);
Write:
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
But I don't really understand why this happens (even though I know what a RAW dependency means).
This is the optimization report:
LOOP BEGIN at /home/luca/Dropbox/HKUST/CloudCache/cloudcache/CloudCache/Descriptors/hesaff/pyramid.cpp(92,7)
remark #17104: loop was not parallelized: existence of parallel dependence
remark #17106: parallel dependence: assumed ANTI dependence between *(in+cols*4) (95:28) and *out (105:11)
remark #17106: parallel dependence: assumed FLOW dependence between *out (105:11) and *(in+cols*4) (95:28)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed ANTI dependence between *(in+cols*4) (95:28) and *out (105:11)
remark #15346: vector dependence: assumed FLOW dependence between *out (105:11) and *(in+cols*4) (95:28)
LOOP END
Line 95 is:
const float v13 = in[-stride];
Line 105 is:
*out = (Lxx * Lyy - Lxy * Lxy)*norm2;
What the optimization report is telling you is that some values in one iteration of your loop depend on values from the previous iteration. In particular, the "move window" block copies values between locals, so the values of v11, v12, etc. in the next iteration depend on the values of v12, v23, etc. in this iteration. This prevents the compiler from vectorizing the loop. The solution there is to initialize all nine of the v variables within the body of the c loop. I don't know if fixing this will clear up the original RAW issue.
One other tweak is to move scaleCyclesLevel out of the j loop (into the i loop instead), since its value doesn't depend on j.
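The hoisted caller might look like the sketch below. Note that collapse(2) requires perfectly nested loops, so the hoisted form drops the collapse and parallelizes the i loop alone; the sigmas vector and the squared-sigma placeholder are stand-ins for par.sigmas and the real hessianResponse call.

```cpp
#include <vector>

// Hypothetical sketch: scaleCyclesLevel hoisted out of the j loop.
// The placeholder body stores curSigma*curSigma where the original
// stores hessianResponse(blurs[...], curSigma*curSigma).
std::vector<float> computeResponses(int levels, int scaleCycles,
                                    const std::vector<float> &sigmas)
{
    std::vector<float> hessResps(levels * scaleCycles + 1, 0.0f);
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < levels; i++) {
        const int scaleCyclesLevel = scaleCycles * i;  // depends on i only
        for (int j = 1; j <= scaleCycles; j++) {
            float curSigma = sigmas[j];
            hessResps[j + scaleCyclesLevel] = curSigma * curSigma;  // placeholder
        }
    }
    return hessResps;
}
```

Each (i, j) pair still writes a distinct hessResps slot, so the i loop parallelizes safely without the collapse clause.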