What is this Read after Write dependency?

I have this loop in this function:

Mat HessianDetector::hessianResponse(const Mat &inputImage, float norm)
{
   //...
   const float *in = inputImage.ptr<float>(1);
   Mat outputImage(rows, cols, CV_32FC1);
   float      *out = outputImage.ptr<float>(1) + 1;
   //...
   for (int r = 1; r < rows - 1; ++r)
   {
      float v11, v12, v21, v22, v31, v32;      
      v11 = in[-stride]; v12 = in[1 - stride];
      v21 = in[      0]; v22 = in[1         ];
      v31 = in[+stride]; v32 = in[1 + stride];
      in += 2;
      for (int c = 1; c < cols - 1; ++c, in++, out++)
      {
         /* fetch remaining values (last column) */
         const float v13 = in[-stride];
         const float v23 = *in;
         const float v33 = in[+stride];

         // compute 3x3 Hessian values from symmetric differences.
         float Lxx = (v21 - 2*v22 + v23);
         float Lyy = (v12 - 2*v22 + v32);
         float Lxy = (v13 - v11 + v31 - v33)/4.0f;

         /* normalize and write out */
         *out = (Lxx * Lyy - Lxy * Lxy)*norm2;

         /* move window */
         v11=v12; v12=v13;
         v21=v22; v22=v23;
         v31=v32; v32=v33;

         /* input/output pointers are advanced in the for header */
      }
      out += 2;
   }
   return outputImage;
}

Which is called with:

#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
    for (int j = 1; j <= scaleCycles; j++)
    {
        int scaleCyclesLevel = scaleCycles * i;
        float curSigma = par.sigmas[j];
        hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
    }

In particular, Intel Advisor says that the inner loop is time consuming and should be vectorized:

for (int c = 1; c < cols - 1; ++c, in++, out++)

However, it also says that there is a read-after-write (RAW) dependency between these two lines:

Read:

float Lyy = (v12 - 2*v22 + v32);

Write:

hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);

But I don't really understand why this happens (even though I know what a RAW dependency is).

This is the optimization report:

   LOOP BEGIN at /home/luca/Dropbox/HKUST/CloudCache/cloudcache/CloudCache/Descriptors/hesaff/pyramid.cpp(92,7)
      remark #17104: loop was not parallelized: existence of parallel dependence
      remark #17106: parallel dependence: assumed ANTI dependence between *(in+cols*4) (95:28) and *out (105:11)
      remark #17106: parallel dependence: assumed FLOW dependence between *out (105:11) and *(in+cols*4) (95:28)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed ANTI dependence between *(in+cols*4) (95:28) and *out (105:11)
      remark #15346: vector dependence: assumed FLOW dependence between *out (105:11) and *(in+cols*4) (95:28)
   LOOP END

Line 95 is:

     const float v13 = in[-stride];

Line 105 is:

     *out = (Lxx * Lyy - Lxy * Lxy)*norm2;

There are 2 best solutions below

4
On

What the optimization report is telling you is that some values in one iteration of your loop depend on values from the previous iteration. In particular, the "move window" block copies values between locals, so the values of v11, v12, etc. in the next iteration depend on the values of v12, v13, etc. in this iteration. This prevents the compiler from vectorizing the loop.

The solution there is to initialize all 9 of the v variables within the body of the c loop.

I don't know if fixing this will clear up the original RAW issue.

One other tweak is to move the computation of scaleCyclesLevel out of the j loop (so that it is computed in the i loop instead), since its value doesn't depend on j.

2
On

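I don't know how inputImage and outputImage are passed to the function. If the pointers are not declared as restricted, the compiler cannot know whether the data overlaps, so it has to assume that a write to *out may overwrite a value that is read through in during a later iteration. That is what the "assumed FLOW dependence" and "assumed ANTI dependence" remarks in your report describe.

Have a look at how to tell your compiler that the image data doesn't overlap. In C this is the restrict keyword; GCC also accepts it in C++ under the spelling __restrict__.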