Qualcomm Hexagon: Optimize MAC-loop


I would like to optimize a simple multiply-accumulate for-loop for the Qualcomm Hexagon DSP. From the manual, it is not clear to me how to do that, either for the vector version or for the non-vector version.

Assume my loop has a length which is a multiple of 4 (e.g., 64), i.e., I want to unroll the loop by a factor of 4. How would I do that? I can use either C intrinsics or asm code, but I don't understand how to do the 4x memory load in the first place.

Here is what my loop could look like in C:

Word32 sum = 0;
Word16 *pointer1; Word16 *pointer2;

for (i=0; i<64; i++)
{
    sum += pointer1[i]*pointer2[i];
}

Any suggestions?

There is 1 answer below


Here is a FIR filter implementation that demonstrates how to use Q6_P_vrmpyhacc_PP, the halfword multiply-accumulate intrinsic. This instruction is described as a 'big mac' in the PRM (Programmer's Reference Manual).

This instruction is in the scalar core, so it does not require the HVX vector coprocessor.

void FIR08(short_8B_align Input[],
           short_8B_align Coeff[],
           short_8B_align Output[],
           int unused, int ntaps,
           int nsamples)
{
  Word64 * vInput = (Word64*)Input;
  Word64 * vCoeff = (Word64*)Coeff;
  Word64 *__restrict vOutput = (Word64*)Output;
  int i, j;
  Word64 sum0, sum1, sum2, sum3;

  for (i = 0; i < nsamples/4; i++)
  {
      sum0 = sum1 = sum2 = sum3 = 0;
      for (j = 0; j < ntaps/4; j++)
      {
          Word64 vIn1 = vInput[i+j];
          Word64 vIn2 = vInput[i+j+1];
          Word64 curCoeff = vCoeff[j];
          Word64 curIn;

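          /* Output 0: input window starts at the current halfword. */
          curIn = vIn1;
          sum0 = Q6_P_vrmpyhacc_PP(sum0, curIn, curCoeff);

          /* Output 1: window shifted by 2 bytes (one halfword). */
          curIn = Q6_P_valignb_PPI(vIn2, vIn1, 2);
          sum1 = Q6_P_vrmpyhacc_PP(sum1, curIn, curCoeff);

          /* Output 2: window shifted by 4 bytes (two halfwords). */
          curIn = Q6_P_valignb_PPI(vIn2, vIn1, 4);
          sum2 = Q6_P_vrmpyhacc_PP(sum2, curIn, curCoeff);

          /* Output 3: window shifted by 6 bytes (three halfwords). */
          curIn = Q6_P_valignb_PPI(vIn2, vIn1, 6);
          sum3 = Q6_P_vrmpyhacc_PP(sum3, curIn, curCoeff);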
      }

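      /* Pack one halfword from each of the four sums into a 64-bit word,
         then arithmetic-shift each halfword right by 2. */
      Word64 curOut = Q6_P_combine_RR(Q6_R_combine_RhRh(sum3, sum2), Q6_R_combine_RhRh(sum1, sum0));
      vOutput[i] = Q6_P_vasrh_PI(curOut, 2);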
  }
}
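
The same idea can be applied to the plain 64-element dot product from the question. The sketch below is not from the SDK and is only a portable model: `vrmpyh_acc_model` and `dot_product_x4` are hypothetical names, and the helper assumes that Q6_P_vrmpyhacc_PP adds the four signed 16x16 products of its two 64-bit operands to a 64-bit accumulator. On Hexagon you would presumably replace the helper with the intrinsic and the `memcpy` with a single 64-bit load from an 8-byte-aligned pointer, as the `short_8B_align` type in the FIR code suggests.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for Q6_P_vrmpyhacc_PP (assumed semantics:
   acc += a.h[0]*b.h[0] + a.h[1]*b.h[1] + a.h[2]*b.h[2] + a.h[3]*b.h[3],
   with signed halfwords and a 64-bit accumulator). */
static int64_t vrmpyh_acc_model(int64_t acc, int64_t a, int64_t b)
{
    for (int k = 0; k < 4; k++) {
        /* Extract halfword k of each operand as a signed 16-bit value. */
        int16_t ah = (int16_t)((uint64_t)a >> (16 * k));
        int16_t bh = (int16_t)((uint64_t)b >> (16 * k));
        acc += (int32_t)ah * (int32_t)bh;
    }
    return acc;
}

/* Dot product of two 16-bit arrays, four elements per iteration.
   n must be a multiple of 4. */
static int32_t dot_product_x4(const int16_t *p1, const int16_t *p2, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < n; i += 4) {
        int64_t v1, v2;
        memcpy(&v1, &p1[i], sizeof v1);   /* one 64-bit load = 4 halfwords */
        memcpy(&v2, &p2[i], sizeof v2);
        sum = vrmpyh_acc_model(sum, v1, v2);
    }
    return (int32_t)sum;                  /* truncate to the Word32 result */
}
```

With this structure the loop over 64 elements runs 16 iterations, each doing two 64-bit loads and one multiply-accumulate step. Calling `dot_product_x4(pointer1, pointer2, 64)` should give the same result as the scalar loop in the question, as long as the sum fits in 32 bits.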