I would like to optimize a simple multiply-accumulate for-loop for the Qualcomm Hexagon DSP. From the manual, it's not clear to me how to do that, either for the vector version or for the non-vector version.
Assume my loop length is a multiple of 4 (e.g., 64), so I want to unroll the loop by a factor of 4. How would I do that? I can use either C intrinsics or asm code, but I don't understand how to do the 4x memory load in the first place.
Here is what my loop could look like in C:
Word32 sum = 0;
Word16 *pointer1; Word16 *pointer2;
int i;
for (i = 0; i < 64; i++)
{
    sum += pointer1[i] * pointer2[i];
}
Any suggestions?
Here is a FIR filter implementation that demonstrates how to use Q6_P_vrmpyhacc_PP, the multiply halfword/accumulate intrinsic. This instruction is described as 'big mac' in the PRM. It is in the scalar core, so it does not require the HVX vector coprocessor.
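Before getting into the intrinsic itself, it may help to see what the 4x unrolled loop from the question computes when each iteration consumes four halfwords at once. The following is only a portable C sketch, not Hexagon code: vrmpyh_acc_model, dot_reference and dot_unrolled4 are hypothetical names I made up, and the model assumes (based on the PRM description) that the instruction multiplies four signed 16-bit lanes pairwise and adds the four products to a 64-bit accumulator. On the DSP, the helper would presumably be replaced by the intrinsic; please check the exact signature against your SDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable model of a "multiply halfwords, reduce, accumulate"
 * step: a and b each hold four signed 16-bit lanes; the four lane products
 * are added to a 64-bit accumulator. */
int64_t vrmpyh_acc_model(int64_t acc, uint64_t a, uint64_t b)
{
    for (int lane = 0; lane < 4; lane++) {
        int16_t x = (int16_t)(uint16_t)(a >> (16 * lane));
        int16_t y = (int16_t)(uint16_t)(b >> (16 * lane));
        acc += (int32_t)x * (int32_t)y;
    }
    return acc;
}

/* Reference: the loop from the question, but with a 64-bit sum. */
int64_t dot_reference(const int16_t *p1, const int16_t *p2, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)p1[i] * (int32_t)p2[i];
    return sum;
}

/* Unrolled by 4: each iteration loads 64 bits (four halfwords) from each
 * input and feeds them to one accumulate step. n must be a multiple of 4. */
int64_t dot_unrolled4(const int16_t *p1, const int16_t *p2, size_t n)
{
    assert(n % 4 == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint64_t a, b;
        memcpy(&a, &p1[i], sizeof a); /* the "4x memory load" */
        memcpy(&b, &p2[i], sizeof b);
        sum = vrmpyh_acc_model(sum, a, b);
    }
    return sum;
}
```

For the loop in the question you would call dot_unrolled4(pointer1, pointer2, 64). The sketch uses a 64-bit sum rather than Word32 because the accumulate step works on a 64-bit value; whether narrowing the result back to 32 bits is safe depends on your data range.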