ARM Neon Intrinsics - Lanes in FMA

I'm new to ARM NEON intrinsics and was looking over the documentation for them. They provide a great set of examples, including one for matrix multiplication that uses their vector FMA instruction. I was, however, rather confused by the last parameter. Here's an excerpt from the code.

    C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B0, 3);

The 0, 1, 2, 3 at the end is the part that is confusing me. According to the documentation found here: https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vfmaq_laneq_f32, this refers to the lane. In the other documentation I've read, "lane" refers to how the packed variable is divided up into 64-, 32-, 16-, or 8-bit data types, which does not make sense in this context. I'm probably missing something, but to me it seems like they're using the same word here with a different meaning.

So what does lane mean in this context? What would happen if I reversed the order? What would happen if I set them all to 0?

Note: Here is the link to the matrix multiplication example https://developer.arm.com/documentation/102467/0201/Example---matrix-multiplication

Answer by Peter Cordes

Weird: the corresponding asm instruction (https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en, linked from the FMLA hyperlink in the intrinsic doc you cited) doesn't mention an index into one of the input vectors.

But FMLA (by element) does: it broadcasts one element of the second multiplier vector instead of doing a pure vertical operation.

Its Operation pseudocode is:

    CheckFPAdvSIMDEnabled64();
    bits(datasize) operand1 = V[n];
    bits(idxdsize) operand2 = V[m];
    bits(datasize) operand3 = V[d];
    bits(esize) element1;
    bits(esize) element2 = Elem[operand2, index, esize];
    FPCRType fpcr = FPCR[];
    boolean merge    = elements == 1 && IsMerging(fpcr);
    bits(128) result = if merge then V[d] else Zeros();

    for e = 0 to elements-1
        element1 = Elem[operand1, e, esize];
        if sub_op then element1 = FPNeg(element1);
        Elem[result, e, esize] = FPMulAdd(Elem[operand3, e, esize], element1, element2, fpcr);

    V[d] = result;

Notice that element2 is indexed outside the loop and used for each of the 4 elements inside the loop. (Or 2 elements for double-precision in 128-bit vectors or single-precision in 64-bit vectors, or 8 elements for half-precision in 128-bit vectors.)
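In other words (a rough scalar model of my own, treating the float32x4_t values as plain float arrays; this is not ARM's pseudocode):

    /* Hypothetical scalar model of C0 = vfmaq_laneq_f32(C0, A0, B0, lane):
       B0[lane] is picked once and reused for every element of the result. */
    void fma_lane_model(float C0[4], const float A0[4], const float B0[4], int lane)
    {
        for (int i = 0; i < 4; i++)
            C0[i] = C0[i] + A0[i] * B0[lane];
    }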

Note the operand order: the intrinsic matches the asm, with the accumulator as the first operand (in asm it's also the destination register). That's different from ISO C fma(mul1, mul2, add), where the addend comes last: vfmaq_laneq_f32(a, b, v, lane) computes a + b * v[lane] for each element.

So this uses each element of B0 as the multiplier for a different A vector, instead of doing separate broadcast-loads for each element of that row. That's exactly what a matmul wants to do.
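For example, here's a minimal sketch (my own, not ARM's exact example code; matmul_col and the pointer names are made up) of how one result column of a column-major 4x4 float matmul uses those four lane FMAs:

    #include <arm_neon.h>

    /* One column of C = A*B for column-major 4x4 float matrices:
       a points at all of A (16 floats), b_col at one column of B (4 floats),
       c_col receives the corresponding column of C. */
    void matmul_col(const float *a, const float *b_col, float *c_col)
    {
        float32x4_t A0 = vld1q_f32(a + 0);    /* column 0 of A */
        float32x4_t A1 = vld1q_f32(a + 4);    /* column 1 of A */
        float32x4_t A2 = vld1q_f32(a + 8);    /* column 2 of A */
        float32x4_t A3 = vld1q_f32(a + 12);   /* column 3 of A */
        float32x4_t B0 = vld1q_f32(b_col);    /* one column of B */

        float32x4_t C0 = vmovq_n_f32(0.0f);
        C0 = vfmaq_laneq_f32(C0, A0, B0, 0);  /* += A0 * B0[0] */
        C0 = vfmaq_laneq_f32(C0, A1, B0, 1);  /* += A1 * B0[1] */
        C0 = vfmaq_laneq_f32(C0, A2, B0, 2);  /* += A2 * B0[2] */
        C0 = vfmaq_laneq_f32(C0, A3, B0, 3);  /* += A3 * B0[3] */
        vst1q_f32(c_col, C0);
    }

Each result column is a linear combination of the columns of A, with the weights coming from one column of B; the lane index picks which weight multiplies which A column.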


So the intrinsic docs are confusing because they linked (and copied pseudocode from) the pure-vertical FMLA instruction, not the lane-broadcast version.

There's also vfmaq_n_f32, where the last arg is a scalar float32_t. It's effectively the same as vfmaq_laneq_f32(a, b, c, 0) with the scalar sitting in lane 0 of c, but without having to create a C vector type from the scalar.
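A quick sketch of the difference (wrapper names are mine; this assumes the a + b * n and a + b * v[lane] semantics from the intrinsics guide):

    #include <arm_neon.h>

    /* acc + v * s, with the scalar operand broadcast by the intrinsic itself */
    float32x4_t fma_scalar(float32x4_t acc, float32x4_t v, float s)
    {
        return vfmaq_n_f32(acc, v, s);
    }

    /* Same result via the lane form: put the scalar into a vector first */
    float32x4_t fma_scalar_lane(float32x4_t acc, float32x4_t v, float s)
    {
        return vfmaq_laneq_f32(acc, v, vdupq_n_f32(s), 0);
    }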

For pure vertical FMA, you want vfmaq_f32, which just takes 3 float32x4_t vectors, no immediate. (The q is to distinguish it from the version that takes 3 float32x2_t 64-bit vectors in D registers instead of 128-bit Q registers; 32-bit code used register widths and different mnemonics instead of just arrangement specifiers on v register names.)
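A tiny sketch of the two pure-vertical variants (AArch64; wrapper names are mine):

    #include <arm_neon.h>

    /* 128-bit Q-register form: 4 lanes, each computes c[i] + a[i] * b[i] */
    float32x4_t fma_vert_q(float32x4_t c, float32x4_t a, float32x4_t b)
    {
        return vfmaq_f32(c, a, b);
    }

    /* 64-bit D-register form: 2 lanes */
    float32x2_t fma_vert_d(float32x2_t c, float32x2_t a, float32x2_t b)
    {
        return vfma_f32(c, a, b);
    }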

I found it by searching for the asm instruction mnemonic (FMLA) in the search bar of the intrinsics guide.