I am working with ARM NEON, and I came across a sequence of unrolled instructions that takes roughly double the execution time of an equivalent scalar loop. I am actually writing intrinsics, but the generated assembly is what is shown below.
In this sequence, eight results are obtained.
...
vmov r10, r11, d18
vld1.32 {d21}, [r10]
vadd.i32 d21, d21, d20
vst1.32 {d21}, [r10]
vld1.32 {d21}, [r11]
vadd.i32 d21, d21, d20
vst1.32 {d21}, [r11]
vmov r10, r11, d19
...
vmov r10, r11, d16
...
vmov r10, r11, d17
...
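Expressed in plain C, each unrolled block above does the following for one pair of pointers: a `vmov` transfers two 32-bit addresses from a NEON d-register into core registers r10/r11, and then each bucket is loaded, incremented, and stored back individually. This is a sketch in my own words; the function and parameter names are mine, not from the original code.

```c
#include <stdint.h>

/* Plain-C view of one unrolled block: the equivalent of
 * "vmov r10, r11, d18" hands two pointers to the core side, then
 * each bucket gets its own load / add / store round trip. */
static void process_pair(uint32_t *addr0, uint32_t *addr1, uint32_t inc)
{
    *addr0 += inc;   /* vld1.32 {d21},[r10]; vadd.i32; vst1.32 {d21},[r10] */
    *addr1 += inc;   /* vld1.32 {d21},[r11]; vadd.i32; vst1.32 {d21},[r11] */
}
```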
The scalar loop is composed of six instructions with one result per iteration:
loop:
ldr.w r1, [r2], #4
ldr.w r3, [r4, r1, lsl #2]
adds r3, #1
str.w r3, [r4, r1, lsl #2]
cmp r0, r2
bhi.n 118 <loop>
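For reference, the scalar loop corresponds to a simple histogram update: r2 walks an array of indices, r4 is the table base, and each iteration bumps one bucket. A minimal C equivalent (array and variable names are my assumptions):

```c
#include <stddef.h>
#include <stdint.h>

/* C equivalent of the scalar loop: one bucket update per iteration.
 * idx, hist, and n are illustrative names. */
static void histogram(const uint32_t *idx, uint32_t *hist, size_t n)
{
    for (size_t i = 0; i < n; i++)
        hist[idx[i]]++;   /* ldr index; ldr bucket; adds; str bucket */
}
```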
By my naïve count, the vector sequence spends roughly three instructions per result, whereas the scalar loop spends six. That should roughly halve the processing time, but instead I got a 2x increase. Even if I unroll the vector sequence four or eight times further, it is still about twice as slow as the scalar loop.
I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:
- data is moved frequently between ARM core registers and NEON registers
- the memory-access pattern may cause cache problems that hurt the NEON side more than the ARM side
- the load/store sequence may cause hazards in the ARM and NEON pipelines
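Beyond the factors above, there is a structural reason a histogram resists vectorization (this is my framing, not from the original post): the bucket indices are data-dependent, ARMv7 NEON has no gather/scatter loads, and even a simulated gather-add-scatter is incorrect when two lanes carry the same index, because the parallel read-modify-write loses an increment. The sketch below simulates four lanes in plain C to show the lost update:

```c
#include <stdint.h>

/* Simulates a naive 4-lane gather / add / scatter histogram step.
 * If two lanes hold the same index, both lanes gather the same old
 * value, so one of the two increments is lost on scatter. */
static void naive_vector_hist(const uint32_t *idx, uint32_t *hist)
{
    uint32_t lanes[4];
    for (int l = 0; l < 4; l++)          /* "gather" */
        lanes[l] = hist[idx[l]];
    for (int l = 0; l < 4; l++)          /* vector add */
        lanes[l] += 1;
    for (int l = 0; l < 4; l++)          /* "scatter" */
        hist[idx[l]] = lanes[l];
}
```

With indices {3, 3, 0, 1}, bucket 3 should end at 2 but ends at 1: both lanes gathered 0, added 1, and the second scatter overwrote the first.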
Since I have just started working with the NEON architecture, these guesses may be far from the real problem. So, I would appreciate any pointers that help produce effective NEON intrinsic code.
Thanks, Julio de Melo
Tools: arm-linux-gnueabihf-gcc-8.2.1
Thanks to both Nate and artless for the comments. Yes, Nate, you're right, it's just a basic histogram loop. As you both point out, vectorizing it won't help; case closed. Thanks again.