Difference between vld1.32 {d20-d21} and vld1q q10?

1.6k Views Asked by At

I was looking at some ARM disassemblies for a few ARM dev-boards we test on. They were produced with NEON intrinsic vld1q_u32 using -march=armv7-a -mfloat-abi=hard -mfpu=neon.

One one particular machine with NEON we see (/proc/cpuinfo half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm):

 0: b5f0        push    {r4, r5, r6, r7, lr}
...
20: f964 4a8f   vld1.32 {d20-d21}, [r4]

On another NEON machine we see (/proc/cpuinfo : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt):

 0:   e92d 4ff0       stmdb   sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
...
28:   f964 2a8f       vld1.32 {d18-d19}, [r4]

And on a ARMv8 machine we see (/proc/cpuinfo : fp asimd evtstrm aes pmull sha1 sha2 crc32):

 0:   3dc00021        ldr     q1, [x1]
...
10:   3dc00c22        ldr     q2, [x1,#48]
14:   3dc01023        ldr     q3, [x1,#64]

I understand the 2-D and 1-Q are simply different views of the same bank of registers. What I am not clear on is why ARMv7 NEON is performing the multiple register load instead of a 1Q load.

My question is, what is the difference between the vld1.32 {2-D} and vld1q.32 1-Q. Or why is the compiler not generating the 1-Q loads in all cases?

1

There are 1 best solutions below

0
On

The difference here lies between 32 bit ARM (aka AArch32) and AArch64.

The fact that 2 D registers are aliased over one Q register is true for 32 bit mode, but not in 64 bit mode. In AArch64, dX is the first half of qX, not of q(X/2) as in AArch32, and there's no d register name for addressing the upper half of the q register.

If you, in AArch32, assemble the instruction vld1.32 {q0}, [r0], it will turn into the same opcode f920 0a8f (in thumb mode) as you get if you assemble vld1.32 {d0-d1}, [r0]. So it's basically up to the disassembler to choose which form it prefers to use for display (although there may be guidelines for disassemblers, saying it should prefer to use the D register form).

On AArch64, the two forms are distinct since the registers aren't aliased in the same way, so if you ask for a 128 bit load to a Q register, that's what you get and there's no ambiguity about it.