XOR all elements/lanes of NEON vector/register (pairwise?) in assembly on ARM Cortex A8


I'm not sure what the exact nomenclature is here, but here's the question:

I'm working on a checksum, and I want to take a number of different 32-bit values, store them in the elements of one or more NEON vectors, XOR them together, and then pass the result back to an ARM register for further computation. [The checksum has a number of different blocks based on a nonce, so I want to XOR these secondary results "into" the nonce, without losing entropy].

I'm not worried about performance (although fewer operations is always preferable, as is minimizing stalls of the ARM; the NEON can stall all it needs to), or the fact that this is not a particularly vectorizable operation; I need to use the NEON unit for this.

It would be ideal if there were some sort of horizontal XOR, wherein it would XOR the [4] elements of the vector with each other, and return the result, but that doesn't exist. I could obviously do something like (excuse the brutal pseudo-code):

load value1 s0
load value2 s2
veor d2, d0, d1
load value3 s0
load value4 s2
veor d0, d0, d1
veor d0, d0, d2

But is there a better way? I know there's pairwise addition, but seemingly no pairwise XOR. I'm flexible as far as using as many register lanes or registers as possible.

TL;DR: I need to do: res = val1 ^ val2 ^ val3 ^ val4 on the NEON, which is probably dumb, but I'm looking for the least-dumb way of doing it.

Thanks!

On BEST ANSWER

The NEON way of doing it. For better performance you need to unroll the loop, because as written each instruction uses a result (starting with the freshly loaded data) that is not ready yet, so the NEON pipeline stalls.

vld1.32 {q0},[r0]!         ; load 4 32-bit values into q0 (d0 = values 0,1; d1 = values 2,3)
veor d0,d0,d1              ; XOR 2 pairs of values (0<-2, 1<-3)
vext.8 d1,d0,d0,#4         ; copy the "high" element of d0 into the low element of d1
veor d0,d0,d1              ; now element 0 of d0 has all 4 values XOR'd together
vmov.32 r2,d0[0]           ; transfer back to an ARM register
str r2,[r1],#4             ; store in output and advance the output pointer
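For readers who work in C rather than assembly, the same fold can be sketched with NEON intrinsics. This is an illustration under stated assumptions, not the answer's own code: the helper name `xor4` is hypothetical, and the scalar branch exists only so the snippet also compiles on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: XOR the four 32-bit words at p down to one word. */
static inline uint32_t xor4(const uint32_t *p)
{
    uint32x4_t q = vld1q_u32(p);                  /* load 4 words (vld1)            */
    uint32x2_t d = veor_u32(vget_low_u32(q),
                            vget_high_u32(q));    /* lanes: {p0^p2, p1^p3}          */
    d = veor_u32(d, vext_u32(d, d, 1));           /* swap lanes (vext), fold again  */
    return vget_lane_u32(d, 0);                   /* lane 0 to a core register      */
}
#else
/* Assumed fallback for non-NEON builds: the same two folding steps, per lane. */
static inline uint32_t xor4(const uint32_t *p)
{
    uint32_t lane0 = p[0] ^ p[2];                 /* first fold, lane 0             */
    uint32_t lane1 = p[1] ^ p[3];                 /* first fold, lane 1             */
    return lane0 ^ lane1;                         /* second fold                    */
}
#endif
```

Calling `xor4(vals)` on an array of four `uint32_t` values returns `vals[0] ^ vals[1] ^ vals[2] ^ vals[3]`; the compiler chooses the registers, so the exact instruction sequence may differ from the listing above.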

The ARM way of doing it. It loads the data a little more slowly, but it avoids the delay of waiting for the transfer from NEON to ARM registers.

ldmia r0!,{r4-r7}      ; load 4 32-bit values
eor r4,r4,r5
eor r4,r4,r6
eor r4,r4,r7           ; XOR all 4 values together
str r4,[r1],#4         ; store in output and advance the output pointer

If you can count on processing multiple groups of four 32-bit values, then NEON can give you an advantage: load up a bunch of registers first, then process them. If you're just calling a function that works on four integers, the ARM version may well perform better.
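One way the multi-group case might look, sketched in C with NEON intrinsics: two groups are folded per iteration and both results are written straight to memory, so nothing has to cross from NEON to an ARM register. The function name `xor4_groups` and its signature are assumptions made for illustration, and the scalar loop doubles as the fallback on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = XOR of in[4*i .. 4*i+3] for each i < ngroups. */
static void xor4_groups(const uint32_t *in, uint32_t *out, size_t ngroups)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Two groups per iteration; results stay in NEON registers until stored. */
    for (; i + 2 <= ngroups; i += 2) {
        uint32x4_t a = vld1q_u32(in + 4 * i);        /* group i                 */
        uint32x4_t b = vld1q_u32(in + 4 * i + 4);    /* group i+1               */
        uint32x2_t pa = veor_u32(vget_low_u32(a), vget_high_u32(a));
        uint32x2_t pb = veor_u32(vget_low_u32(b), vget_high_u32(b));
        /* Transpose: val[0] = {pa[0], pb[0]}, val[1] = {pa[1], pb[1]}. */
        uint32x2x2_t t = vtrn_u32(pa, pb);
        vst1_u32(out + i, veor_u32(t.val[0], t.val[1])); /* two results at once */
    }
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < ngroups; i++) {
        const uint32_t *p = in + 4 * i;
        out[i] = p[0] ^ p[1] ^ p[2] ^ p[3];
    }
}
```

The transpose step pairs up the partial results of two groups so that a single XOR finishes both reductions; whether this actually beats the plain ARM loop depends on the data size and should be measured on the target.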