_mm_shuffle_epi8 equivalent on ARM machines


In a project focused on improving performance on ARM, I am using the _mm_shuffle_epi8 implementation from https://github.com/f4exb/cm256cc/blob/master/sse2neon.h#L981.

However, that implementation is suboptimal and costs performance.

Is there a proper equivalent of _mm_shuffle_epi8 on ARM?


There are 2 best solutions below

1
Jake 'Alquimista' LEE On

vtbl2 (and possibly vtbx2) is exactly what you are looking for.

But beware: these instructions have a long latency, especially on the Cortex-A57 and Cortex-A72 (in AArch64 mode). On the A57 they are not even pipelined.

I myself try to avoid them at all costs: too pricey.

NEON's permutation instructions are superior to AVX's. Maybe you can find a workaround using them.

PS: SSE2NEON is not a good idea at all, IMO, and the way the code you linked does it is just horrible.

8
Aki Suihkonen On

The equivalent should be something like:

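#include <arm_neon.h>

uint8x16_t shuffle_epi8(uint8x16_t table, uint8x16_t index) {
    // 0xFF in every lane whose index byte has bit 7 set, 0x00 elsewhere
    int8x16_t mask = vshrq_n_s8(vreinterpretq_s8_u8(index), 7);
    // Only the low 4 bits select a table byte
    index = vandq_u8(index, vdupq_n_u8(15));
    index = vqtbl1q_u8(table, index);
    // Zero the lanes flagged by the mask
    return vbicq_u8(index, vreinterpretq_u8_s8(mask));
}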
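To check a NEON port like the one above against the x86 behaviour, a scalar model of _mm_shuffle_epi8 can help. The following is only a sketch: the `Bytes16` alias and the function name `shuffle_epi8_ref` are hypothetical and exist purely for illustration. It encodes the rule that an output byte is zero when bit 7 of its index byte is set, and is otherwise the table byte selected by the low 4 bits of the index.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: 16 bytes standing in for one 128-bit vector.
using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of _mm_shuffle_epi8 (illustrative, not tuned for speed).
// Each output byte is 0 when bit 7 of the index byte is set,
// otherwise it is table[index & 15].
inline Bytes16 shuffle_epi8_ref(const Bytes16& table, const Bytes16& index) {
    Bytes16 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        const std::uint8_t idx = index[i];
        out[i] = (idx & 0x80) ? std::uint8_t{0} : table[idx & 0x0F];
    }
    return out;
}
```

Feeding the same table and index bytes to this model and to the NEON version, then comparing the 16 output bytes, is one way to test the port on real hardware.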

On ARMv7, one needs to emulate the 16-byte-wide table lookup with:

// ARMv7 has no vqtbl1q_u8; build it from two 8-byte lookups into a 16-byte table.
inline uint8x16_t vqtbl1q_u8(uint8x16_t table, uint8x16_t idx) {
    uint8x8x2_t table2{vget_low_u8(table), vget_high_u8(table)};
    uint8x8_t lo = vtbl2_u8(table2, vget_low_u8(idx));   // result bytes 0..7
    uint8x8_t hi = vtbl2_u8(table2, vget_high_u8(idx));  // result bytes 8..15
    return vcombine_u8(lo, hi);
}
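A possible simplification, offered as an assumption to verify rather than a measured result: both vqtbl1q_u8 and vtbl2_u8 return 0 for any index outside the 16-byte table. Masking the index with 0x8F instead of 15 keeps bit 7, so a flagged index stays out of range and the lookup zeroes that lane by itself, i.e. `vqtbl1q_u8(table, vandq_u8(index, vdupq_n_u8(0x8F)))` with no shift and no vbicq_u8. The scalar sketch below models that reasoning one lane at a time; all function names in it are hypothetical.

```cpp
#include <cstdint>

// Models one lane of a 16-byte table lookup: indices >= 16 yield 0.
inline std::uint8_t tbl_lane_model(const std::uint8_t (&table)[16], std::uint8_t idx) {
    return idx < 16 ? table[idx] : std::uint8_t{0};
}

// One lane of _mm_shuffle_epi8, straight from its definition.
inline std::uint8_t pshufb_lane(const std::uint8_t (&table)[16], std::uint8_t idx) {
    return (idx & 0x80) ? std::uint8_t{0} : table[idx & 0x0F];
}

// Lookup with the index masked to bit 7 plus the low 4 bits.
inline std::uint8_t masked_tbl_lane(const std::uint8_t (&table)[16], std::uint8_t idx) {
    return tbl_lane_model(table, static_cast<std::uint8_t>(idx & 0x8F));
}

// Returns true when the masked lookup agrees with the definition
// for every one of the 256 possible index bytes.
inline bool masked_tbl_matches_pshufb(const std::uint8_t (&table)[16]) {
    for (int i = 0; i < 256; ++i) {
        const std::uint8_t idx = static_cast<std::uint8_t>(i);
        if (masked_tbl_lane(table, idx) != pshufb_lane(table, idx)) {
            return false;
        }
    }
    return true;
}
```

Without the 0x8F mask the two would differ: an index such as 0x13 is out of range for the raw lookup and gives 0, while _mm_shuffle_epi8 ignores bits 4 to 6 and returns table[3]. Whether dropping the two extra instructions is a real win depends on the core, so it is worth benchmarking.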