In a project focused on improving performance on ARM, I am using the _mm_shuffle_epi8 implementation from this page: https://github.com/f4exb/cm256cc/blob/master/sse2neon.h#L981
But that implementation is suboptimal and costs performance.
Is there a proper equivalent of _mm_shuffle_epi8 on ARM?
vtbl2 (and possibly vtbx2) is exactly what you are looking for.

But beware, these instructions come with a long latency, especially on the Cortex-A57 and Cortex-A72 (in AArch64 mode). They don't even pipeline on the A57. I myself try to avoid them at all costs: too pricey.
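For reference, here is a minimal sketch of how the table lookup could stand in for _mm_shuffle_epi8. The function names shuffle_bytes16 and shuffle_bytes16_demo are made up for illustration, and the sketch assumes you only need the SSSE3 semantics (the low 4 bits select a byte, bit 7 zeroes the lane). It relies on the table lookup returning 0 for out-of-range indices, so masking the control bytes with 0x8F should be enough. On AArch64 it uses vqtbl1q_u8, on 32-bit ARM two vtbl2_u8 lookups, and there is a scalar fallback so it also builds on non-ARM hosts. Treat it as a starting point and measure it, not as a guaranteed win.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SHUFFLE_USE_NEON 1
#endif

/* Hypothetical helper: dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F] */
static void shuffle_bytes16(const uint8_t src[16], const uint8_t mask[16],
                            uint8_t dst[16])
{
#if defined(SHUFFLE_USE_NEON)
    uint8x16_t tbl = vld1q_u8(src);
    /* Keep bit 7 and the low nibble: indices >= 16 make the lookup return 0. */
    uint8x16_t idx = vandq_u8(vld1q_u8(mask), vdupq_n_u8(0x8F));
#if defined(__aarch64__)
    vst1q_u8(dst, vqtbl1q_u8(tbl, idx));
#else
    /* ARMv7: a 16-byte table is two D registers, looked up 8 lanes at a time. */
    uint8x8x2_t t;
    t.val[0] = vget_low_u8(tbl);
    t.val[1] = vget_high_u8(tbl);
    uint8x8_t lo = vtbl2_u8(t, vget_low_u8(idx));
    uint8x8_t hi = vtbl2_u8(t, vget_high_u8(idx));
    vst1q_u8(dst, vcombine_u8(lo, hi));
#endif
#else
    /* Scalar fallback with the same semantics, for non-NEON builds. */
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    memcpy(dst, tmp, sizeof tmp);
#endif
}

/* Usage sketch: reverse the 16 bytes, zeroing lane 0 via bit 7 of the mask. */
void shuffle_bytes16_demo(void)
{
    uint8_t src[16], mask[16], dst[16];
    for (int i = 0; i < 16; i++) {
        src[i] = (uint8_t)(i + 1);
        mask[i] = (uint8_t)(15 - i);
    }
    mask[0] = 0x80;
    shuffle_bytes16(src, mask, dst);
    assert(dst[0] == 0);   /* bit 7 set, lane cleared */
    assert(dst[1] == 15);  /* src[14] */
    assert(dst[15] == 1);  /* src[0] */
}
```

If the shuffle mask is a compile-time constant, it is usually worth checking whether one of the dedicated permutes (vrev, vext, vzip, vuzp, vtrn) does the job instead.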
NEON has superior permutation instructions over AVX. Maybe you can find a workaround.

PS: SSE2NEON... not a good idea at all, IMO. And the way the code you linked does it is just horrible.