I'm looking for the fastest way to divide an __m256i of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2.
As far as I know, my options are:
- Drop down to SSE2
- Something like AVX __m256i integer division for signed 32-bit elements
If I need to drop down to SSE2, I'd appreciate the best SSE2 implementation. If it's the second option, I'd like to know which intrinsics to use, and whether there's a more optimized implementation specifically for dividing by 2. Thanks!
Assuming you know what you’re doing, here’s that function.
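A minimal sketch of such a function, assuming GCC or Clang and a CPU with AVX but not AVX2. It splits the 256-bit vector into two 128-bit halves, shifts each half with the SSE2 instruction, and recombines them. The names srai1_epi32_avx, div2_epi32_avx, the TARGET_AVX macro and the two array wrappers are hypothetical, chosen for illustration only.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. The target attribute lets these functions use AVX
   even when the file is built without -mavx. Other compilers need their own
   AVX switch (e.g. /arch:AVX for MSVC). */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Arithmetic shift right by one: rounds toward negative infinity. */
TARGET_AVX static inline __m256i srai1_epi32_avx(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);       /* low half, no instruction */
    __m128i hi = _mm256_extractf128_si256(v, 1);  /* high half, vextractf128 */
    lo = _mm_srai_epi32(lo, 1);
    hi = _mm_srai_epi32(hi, 1);
    /* Recombine: low half via cast, high half via vinsertf128. */
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

/* Division by two that truncates toward zero, like the C '/' operator:
   add the sign bit (1 for negative lanes) before the arithmetic shift. */
TARGET_AVX static inline __m256i div2_epi32_avx(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extractf128_si256(v, 1);
    lo = _mm_srai_epi32(_mm_add_epi32(lo, _mm_srli_epi32(lo, 31)), 1);
    hi = _mm_srai_epi32(_mm_add_epi32(hi, _mm_srli_epi32(hi, 31)), 1);
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

/* Plain-C entry points working on 8 unaligned int32_t values. */
TARGET_AVX void srai1_epi32_array8(const int32_t *src, int32_t *dst)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, srai1_epi32_avx(v));
}

TARGET_AVX void div2_epi32_array8(const int32_t *src, int32_t *dst)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, div2_epi32_avx(v));
}
```

Note that "shift right by one" and "divide by two" differ for negative odd values: the arithmetic shift turns -3 into -2, while truncating division gives -1. Pick whichever of the two variants matches the rounding you need.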
However, doing that is not necessarily faster than working with 16-byte vectors directly. On most CPUs the 128-bit insert/extract instructions (vinsertf128/vextractf128) aren't cheap, because they move data across the two halves of the register; AMD Zen 1 may be an exception.
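If the data lives in memory anyway, the 16-byte route can skip the extract/insert step entirely. The following is only a sketch under that assumption; srai1_inplace_sse2 is a hypothetical name, and whether it beats the 256-bit version on a given CPU has to be measured.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic shift right by one, in place, over n 32-bit integers,
   using only 128-bit SSE2 loads, shifts and stores. */
void srai1_inplace_sse2(int32_t *data, size_t n)
{
    size_t i = 0;
    /* Main loop: four integers per iteration, unaligned access allowed. */
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        _mm_storeu_si128((__m128i *)(data + i), _mm_srai_epi32(v, 1));
    }
    /* Scalar tail. Assumption: the compiler implements >> on negative
       signed values as an arithmetic shift (true for GCC, Clang, MSVC). */
    for (; i < n; ++i)
        data[i] >>= 1;
}
```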