Unpack m128i/m256i to m64 (MMX, SSE2, AVX2)

I have memory organized like this:

block1(m64), block2(m64), block3(m64), block4(m64), ....

Now I do in a for loop this operation:

iteration 1.....

    x = block1 XOR block2
    y = block1 AND block2
    block1 = x
    block2 = y

iteration 2.....

    x = block3 XOR block4
    y = block3 AND block4
    block3 = x
    block4 = y

And so on...
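In code, the current per-pair loop looks roughly like this (a sketch with hypothetical names, using the MMX intrinsics from mmintrin.h):

#include <mmintrin.h>

// sketch of the current per-pair loop: one __m64 pair per iteration
static void process_pairs_mmx(__m64 *blocks, int nblocks)
{
    for (int i = 0; i < nblocks; i += 2) {
        __m64 x = _mm_xor_si64(blocks[i], blocks[i+1]);
        __m64 y = _mm_and_si64(blocks[i], blocks[i+1]);
        blocks[i]   = x;
        blocks[i+1] = y;
    }
    _mm_empty();   // leave MMX state so later x87 code still works
}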

Now I tried combining the __m64 blocks into __m128i blocks:

block1_block3(m128), block2_block4(m128),....

Now I can use 128-bit SIMD instructions, and the loop needs only half as many instructions as the __m64 version.

But the bad thing is that I can't cast the memory directly to an __m128i/__m256i because the __m64 values aren't contiguous. So I will need to assemble and disassemble the values like this:

// combine two 128 bit values into one 256 bit number
__m256i static inline iCombine_128_256(__m128i *a, __m128i *b)
{
  __m256i ret = _mm256_castsi128_si256(*a);
  return _mm256_inserti128_si256(ret, *b, 1);
}

// combine four 64 bit values into one 256 bit number
__m256i static inline iCombine_64_256(__m64 *a, __m64 *b, __m64 *c, __m64 *d)
{
  __m256i ret = _mm256_castsi128_si256(_mm_set_epi64(*b, *a));
  return _mm256_inserti128_si256(ret, _mm_set_epi64(*d, *c), 1);
}

// combine eight 32 bit values into one 256 bit number
__m256i static inline iCombine_32_256(unsigned int *a, unsigned int *b, unsigned int *c, unsigned int *d, unsigned int *e, unsigned int *f, unsigned int *g, unsigned int *h)
{
  __m256i ret = _mm256_castsi128_si256(_mm_set_epi32(*d, *c, *b, *a));
  return _mm256_inserti128_si256(ret, _mm_set_epi32(*h, *g, *f, *e), 1);
}
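And the matching "disassemble" direction would be something like this (a sketch; the helper name is made up, and _mm256_extracti128_si256 needs AVX2):

// split one 256 bit number back into four 64 bit blocks (sketch)
void static inline iSplit_256_64(__m256i v, __m64 *a, __m64 *b, __m64 *c, __m64 *d)
{
  __m128i lo = _mm256_castsi256_si128(v);
  __m128i hi = _mm256_extracti128_si256(v, 1);
  _mm_storel_epi64((__m128i*)a, lo);                          // low 64 bits of lo
  _mm_storel_epi64((__m128i*)b, _mm_unpackhi_epi64(lo, lo));  // high 64 bits of lo
  _mm_storel_epi64((__m128i*)c, hi);
  _mm_storel_epi64((__m128i*)d, _mm_unpackhi_epi64(hi, hi));
}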

So this takes some extra instructions to assemble these blocks. Isn't there a way to "cheat" an __m256i? Let's say I give x.m256i_u64[0] the pointer to block1, x.m256i_u64[1] the pointer to block2, ... and in the end it shows me the assembled __m256i value of these four __m64 values? Is this somehow possible?

1 Answer

The _mm_set_epi64() intrinsics aren't magic. They compile to loads or shuffles. Giving the compiler multiple pointers to sort out is usually the wrong approach when manually vectorizing: figure out what SSE/AVX shuffles you can use after doing vector loads.
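For example (a sketch, hypothetical helper names): when the two 64-bit elements are adjacent in memory, a decent compiler folds the set into a single 128-bit load, but when they aren't, it has to emit separate loads plus a shuffle like punpcklqdq:

#include <immintrin.h>
#include <stdint.h>

// adjacent elements: typically compiles to one movdqu / vmovdqu load
static __m128i load_pair_adjacent(const uint64_t *p) {
    return _mm_set_epi64x(p[1], p[0]);
}

// non-adjacent elements: needs two loads plus a shuffle (or movq + pinsrq)
static __m128i load_pair_strided(const uint64_t *p) {
    return _mm_set_epi64x(p[2], p[0]);
}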

For 128b SSE2, (or AVX with -mprefer-avx128) gcc does a reasonable job auto-vectorizing a simple scalar implementation if it knows that the pointer is at least 16B-aligned. (So a pair of blocks that need to be processed together will be in the same aligned 16B block). I don't see a better way, and it may be slightly faster than scalar 64-bit. Clang strangely doesn't auto-vectorize unless it has AVX512 (for vpermt2q).

(With AVX2, gcc shuffles way too much. reported as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137. See my manually-vectorized version below which should be more than 2x faster than scalar or SSE2 on Haswell.)

See all the source on the Godbolt compiler explorer, to see how it gets vectorized.

// scalar version for compilers to autovectorize
#include <stdint.h>

void foo(uint64_t blocks[]) {
    // tell gcc the pointer is 64-byte aligned, to get a simpler auto-vectorization strategy.
    blocks = __builtin_assume_aligned(blocks, 64);
    for (intptr_t i = 0 ; i<10240 ; i+=2) {
        uint64_t x = blocks[i];
        uint64_t y = blocks[i+1];
        blocks[i] = x^y;
        blocks[i+1] = x&y;
    }
}

Are your pointers aligned to at least 128b in your use-case? You should probably try to make that happen, so a "pair" isn't split across a cache-line boundary. The SSE2 version can use aligned loads/stores, or memory operands to SSE instructions instead of separate loads.
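If you control the allocation, something along these lines gives you 64B-aligned storage (a sketch using C11 aligned_alloc; _mm_malloc or posix_memalign work too):

#include <stdlib.h>
#include <stdint.h>

// allocate n uint64_t blocks with 64-byte alignment so a pair never splits a cache line
// (sketch: aligned_alloc wants the size to be a multiple of the alignment, so round up)
static uint64_t *alloc_blocks(size_t n)
{
    size_t bytes = (n * sizeof(uint64_t) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);
}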

There are many different ways to auto-vectorize anything. You could even consider doing unaligned overlapping loads to get a 2nd vector with blocks[0] and blocks[2] lined up in the low 64b of every 128b lane. (Load throughput is generally very good for L1 cache hits on modern CPUs. It's worth considering using unaligned loads to reduce shuffling, but I don't think it's the best option in this case for AVX2).

First let's look at scalar (or in 32-bit code, using SSE2 to do 64-bit scalar integer math. gcc -m32 does exactly that with unaligned pointers and no AVX or -mprefer-avx128):

mov   # load
mov   # load
mov   # copy a register
and
xor
mov   # store
mov   # store

per 128b pair: 7 fused-domain uops (all the instructions are single-uop). 2xload, 2xstore, 3x ALU (or less if the mov doesn't need a port). The front-end can issue 7 uops in 1.75c (or less on Ryzen). Store throughput bottlenecks at 1 per clock on all current CPUs, so with enough loop unrolling you can do about 1 pair per 2 clocks with scalar x86-64, MMX, or scalar SSE2 even on old CPUs like Core2 or Bulldozer.


SSE2

This is how gcc auto-vectorizes, processing 2 pairs per loop iteration. It looks nice with AVX-128, but with SSE2 it needs 3 extra movdqa instructions to copy registers before destroying them as a combined src/destination. (See next section for a manually-vectorized version that should be better).

b0     b1      # load128
b2     b3      # load128

               # movdqa copy a reg
b0     b2      # punpcklqdq
b1     b3      # punpckhqdq

               # movdqa copy a reg
b0&b1  b2&b3   # pand
b0^b1  b2^b3   # pxor

               # movdqa copy a reg
b0^b1  b0&b1   # punpcklqdq
               # store 128
b2^b3  b2&b3   # punpckhqdq
               # store 128

13 fused-domain uops. (3.25c front-end cycles on CPUs other than Ryzen). 4x shuffle, 2xload, 2x store, 2x boolean. 3x reg-reg copy, which either uses an ALU execution port or it doesn't, depending on the CPU. But it doesn't matter here, 5 ALU uops in 3.25 cycles is not a bottleneck.

gcc -m32 makes the interesting choice to use punpckh/l with the same memory operand twice, instead of a separate movdqa load for the 2nd vector. This saves a fused-domain uop (because punpck can micro-fuse), but keeps the load port busier. Still, not a bottleneck.

Intel Haswell and later bottleneck on 1 shuffle per clock, so they have 4c throughput, or 2c per pair, same as scalar (but it's probably easier to come close to that limit, and might hit it even without loop unrolling.)

AMD CPUs, and Intel Core2 to IvyBridge, can do 2x 128b shuffles per clock, so they just bottleneck on the front-end at 3.25c + loop overhead, not on any particular port. With a bit of loop overhead, that's maybe 1.75c per pair. (Or Ryzen can do about 5 uops per clock running single-uop instructions, so two pairs per ~2.6 cycles, or 1 pair per ~1.3 cycles + overhead).

With AVX-128, and micro-fused loads, it's 9 fused-domain uops (2.25c + loop overhead to issue). Still 4x shuffles, and requires AVX1, but this is excellent for Sandybridge and AMD. About 1.125c + loop overhead per pair on SnB.
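For reference, the unpack-based pattern from the diagram above looks like this in intrinsics (a sketch; with plain SSE2 the compiler adds the movdqa copies itself, and compiled with AVX-128 it becomes the 9-uop version):

#include <immintrin.h>
#include <stdint.h>

// unpack-based version (sketch): 2 pairs per iteration, 4 shuffles
void pairs_u64_punpck(uint64_t blocks[]) {
    for (int i = 0 ; i<10240 ; i+=4) {
        __m128i v01 = _mm_loadu_si128((__m128i*)&blocks[i]);    // b0 b1
        __m128i v23 = _mm_loadu_si128((__m128i*)&blocks[i+2]);  // b2 b3
        __m128i lo  = _mm_unpacklo_epi64(v01, v23);             // b0 b2
        __m128i hi  = _mm_unpackhi_epi64(v01, v23);             // b1 b3
        __m128i x   = _mm_xor_si128(lo, hi);                    // b0^b1  b2^b3
        __m128i a   = _mm_and_si128(lo, hi);                    // b0&b1  b2&b3
        _mm_storeu_si128((__m128i*)&blocks[i],   _mm_unpacklo_epi64(x, a));  // b0^b1  b0&b1
        _mm_storeu_si128((__m128i*)&blocks[i+2], _mm_unpackhi_epi64(x, a));  // b2^b3  b2&b3
    }
}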


SSE2/SSE3 manual vectorization

The biggest problem with the SSE2 version above is all the extra movdqa instructions to copy registers before destroying them.

We can take advantage of the nature of AND and XOR to save some asm instructions. x&x = x, and x ^ 0 = x.

This version might be good on Haswell, using 3 loads of the same data. But on other CPUs (including AMD), that many loads + stores will be the bottleneck.

x     x      # movddup load  (SSE3)
x     x&y    # pand [mem]
y     0      # movq load
x^y   x&y    # pxor ([x x&y], [y 0])
           store
5 uops (1.25c front-end),  3 loads + 1 store (1.5c HSW, or 2c AMD/SnB, or 3c NHM)
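In intrinsics, that 3-load idea would look roughly like this (a sketch with a hypothetical name; whether the compiler really uses a movddup load and a pand with a memory operand is up to it):

#include <immintrin.h>
#include <stdint.h>

// 3-load version (sketch): broadcast x, AND against the [x y] pair in memory,
// then XOR against [y 0] from a movq load
void pair_u64_3loads(uint64_t blocks[]) {
    for (int i = 0 ; i<10240 ; i+=2) {
        __m128i xx = _mm_set1_epi64x(blocks[i]);                   // x    x    (movddup load)
        __m128i v  = _mm_loadu_si128((__m128i*)&blocks[i]);        // x    y
        __m128i xa = _mm_and_si128(xx, v);                         // x    x&y
        __m128i y0 = _mm_loadl_epi64((__m128i*)&blocks[i+1]);      // y    0    (movq load)
        _mm_storeu_si128((__m128i*)&blocks[i], _mm_xor_si128(xa, y0));  // x^y  x&y
    }
}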

Or this version is a good balance between load and shuffle. It's actually really good on pre-AVX2 hardware.

x     y      # load
x     x      # movddup or pshufd  to copy+shuffle
x     x&y    # pand
y     0      # movq load or PSRLDQ by 8 bytes
x^y   x&y    # pxor
           store
6 uops (1.5c front-end + loop overhead)
  movq-load version:  2 loads + 1 store (1c HSW, 1.5c AMD/SNB, 2c NHM)
  PSRLDQ version:  1 load + 1 store, 2 shuffles, 2 boolean: (2c HSW, 1.33c AMD and Intel NHM/SnB)

So the front-end is the bottleneck for the 2-shuffle version, even on Nehalem, which can't do 2 loads per clock. On CPUs without AVX2, this may be measurably better than scalar:

#include <immintrin.h>
#include <stdint.h>

void pair_u64_sse2(uint64_t blocks[]) {
    // take advantage of x&x = x
    // and  x&y ^ 0  = x&y
    for (int i = 0 ; i<10240 ; i+=2) {
        __m128i v = _mm_loadu_si128((__m128i*)&blocks[i]);
        __m128i dup  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1,0, 1,0));   // x    x
        __m128i vand = _mm_and_si128(v, dup);      // x    x&y
        __m128i y    = _mm_srli_si128(v, 8);       // y    0
        __m128i vxor = _mm_xor_si128(vand, y);     // x^y  x&y
        _mm_storeu_si128((__m128i*)&blocks[i], vxor);
    }
}

On the godbolt link, look at the clang tab for non-AVX asm output. gcc uses an extra movdqa for no reason, but clang succeeds at not wasting instructions. With loop unrolling, it should approach 1 vector per 1.5 clocks (if data is hot in cache), on Intel pre-Haswell or some AMD CPUs. On Ryzen, maybe even better than that.


AVX2

This is where compilers do a terrible job, see the gcc bug report I filed, linked earlier.

Manually vectorizing for one 256b vector at a time, with a data-movement pattern like this should be good:

b0     b1       |    b2       b3       # load 256b
b1     b0       |    b3       b2       # vpshufd

b0^b1  b0^b1    |    b2^b3    b2^b3    # vpxor
b0&b1  b0&b1    |    b2&b3    b2&b3    # vpand

b0^b1  b0&b1    |    b2^b3    b2&b3    # vpblendd
                                       # store 256b

Here's a C/C++ intrinsics version:

#include <immintrin.h>
#include <stdint.h>

void pairs_u64_avx2(uint64_t blocks[]) {
    for (int i = 0 ; i<10240 ; i+=4) {
        __m256i v = _mm256_loadu_si256((__m256i*)&blocks[i]);
        __m256i swapped = _mm256_shuffle_epi32(v, _MM_SHUFFLE(1,0, 3,2));  // b1 b0 | b3 b2
        __m256i vand = _mm256_and_si256(v, swapped);
        __m256i vxor = _mm256_xor_si256(v, swapped);
        __m256i blend = _mm256_blend_epi32(vxor, vand, _MM_SHUFFLE(3,0,3,0));
        _mm256_storeu_si256((__m256i*)&blocks[i], blend);
    }
}

That's 6 fused-domain uops on Intel, and should easily run at 1 iter per 1.5 cycles (+ loop overhead), without bottlenecking on any ports. The bottleneck is the front-end, so unrolling helps.

That's 0.75 cycles per 128b pair on Haswell, plus loop overhead.

Immediate-blend can run on any port on HSW+, or p0/p5 on SnB (and good throughput on BD/Ryzen) so it's much more throughput-friendly than using vunpcklqdq to combine the AND / XOR result vectors.
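Since the front-end is the limiter, a bit of manual unrolling helps amortize the loop overhead (a sketch; letting the compiler unroll, e.g. with -funroll-loops, achieves the same thing):

#include <immintrin.h>
#include <stdint.h>

// same AVX2 body as above, unrolled by two (sketch)
void pairs_u64_avx2_unroll2(uint64_t blocks[]) {
    for (int i = 0 ; i<10240 ; i+=8) {
        __m256i v0 = _mm256_loadu_si256((__m256i*)&blocks[i]);
        __m256i v1 = _mm256_loadu_si256((__m256i*)&blocks[i+4]);
        __m256i s0 = _mm256_shuffle_epi32(v0, _MM_SHUFFLE(1,0, 3,2));
        __m256i s1 = _mm256_shuffle_epi32(v1, _MM_SHUFFLE(1,0, 3,2));
        __m256i r0 = _mm256_blend_epi32(_mm256_xor_si256(v0, s0), _mm256_and_si256(v0, s0), _MM_SHUFFLE(3,0,3,0));
        __m256i r1 = _mm256_blend_epi32(_mm256_xor_si256(v1, s1), _mm256_and_si256(v1, s1), _MM_SHUFFLE(3,0,3,0));
        _mm256_storeu_si256((__m256i*)&blocks[i],   r0);
        _mm256_storeu_si256((__m256i*)&blocks[i+4], r1);
    }
}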


Other abandoned ideas that didn't look promising

b0     b1                          load 128
b2     b3                          load 128
b0     b1       |    b4       b5   vinsertf128 y,y,m,1   (SKL: 2 uops, load + p015 ALU)
b2     b3       |    b6       b7   vinsertf128

nope, easier to get that with

b0     b1       |    b2       b3   v = load256 aligned
b4     b5       |    b6       b7   v2 = load256 aligned

b0     b1       |    b6       b7   vpblendd    //vinserti128 (v, v2)
b2     b3       |    b4       b5   vperm2i128  (v, v2)   (doesn't micro-fuse, unlike vpunpck, so not helpful to use with a memory operand)

 Then vpunpck l/h in-lane shuffles, then an AND/XOR,
 then 2x VPERMQ + 2x vpunpck?
 Or vpunpck and split 128b stores?  vmovdqa 128b + vextracti128

b0     b1      # load128
b1     b0      # pshufd   (copy+shuffle)

               # movdqa copy
b0&b1  b1&b0   # pand
movq           # store low half

b0^b1  b1^b0   # pxor
movq           # store low half

Basically no advantage over scalar.

Could maybe combine two vectors together and use movhps to store the high half? It needs a shuffle-port uop, though, so not much to gain over punpckhqdq or movhlps to combine two registers for a 128b store.