I have memory organized like this:
block1(m64), block2(m64), block3(m64), block4(m64), ....
Now, in a for loop, I do this operation:
iteration 1.....
x = block1 XOR block2
y = block1 AND block2
block1 = x
block2 = y
iteration 2.....
x = block3 XOR block4
y = block3 AND block4
block3 = x
block4 = y
And so on...
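In plain C (using uint64_t for the 64-bit blocks; the names here are just for illustration), the loop looks roughly like this:

for (int i = 0; i < nblocks; i += 2) {
    uint64_t x = blocks[i] ^ blocks[i+1];
    uint64_t y = blocks[i] & blocks[i+1];
    blocks[i]   = x;
    blocks[i+1] = y;
}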
Now I tried to combine the m64 blocks into m128i blocks:
block1_block3(m128), block2_block4(m128),....
Now I am able to use 128-bit SIMD instructions, and the for loop needs only half as many instructions as the m64 version.
But the bad thing is that I can't cast the memory directly to an m128i/m256i, because the m64 values that belong together aren't contiguous. So I will need to assemble and disassemble the values like this:
// combine two 128 bit values to one 256 bit number
static inline __m256i iCombine_128_256(__m128i *a, __m128i *b)
{
    __m256i ret = _mm256_castsi128_si256(*a);
    return _mm256_inserti128_si256(ret, *b, 1);
}

// combine four 64 bit values to one 256 bit number
static inline __m256i iCombine_64_256(__m64 *a, __m64 *b, __m64 *c, __m64 *d)
{
    __m256i ret = _mm256_castsi128_si256(_mm_set_epi64(*b, *a));
    return _mm256_inserti128_si256(ret, _mm_set_epi64(*d, *c), 1);
}

// combine eight 32 bit values to one 256 bit number
static inline __m256i iCombine_32_256(unsigned int *a, unsigned int *b, unsigned int *c, unsigned int *d,
                                      unsigned int *e, unsigned int *f, unsigned int *g, unsigned int *h)
{
    __m256i ret = _mm256_castsi128_si256(_mm_set_epi32(*d, *c, *b, *a));
    return _mm256_inserti128_si256(ret, _mm_set_epi32(*h, *g, *f, *e), 1);
}
So this will take some extra instructions to assemble these blocks. Isn't there a way to "cheat" an m256i? Let's say I give x.m256i_u64[0] the pointer of block1, x.m256i_u64[1] the pointer of block2, and so on, and in the end it shows me the assembled m256i value of these 4 m64 values. Is this somehow possible?
The _mm_set_epi64() intrinsics aren't magic. They compile to loads or shuffles. Giving the compiler multiple pointers to sort out is usually the wrong approach when manually vectorizing: figure out what SSE/AVX shuffles you can use after doing vector loads.

For 128b SSE2 (or AVX with -mprefer-avx128), gcc does a reasonable job auto-vectorizing a simple scalar implementation if it knows that the pointer is at least 16B-aligned (so a pair of blocks that need to be processed together will be in the same aligned 16B block). I don't see a better way, and it may be slightly faster than scalar 64-bit. Clang strangely doesn't auto-vectorize unless it has AVX512 (for vpermt2q).

(With AVX2, gcc shuffles way too much; reported as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137. See my manually-vectorized version below, which should be more than 2x faster than scalar or SSE2 on Haswell.)
See all the source on the Godbolt compiler explorer to check how it gets vectorized.
Are your pointers aligned to at least 128b in your use-case? You should probably try to make that happen, so a "pair" isn't split across a cache-line boundary. The SSE2 version can use aligned loads/stores, or memory operands to SSE instructions instead of separate loads.
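For example, a minimal sketch (assuming you control the allocation; nblocks is a placeholder): _mm_malloc from the intrinsics headers gives you a 32B-aligned buffer, so a whole 256b vector (two pairs) never splits a cache line.

#include <immintrin.h>
#include <stdint.h>

// 32B alignment: every aligned __m256i load/store stays within one cache line
uint64_t *blocks = (uint64_t*)_mm_malloc(nblocks * sizeof(uint64_t), 32);
// ... use blocks ...
_mm_free(blocks);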
There are many different ways to auto-vectorize anything. You could even consider doing unaligned overlapping loads to get a 2nd vector with blocks[0] and blocks[2] lined up in the low 64b of every 128b lane. (Load throughput is generally very good for L1 cache hits on modern CPUs. It's worth considering using unaligned loads to reduce shuffling, but I don't think it's the best option in this case for AVX2.)

First let's look at scalar (or in 32-bit code, using SSE2 to do 64-bit scalar integer math; gcc -m32 does exactly that with unaligned pointers and no AVX or -mprefer-avx128).

Per 128b pair: 7 fused-domain uops (all the instructions are single-uop): 2x load, 2x store, 3x ALU (or less if the mov doesn't need a port). The front-end can issue 7 uops in 1.75c (or less on Ryzen). Store throughput bottlenecks at 1 per clock on all current CPUs, so with enough loop unrolling you can do about 1 pair per 2 clocks with scalar x86-64, MMX, or scalar SSE2, even on old CPUs like Core2 or Bulldozer.
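A scalar reference version along these lines (a sketch with my own naming, not necessarily identical to what's on the Godbolt link) is what gcc can auto-vectorize once it knows the pointer is 16B-aligned:

#include <stddef.h>
#include <stdint.h>

void pair_xor_and(uint64_t *blocks, size_t nblocks)
{
    // promise gcc the pointer is 16B-aligned, so each pair sits in one aligned 16B chunk
    uint64_t *blk = (uint64_t*)__builtin_assume_aligned(blocks, 16);
    for (size_t i = 0; i < nblocks; i += 2) {
        uint64_t x = blk[i] ^ blk[i+1];
        uint64_t y = blk[i] & blk[i+1];
        blk[i]   = x;
        blk[i+1] = y;
    }
}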
SSE2
This is how gcc auto-vectorizes, processing 2 pairs per loop iteration. It looks nice with AVX-128, but with SSE2 it needs 3 extra movdqa instructions to copy registers before destroying them as a combined src/destination. (See next section for a manually-vectorized version that should be better).
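The pattern corresponds roughly to this intrinsics sketch (my naming; the extra movdqa copies counted below only appear in the compiler-generated SSE2 asm, because those instructions overwrite one of their source registers):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

// two pairs (four uint64_t) per iteration, assuming 16B-aligned data
void pair_sse2(uint64_t *blocks, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i += 4) {
        __m128i v0 = _mm_load_si128((const __m128i*)(blocks + i));      // [a0, b0]
        __m128i v1 = _mm_load_si128((const __m128i*)(blocks + i + 2));  // [a1, b1]
        __m128i lo = _mm_unpacklo_epi64(v0, v1);   // [a0, a1]
        __m128i hi = _mm_unpackhi_epi64(v0, v1);   // [b0, b1]
        __m128i x  = _mm_xor_si128(lo, hi);        // [a0^b0, a1^b1]
        __m128i y  = _mm_and_si128(lo, hi);        // [a0&b0, a1&b1]
        _mm_store_si128((__m128i*)(blocks + i),     _mm_unpacklo_epi64(x, y));  // [a0^b0, a0&b0]
        _mm_store_si128((__m128i*)(blocks + i + 2), _mm_unpackhi_epi64(x, y));  // [a1^b1, a1&b1]
    }
}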
13 fused-domain uops (3.25c of front-end cycles on CPUs other than Ryzen): 4x shuffle, 2x load, 2x store, 2x boolean, and 3x reg-reg copy, which either uses an ALU execution port or it doesn't, depending on the CPU. But it doesn't matter here: 5 ALU uops in 3.25 cycles is not a bottleneck.
gcc -m32 makes the interesting choice to use punpckh/l with the same memory operand twice, instead of a separate movdqa load for the 2nd vector. This saves a fused-domain uop (because punpck can micro-fuse), but keeps the load port busier. Still, not a bottleneck.

Intel Haswell and later bottleneck on 1 shuffle per clock, so they have 4c throughput, or 2c per pair, same as scalar (but it's probably easier to come close to that limit, and might hit it even without loop unrolling).
AMD CPUs, and Intel Core2 to IvyBridge, can do 2x 128b shuffles per clock, so they just bottleneck on the front-end at 3.25c + loop overhead, not on any particular port. With a bit of loop overhead, that's maybe 1.75c per pair. (Or Ryzen can do about 5 uops per clock running single-uop instructions, so two pairs per ~2.6 cycles, or 1 pair per ~1.3 cycles + overhead).
With AVX-128, and micro-fused loads, it's 9 fused-domain uops (2.25c + loop overhead to issue). Still 4x shuffles, and requires AVX1, but this is excellent for Sandybridge and AMD. About 1.125c + loop overhead per pair on SnB.
SSE2/SSE3 manual vectorization
The biggest problem with the SSE2 version above is all the extra movdqa instructions to copy registers before destroying them.
We can take advantage of the nature of AND and XOR to save some asm instructions: x & x = x, and x ^ 0 = x. One version (sketched below) uses 3 loads of the same data and might be good on Haswell; but on other CPUs (including AMD), that many loads + stores will be the bottleneck.
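A sketch of that idea (SSE3 for the movddup broadcast load; naming and layout are mine). ANDing [a, b] with the broadcast [a, a] leaves the low element as a (x & x = x) and produces a&b in the high element; XORing that with the zero-extended [b, 0] turns the low element into a^b while x ^ 0 = x preserves the high element, so the result lands in place with no shuffle uops at all:

#include <pmmintrin.h>   // SSE3 for _mm_loaddup_pd (movddup)
#include <stdint.h>

// one pair per call: 3 loads of the same 16 bytes, 2 booleans, 1 store, no shuffles
static inline void pair_3loads(uint64_t *blk)
{
    __m128i v     = _mm_load_si128((const __m128i*)blk);                   // [a, b]
    __m128i adup  = _mm_castpd_si128(_mm_loaddup_pd((const double*)blk));  // [a, a]  (movddup load)
    __m128i bzero = _mm_loadl_epi64((const __m128i*)(blk + 1));            // [b, 0]  (movq load)
    __m128i t     = _mm_and_si128(v, adup);    // [a&a, b&a]     = [a,   a&b]   (x & x = x)
    __m128i out   = _mm_xor_si128(t, bzero);   // [a^b, (a&b)^0] = [a^b, a&b]   (x ^ 0 = x)
    _mm_store_si128((__m128i*)blk, out);
}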
Or a version like the next sketch is a good balance between load and shuffle. It's actually really good on pre-AVX2 hardware.
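A sketch of that balanced version (again my own naming): one load, one shuffle to swap the two halves of the pair, the two booleans, then one more shuffle to keep the XOR result in the low half and the AND result in the high half:

#include <emmintrin.h>
#include <stdint.h>

// one pair per call: 1 load, 2 shuffles, 2 booleans, 1 store = 6 fused-domain uops
static inline void pair_2shuffles(uint64_t *blk)
{
    __m128i v    = _mm_load_si128((const __m128i*)blk);          // [a, b]
    __m128i swap = _mm_shuffle_epi32(v, _MM_SHUFFLE(1,0,3,2));   // [b, a]
    __m128i x    = _mm_xor_si128(v, swap);                       // [a^b, a^b]
    __m128i y    = _mm_and_si128(v, swap);                       // [a&b, a&b]
    // low qword from the XOR result, high qword from the AND result
    __m128i out  = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(x),
                                                   _mm_castsi128_pd(y), 2));
    _mm_store_si128((__m128i*)blk, out);                         // [a^b, a&b]
}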
So the front-end is the bottleneck for the 2-shuffle version, even on Nehalem, which can't do 2 loads per clock. On CPUs without AVX2, this may be measurably better than scalar.
On the Godbolt link, look at the clang tab for non-AVX asm output. gcc uses an extra movdqa for no reason, but clang succeeds at not wasting instructions. With loop unrolling, it should approach 1 vector per 1.5 clocks (if data is hot in cache) on Intel pre-Haswell or some AMD CPUs. On Ryzen, maybe even better than that.
AVX2
This is where compilers do a terrible job, see the gcc bug report I filed, linked earlier.
Manually vectorizing one 256b vector (two pairs) at a time, with a data-movement pattern like the one below, should be good. Here's a C/C++ intrinsics version:
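A sketch along those lines (function name and loop structure are mine; see the Godbolt link for the real thing):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// two pairs (one 256b vector) per iteration
void pair_avx2(uint64_t *blocks, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i += 4) {
        __m256i v    = _mm256_loadu_si256((const __m256i*)(blocks + i));  // [a0,b0, a1,b1]
        __m256i swap = _mm256_shuffle_epi32(v, _MM_SHUFFLE(1,0,3,2));     // swap the 64b halves within each 128b lane: [b0,a0, b1,a1]
        __m256i x    = _mm256_xor_si256(v, swap);                         // a^b in both elements of each pair
        __m256i y    = _mm256_and_si256(v, swap);                         // a&b in both elements of each pair
        // immediate blend: XOR results into the even qwords, AND results into the odd qwords
        __m256i out  = _mm256_blend_epi32(x, y, 0xCC);
        _mm256_storeu_si256((__m256i*)(blocks + i), out);                 // [a0^b0, a0&b0, a1^b1, a1&b1]
    }
}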
That's 6 fused-domain uops on Intel, and should easily run at 1 iter per 1.5 cycles (+ loop overhead), without bottlenecking on any ports. The bottleneck is the front-end, so unrolling helps.
That's 0.75 cycles per 128b pair on Haswell, plus loop overhead.
Immediate-blend can run on any port on HSW+, or p0/p5 on SnB (and good throughput on BD/Ryzen) so it's much more throughput-friendly than using vunpcklqdq to combine the AND / XOR result vectors.
Other abandoned ideas that didn't look promising
nope, easier to get that with
Basically no advantage over scalar.
Could maybe combine two vectors together and use movhps to store the high half? It needs a shuffle-port uop, though, so not much to gain over punpckhqdq or movhlps to combine two registers for a 128b store.