Logical shift between YMM registers


Is it possible for me to load, let's say, a 2048-bit number into 8 AVX ymm registers, and shift bits left and right between all of these?

I only need to shift 1 bit at a time.

I've tried finding accurate info on AVX but the interaction between xmm/ymm/zmm and the carry bit seems unclear a lot of the time.


2 Answers


I've tried finding accurate info on AVX but the interaction between xmm/ymm/zmm and the carry bit seems unclear a lot of the time.

That's the simple part: there is no interaction. SSE/AVX arithmetic does not involve the flags. There are some specific instructions that compare/test vectors (ptest) or scalars in vectors (comiss etc) and then set flags, but they're not that useful here.

One approach is to start at the top of your number instead of the bottom, load two slightly-offset vectors (mostly overlapping, with one vector offset by one element relative to the other), and use one of the "concatenate and shift" instructions (e.g. vpshld) to do a left shift that shifts in bits from the previous element instead of zeroes. In general vpshld shifts in bits from another vector, not from the previous element, which is exactly why the second vector is loaded at a one-element offset. In AVX2 you can emulate this with a left shift, a right shift, and vpor, as in the sketch below.
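Here's a minimal sketch of that idea for a left shift by 1 bit, assuming the number is stored little-endian as an array of uint64_t; the function name and the scalar handling of the lowest elements are my own illustration, not part of the original answer:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Shift a little-endian multi-word number left by 1 bit, walking from the top down.
// Each 4-element block is loaded twice: once at its own position and once offset
// down by one element, so the bits shifted in come from the previous element.
void shiftLeft1_overlapped( uint64_t* data, size_t count )
{
    size_t i = count;
    // Vector loop: a block at position i also reads data[ i - 1 ],
    // so stop once the block would reach below the start of the buffer
    while( i >= 5 )
    {
        i -= 4;
        __m256i cur  = _mm256_loadu_si256( ( const __m256i* )( data + i ) );
        __m256i prev = _mm256_loadu_si256( ( const __m256i* )( data + i - 1 ) );
        // Emulate vpshldq( cur, prev, 1 ) with AVX2: ( cur << 1 ) | ( prev >> 63 )
        __m256i hi = _mm256_slli_epi64( cur, 1 );
        __m256i lo = _mm256_srli_epi64( prev, 63 );
        _mm256_storeu_si256( ( __m256i* )( data + i ), _mm256_or_si256( hi, lo ) );
    }
    // Scalar tail for the lowest few elements, shifting a zero into the very bottom
    while( i > 0 )
    {
        i--;
        const uint64_t below = ( i == 0 ) ? 0 : data[ i - 1 ];
        data[ i ] = ( data[ i ] << 1 ) | ( below >> 63 );
    }
}

With AVX-512 VBMI2 available, the two shifts and the vpor collapse into a single vpshldq per block.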


It’s possible, but not straightforward.

Here’s an AVX2 implementation in C++ which does it in 5 instructions per register.

#include <immintrin.h>

// Shift AVX vector left by 1 bit
// The flag should contain either 0 or 1 in the lowest int32 lane, higher 96 bits are unused
inline __m256i shiftLeft1( const __m256i src, __m128i& carryFlag )
{
    // Shift 64 bit lanes right by 63 bits, i.e. isolate the high bit into low location
    __m256i right = _mm256_srli_epi64( src, 63 );
    // Cyclic permute across the complete vector
    right = _mm256_permute4x64_epi64( right, _MM_SHUFFLE( 2, 1, 0, 3 ) );

    // Deal with the carry flags
    const __m128i nextFlag = _mm256_castsi256_si128( right );
    right = _mm256_blend_epi32( right, _mm256_castsi128_si256( carryFlag ), 1 );
    carryFlag = nextFlag;

    // Shift 64 bit lanes left by 1 bit
    __m256i left = _mm256_slli_epi64( src, 1 );
    // Assemble the result
    return _mm256_or_si256( left, right );
}

// Shift AVX vector right by 1 bit
// The flag should contain either 0 or 0x80000000 in the highest int32 lane, lower 224 bits are unused
inline __m256i shiftRight1( const __m256i src, __m256i& carryFlag )
{
    // Shift 64 bit lanes left by 63 bits, i.e. isolate low bits into high location
    __m256i left = _mm256_slli_epi64( src, 63 );
    // Cyclic permute across the complete vector
    left = _mm256_permute4x64_epi64( left, _MM_SHUFFLE( 0, 3, 2, 1 ) );

    // Deal with the carry flags
    const __m256i nextFlag = left;
    left = _mm256_blend_epi32( left, carryFlag, 0b10000000 );
    carryFlag = nextFlag;

    // Shift 64 bit lanes right by 1 bit
    __m256i right = _mm256_srli_epi64( src, 1 );
    // Assemble the result
    return _mm256_or_si256( left, right );
}

Most of these 5 instructions are very fast, with 1 cycle of latency; the exception is vpermq, which takes 3-6 cycles on most processors. Luckily, that vpermq doesn’t depend on the carry flag, only on the input vector, so its latency stays off the serial dependency chain between the registers. Modern out-of-order processors should be able to do a decent job running that code.

Usage example for a 1024-bit number in 4 vectors:

// 1024 bits of data in 4 AVX registers
struct Blob1k
{
    __m256i v0, v1, v2, v3;
};

void shiftLeft1( Blob1k& blob )
{
    __m128i cf = _mm_setzero_si128();
    blob.v0 = shiftLeft1( blob.v0, cf );
    blob.v1 = shiftLeft1( blob.v1, cf );
    blob.v2 = shiftLeft1( blob.v2, cf );
    blob.v3 = shiftLeft1( blob.v3, cf );
}

void shiftRight1( Blob1k& blob )
{
    __m256i cf = _mm256_setzero_si256();
    blob.v3 = shiftRight1( blob.v3, cf );
    blob.v2 = shiftRight1( blob.v2, cf );
    blob.v1 = shiftRight1( blob.v1, cf );
    blob.v0 = shiftRight1( blob.v0, cf );
}
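The same chaining pattern extends to the 2048-bit case from the question: eight registers sharing one carry flag. A sketch, with Blob2k being a hypothetical struct analogous to Blob1k:

// 2048 bits of data in 8 AVX registers
struct Blob2k
{
    __m256i v[ 8 ];
};

void shiftLeft1( Blob2k& blob )
{
    // Propagate the carry from the lowest register to the highest
    __m128i cf = _mm_setzero_si128();
    for( int i = 0; i < 8; i++ )
        blob.v[ i ] = shiftLeft1( blob.v[ i ], cf );
}

void shiftRight1( Blob2k& blob )
{
    // Propagate the carry from the highest register to the lowest
    __m256i cf = _mm256_setzero_si256();
    for( int i = 7; i >= 0; i-- )
        blob.v[ i ] = shiftRight1( blob.v[ i ], cf );
}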