Using C# System.Numerics.Vector<T> to unpack / pack bits


I’m testing the capabilities of the .NET C# System.Numerics.Vector<T> class for packing and unpacking bits.

I was hoping for Vector bitwise shift left/right functionality but that is not currently available so I tried to simulate shifting using arithmetic & logical methods as below. Here’s what I saw:

Packing (a simulated bitwise SHIFT LEFT and OR) using Vector.Multiply() and Vector.BitwiseOr() is slightly worse* than array/pointer code.

*<10% degradation in throughput (MB/sec).

But Unpacking (a simulated bitwise SHIFT RIGHT and AND) using Vector.Divide() and Vector.BitwiseAnd() is far worse** than array/pointer code.

**50% degradation in throughput
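
For reference, here is a minimal sketch of what I mean by simulating the shifts; the helper names (ShiftSim, PackStep, UnpackStep) and the operand widths are illustrative only, not my actual pack/unpack code:

using System.Numerics;

static class ShiftSim
{
    // Simulated "x << k": multiply each lane by 2^k.
    static Vector<uint> ShiftLeftSim(Vector<uint> x, int k)
        => Vector.Multiply(x, 1u << k);

    // Simulated "x >> k": divide each lane by 2^k (this is the slow path).
    static Vector<uint> ShiftRightSim(Vector<uint> x, int k)
        => Vector.Divide(x, new Vector<uint>(1u << k));

    // Pack step: (word << k) | value  ->  Multiply + BitwiseOr
    static Vector<uint> PackStep(Vector<uint> word, Vector<uint> value, int k)
        => Vector.BitwiseOr(ShiftLeftSim(word, k), value);

    // Unpack step: (word >> k) & mask  ->  Divide + BitwiseAnd
    static Vector<uint> UnpackStep(Vector<uint> word, int k, uint mask)
        => Vector.BitwiseAnd(ShiftRightSim(word, k), new Vector<uint>(mask));
}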

NB:

  • Vector was tested using uint (this was also raised in the comments).

  • Test basis was the packing & unpacking of 100Mn up to 1Bn integers in blocks of 65536 integers. I randomly generated the int[] for each block.

  • I also tested bitwise (& | >> <<) as well as arithmetic (+ - * /) operations and saw no marked difference in cost. Even divide was not that bad, with only a ~10% degradation in throughput vs multiply (the question of division was raised in the comments).

  • I changed my original test code (for the non-Vector comparison) to an unsafe/pointer routine to create more of a like-for-like test in terms of packing (many integers to a word) versus unpacking (a word to many integers). This brought the difference in throughput (between packing & unpacking) for the non-Vector code down to a variance of <5% (which counters my comment about the compiler and optimization below).

  • Non-Optimized Vector: packing is 2x as fast as unpacking

  • Optimized Vector: yielded a 4x improvement (versus non-optimized Vector) in packing and a 2x improvement for unpacking

  • Non-Optimized array/pointer: unpacking is ~5% faster than packing

  • Optimized array/pointer: yielded a 3x improvement (versus non-optimized array/pointer) for packing and a 2.5x improvement for unpacking. Overall, Optimized array/pointer packing was <5% faster than Optimized array/pointer unpacking.

  • Optimized array/pointer packing was ~10% faster than an Optimized Vector pack

Conclusion so far:

  • Vector.Divide() appears to be a comparatively slower implementation vs a normal arithmetic division

  • Furthermore, the Compiler does not appear to optimize Vector.Divide() code to anywhere near the same extent as Vector.Multiply() (which supports comments below regarding the optimising of division)

  • Array/pointer processing is at present slightly faster than the Vector class for packing data and significantly faster for unpacking

  • System.Numerics needs Vector.ShiftLeft() & Vector.ShiftRight() methods

Question (updated):

  • Is my conclusion roughly on track, or are there other aspects to check/consider?

Further Information:

int numPages = 8192;   // up to >15K
int testSize = 65536;
Stopwatch swPack = new Stopwatch();
Stopwatch swUnpack = new Stopwatch();
long byteCount = 0;
for (int p = 0; p < numPages; p++)
{
    int[] data = GetRandomIntegers(testSize, 14600, 14800);

    swPack.Start();
    byte[] compressedBytes = pack(data);
    swPack.Stop();

    swUnpack.Start();
    int[] unpackedInts = unpack(compressedBytes);
    swUnpack.Stop();

    byteCount += (data.Length * 4);
}
Console.WriteLine("Packing Throughput (MB/sec): " + byteCount / 1000 / swPack.ElapsedMilliseconds);
Console.WriteLine("Unpacking Throughput (MB/sec): " + byteCount / 1000 / swUnpack.ElapsedMilliseconds);

There are 2 answers below.


Vector.Divide has no hardware acceleration for integer types. It is very slow.

It wasn't until .NET 7.0 that Vector gained the ShiftRightArithmetic and ShiftRightLogical methods.

I developed the VectorTraits library. It allows programs on lower .NET versions (.NET Core 3.0+, .NET 5.0+) to use hardware-accelerated ShiftRightArithmetic and ShiftRightLogical methods. https://www.nuget.org/packages/VectorTraits
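
For example, on .NET 7 or later the simulated shifts from the question can be replaced with the built-in ones directly (a sketch; the PackStep/UnpackStep helper names and the mask parameter are illustrative):

using System.Numerics;

static class Net7Shifts
{
    // Pack step on .NET 7+: a real shift left instead of Vector.Multiply.
    static Vector<uint> PackStep(Vector<uint> word, Vector<uint> value, int k)
        => Vector.BitwiseOr(Vector.ShiftLeft(word, k), value);

    // Unpack step on .NET 7+: a real logical shift right instead of Vector.Divide.
    static Vector<uint> UnpackStep(Vector<uint> word, int k, uint mask)
        => Vector.BitwiseAnd(Vector.ShiftRightLogical(word, k), new Vector<uint>(mask));
}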


IL

/// non-SIMD fallback implementation for 128-bit right-shift (unsigned)
/// n: number of bit positions to right-shift a 16-byte memory image.
/// Vector(T) argument 'v' is passed by-ref and modified in-situ.
/// Layout order of the two 64-bit quads is little-endian.

.method public static void SHR(Vector_T<uint64>& v, int32 n) aggressiveinlining
{
    ldarg v             // &v
    dup
    dup                 // &v, &v, &v
    ldc.i4.8
    add                 // &v, &v, &v+8
    ldind.i8            // &v, &v, hi (high quad)
    ldc.i4.s 64
    ldarg n
    sub                 // 64 - n
    shl                 // &v, &v, hi << (64 - n)

    ldarg v
    ldind.i8            // lo (low quad)
    ldarg n
    shr.un              // lo >> n

    or                  // combine the carried-in high bits with the shifted low quad
    stind.i8            // store new low quad; &v remains on the stack

    ldc.i4.8
    add                 // &v + 8
    dup
    ldind.i8            // hi
    ldarg n
    shr.un              // hi >> n
    stind.i8            // store new high quad

    ret
}

pseudocode

As<Vector<ulong>,ulong>(ref v) = (As<Vector<ulong>,ulong>(in v) >> n) | 
                                  (ByteOffsAs<Vector<ulong>,ulong>(in v, 8) << (64 - n));
ByteOffsAs<Vector<ulong>,ulong>(ref v, 8) >>= n;
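
The same thing written against the Unsafe class (the System.Runtime.CompilerServices.Unsafe package on .NET Framework), so it compiles as ordinary C#; like the IL above, it assumes 0 < n < 64 and only touches the low two 64-bit lanes:

using System.Numerics;
using System.Runtime.CompilerServices;

static class VectorShr
{
    // C# equivalent of the pseudocode: right-shift the low 128 bits of v in place,
    // treating the two ulong lanes as one little-endian 128-bit value.
    public static void Shr128(ref Vector<ulong> v, int n)
    {
        ref ulong lo = ref Unsafe.As<Vector<ulong>, ulong>(ref v);
        ref ulong hi = ref Unsafe.Add(ref lo, 1);   // quad at byte offset 8
        lo = (lo >> n) | (hi << (64 - n));          // bits carried down from hi
        hi >>= n;
    }
}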

C# extern declaration

static class vector_ext
{
    [MethodImpl(MethodImplOptions.ForwardRef | MethodImplOptions.AggressiveInlining)]
    extern public static void SHR(ref Vector<ulong> v, int n);
};
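
A hypothetical call site, once the IL body has been linked into the assembly:

Vector<ulong> w = new Vector<ulong>(0x1234_5678_9ABC_DEF0UL);
vector_ext.SHR(ref w, 4);    // shifts the low 128 bits right by 4, in place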

You can link the intermediate .netmodule binaries produced from the IL (ilasm.exe) and the C# (csc.exe) into a single assembly by using the /LTCG (link-time code generation) option of link.exe.

runtime x64 JIT result (.NET Framework 4.7.2)

0x7FF878F5C7E0    48 89 4C 24 08       mov qword ptr [rsp+8],rcx
0x7FF878F5C7E5    8B C2                mov eax,edx
0x7FF878F5C7E7    F7 D8                neg eax
0x7FF878F5C7E9    8D 48 40             lea ecx,[rax+40h]
0x7FF878F5C7EC    48 8B 44 24 08       mov rax,qword ptr [rsp+8]
0x7FF878F5C7F1    4C 8B 40 08          mov r8,qword ptr [rax+8]
0x7FF878F5C7F5    49 D3 E0             shl r8,cl
0x7FF878F5C7F8    4C 8B 08             mov r9,qword ptr [rax]
0x7FF878F5C7FB    8B CA                mov ecx,edx
0x7FF878F5C7FD    49 D3 E9             shr r9,cl
0x7FF878F5C800    4D 0B C1             or  r8,r9
0x7FF878F5C803    4C 89 00             mov qword ptr [rax],r8
0x7FF878F5C806    48 83 C0 08          add rax,8
0x7FF878F5C80A    8B CA                mov ecx,edx
0x7FF878F5C80C    48 D3 28             shr qword ptr [rax],cl
0x7FF878F5C80F    C3                   ret