Using C# System.Numerics.Vector<T> to unpack / pack bits


I’m testing the capabilities of the .NET C# System.Numerics.Vector<T> class for packing and unpacking bits.

I was hoping for Vector bitwise shift left/right functionality but that is not currently available so I tried to simulate shifting using arithmetic & logical methods as below. Here’s what I saw:

Packing (a simulated bitwise SHIFT LEFT and OR) using Vector.Multiply() and Vector.BitwiseOr() is slightly worse* than array/pointer code.

*<10% degradation in throughput (MB/sec).

But Unpacking (a simulated bitwise SHIFT RIGHT and AND) using Vector.Divide() and Vector.BitwiseAnd() is far worse** than array/pointer code.

**50% degradation in throughput
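
For reference, here is a minimal sketch of what I mean by simulating the shifts; the helper names (ShiftSim, PackStep, UnpackStep) and the operand widths are illustrative only, not my actual pack/unpack code:

using System.Numerics;

static class ShiftSim
{
    // Simulated "x << k": multiply each lane by 2^k.
    static Vector<uint> ShiftLeftSim(Vector<uint> x, int k)
        => Vector.Multiply(x, 1u << k);

    // Simulated "x >> k": divide each lane by 2^k (this is the slow path).
    static Vector<uint> ShiftRightSim(Vector<uint> x, int k)
        => Vector.Divide(x, new Vector<uint>(1u << k));

    // Pack step: (word << k) | value  ->  Multiply + BitwiseOr
    static Vector<uint> PackStep(Vector<uint> word, Vector<uint> value, int k)
        => Vector.BitwiseOr(ShiftLeftSim(word, k), value);

    // Unpack step: (word >> k) & mask  ->  Divide + BitwiseAnd
    static Vector<uint> UnpackStep(Vector<uint> word, int k, uint mask)
        => Vector.BitwiseAnd(ShiftRightSim(word, k), new Vector<uint>(mask));
}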

NB:

  • Vector was tested using uint (this was also raised in the comments).

  • Test basis was the packing & unpacking of 100Mn up to 1Bn integers in blocks of 65536 integers. I randomly generated the int[] for each block.

  • I also tested bitwise (& | >> <<) as well as arithmetic (+ - * /) operations and saw no marked difference in cost. Even divide was not that bad, with only a ~10% degradation in throughput vs multiply (the question of division was raised in the comments).

  • I changed my original test code (for the non-Vector comparison) to an unsafe/pointer routine to create more of a like-for-like test in terms of packing (many integers to a word) versus unpacking (a word to many integers). This brought the difference in throughput (between packing & unpacking) for the non-Vector code down to a variance of <5% (which counters my comment about the compiler and optimization below).

  • Non-Optimized Vector: packing is 2x as fast as unpacking

  • Optimized Vector: yielded a 4x improvement (versus non-optimized Vector) in packing and a 2x improvement for unpacking

  • Non-Optimized array/pointer: unpacking is ~5% faster than packing

  • Optimized array/pointer: yielded a 3x improvement (versus non-optimized array/pointer) for packing and a 2.5x improvement for unpacking. Overall, Optimized array/pointer packing was <5% faster than Optimized array/pointer unpacking.

  • Optimized array/pointer packing was ~10% faster than an Optimized Vector pack

Conclusion so far:

  • Vector.Divide() appears to be a comparatively slower implementation vs a normal arithmetic division

  • Furthermore, the Compiler does not appear to optimize Vector.Divide() code to anywhere near the same extent as Vector.Multiply() (which supports comments below regarding the optimising of division)

  • Array/pointer processing is at present slightly faster than the Vector class for packing data and significantly faster for unpacking

  • System.Numerics needs Vector.ShiftLeft() & Vector.ShiftRight() methods

Question (updated):

  • Is my conclusion roughly on track, or are there other aspects to check/consider?

Further Information:

int numPages = 8192;   // up to >15K
int testSize = 65536;
Stopwatch swPack = new Stopwatch();
Stopwatch swUnpack = new Stopwatch();
long byteCount = 0;
for (int p = 0; p < numPages; p++)
{
    int[] data = GetRandomIntegers(testSize, 14600, 14800);

    swPack.Start();
    byte[] compressedBytes = pack(data);
    swPack.Stop();

    swUnpack.Start();
    int[] unpackedInts = unpack(compressedBytes);
    swUnpack.Stop();

    byteCount += (data.Length * 4);
}
Console.WriteLine("Packing Throughput (MB/sec): " + byteCount / 1000 / swPack.ElapsedMilliseconds);
Console.WriteLine("Unpacking Throughput (MB/sec): " + byteCount / 1000 / swUnpack.ElapsedMilliseconds);

There are 2 answers below.


Vector.Divide has no hardware acceleration for integer types. It is very slow.

It wasn't until .NET 7.0 that Vector gained the ShiftRightArithmetic and ShiftRightLogical methods.

I developed the VectorTraits library. It allows programs on lower .NET versions (.NET Core 3.0+, .NET 5.0+) to use hardware-accelerated ShiftRightArithmetic and ShiftRightLogical methods. https://www.nuget.org/packages/VectorTraits
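
For example, on .NET 7 or later the simulated shifts from the question can be replaced with the built-in ones directly (a sketch; the PackStep/UnpackStep helper names and the mask parameter are illustrative):

using System.Numerics;

static class Net7Shifts
{
    // Pack step on .NET 7+: a real shift left instead of Vector.Multiply.
    static Vector<uint> PackStep(Vector<uint> word, Vector<uint> value, int k)
        => Vector.BitwiseOr(Vector.ShiftLeft(word, k), value);

    // Unpack step on .NET 7+: a real logical shift right instead of Vector.Divide.
    static Vector<uint> UnpackStep(Vector<uint> word, int k, uint mask)
        => Vector.BitwiseAnd(Vector.ShiftRightLogical(word, k), new Vector<uint>(mask));
}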


IL

/// non-SIMD fallback implementation for 128-bit right-shift (unsigned)
/// n: number of bit positions to right-shift a 16-byte memory image.
/// Vector(T) argument 'v' is passed by-ref and modified in-situ.
/// Layout order of the two 64-bit quads is little-endian.

.method public static void SHR(Vector_T<uint64>& v, int32 n) aggressiveinlining
{
    ldarg v             // &v
    dup
    dup                 // &v, &v, &v
    ldc.i4.8
    add                 // &v, &v, &v+8
    ldind.i8            // &v, &v, hi (high quad)
    ldc.i4.s 64
    ldarg n
    sub                 // 64 - n
    shl                 // &v, &v, hi << (64 - n)

    ldarg v
    ldind.i8            // lo (low quad)
    ldarg n
    shr.un              // lo >> n

    or                  // combine the carried-in high bits with the shifted low quad
    stind.i8            // store new low quad; &v remains on the stack

    ldc.i4.8
    add                 // &v + 8
    dup
    ldind.i8            // hi
    ldarg n
    shr.un              // hi >> n
    stind.i8            // store new high quad

    ret
}

pseudocode

As<Vector<ulong>,ulong>(ref v) = (As<Vector<ulong>,ulong>(in v) >> n) | 
                                  (ByteOffsAs<Vector<ulong>,ulong>(in v, 8) << (64 - n));
ByteOffsAs<Vector<ulong>,ulong>(ref v, 8) >>= n;
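
The same thing written against the Unsafe class (the System.Runtime.CompilerServices.Unsafe package on .NET Framework), so it compiles as ordinary C#; like the IL above, it assumes 0 < n < 64 and only touches the low two 64-bit lanes:

using System.Numerics;
using System.Runtime.CompilerServices;

static class VectorShr
{
    // C# equivalent of the pseudocode: right-shift the low 128 bits of v in place,
    // treating the two ulong lanes as one little-endian 128-bit value.
    public static void Shr128(ref Vector<ulong> v, int n)
    {
        ref ulong lo = ref Unsafe.As<Vector<ulong>, ulong>(ref v);
        ref ulong hi = ref Unsafe.Add(ref lo, 1);   // quad at byte offset 8
        lo = (lo >> n) | (hi << (64 - n));          // bits carried down from hi
        hi >>= n;
    }
}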

C# extern declaration

static class vector_ext
{
    [MethodImpl(MethodImplOptions.ForwardRef | MethodImplOptions.AggressiveInlining)]
    extern public static void SHR(ref Vector<ulong> v, int n);
};
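
A hypothetical call site, once the IL body has been linked into the assembly:

Vector<ulong> w = new Vector<ulong>(0x1234_5678_9ABC_DEF0UL);
vector_ext.SHR(ref w, 4);    // shifts the low 128 bits right by 4, in place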

You can link the intermediate .netmodule binaries produced from the IL (ilasm.exe) and the C# (csc.exe) into a single assembly by using the /LTCG (link-time code generation) option of link.exe.

runtime x64 JIT result (.NET Framework 4.7.2)

0x7FF878F5C7E0    48 89 4C 24 08       mov qword ptr [rsp+8],rcx
0x7FF878F5C7E5    8B C2                mov eax,edx
0x7FF878F5C7E7    F7 D8                neg eax
0x7FF878F5C7E9    8D 48 40             lea ecx,[rax+40h]
0x7FF878F5C7EC    48 8B 44 24 08       mov rax,qword ptr [rsp+8]
0x7FF878F5C7F1    4C 8B 40 08          mov r8,qword ptr [rax+8]
0x7FF878F5C7F5    49 D3 E0             shl r8,cl
0x7FF878F5C7F8    4C 8B 08             mov r9,qword ptr [rax]
0x7FF878F5C7FB    8B CA                mov ecx,edx
0x7FF878F5C7FD    49 D3 E9             shr r9,cl
0x7FF878F5C800    4D 0B C1             or  r8,r9
0x7FF878F5C803    4C 89 00             mov qword ptr [rax],r8
0x7FF878F5C806    48 83 C0 08          add rax,8
0x7FF878F5C80A    8B CA                mov ecx,edx
0x7FF878F5C80C    48 D3 28             shr qword ptr [rax],cl
0x7FF878F5C80F    C3                   ret