I've implemented a method for parsing an unsigned integer string of length <= 8 using SIMD intrinsics available in .NET as follows:

    public unsafe static uint ParseUint(string text)
    {
        fixed (char* c = text)
        {
            var parsed = Sse3.LoadDquVector128((byte*) c);
            var shift = (8 - text.Length) * 2;
            var shifted = Sse2.ShiftLeftLogical128BitLane(parsed, (byte) (shift));
            Vector128<byte> digit0 = Vector128.Create((byte) '0');
            var reduced = Sse2.SubtractSaturate(shifted, digit0);
            var shortMult = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
            var collapsed2 = Sse2.MultiplyAddAdjacent(reduced.As<byte, short>(), shortMult);
            var repack = Sse41.PackUnsignedSaturate(collapsed2, collapsed2);
            var intMult = Vector128.Create((short)0, 0, 0, 0, 100, 1, 100, 1);
            var collapsed3 = Sse2.MultiplyAddAdjacent(repack.As<ushort,short>(), intMult);
            var e1 = collapsed3.GetElement(2);
            var e2 = collapsed3.GetElement(3);
            return (uint) (e1 * 10000 + e2);
        }
    }

Sadly, a comparison with a baseline `uint.Parse()` gives the following, rather unimpressive, result:
| Method | Mean | Error | StdDev |
|---|---|---|---|
| Baseline | 15.157 ns | 0.0325 ns | 0.0304 ns |
| ParseSimd | 3.269 ns | 0.0115 ns | 0.0102 ns |
What are some of the ways the above code can be improved? My particular areas of concern are:
- The way a bit shift of the SIMD register happens with a calculation involving `text.Length`
- ~~The unpacking of UTF-16 data using a `MultiplyAddAdjacent` involving a vector of `0`s and `1`~~
- The way elements are extracted using `GetElement()` -- maybe there's some `ToScalar()` call that can happen somewhere?
First of all, a 5x improvement is not “rather unimpressive”.
I would not do the last step with scalar code; here’s an alternative:
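For illustration, here is a minimal sketch of keeping that last step in vector registers, written in C++ with SSE2 intrinsics (each of which should map onto a `Sse2` method in .NET). It is not necessarily the exact code meant above. The name `parseUintSse2` is made up, and the sketch assumes an x86-64 target and digits-only input of at most 8 characters. It also right-aligns the input through a small `'0'`-padded buffer instead of using the variable byte shift. The two 4-digit halves are narrowed to 16 bits and combined with a third multiply-add by (10000, 1), so only one scalar extraction remains.

```cpp
#include <emmintrin.h> // SSE2 intrinsics (baseline on x86-64)
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not from the original post.
// Assumes len <= 8 and that every character is '0'..'9' (no validation).
inline std::uint32_t parseUintSse2(const char16_t* text, std::size_t len)
{
    assert(len <= 8);

    // Right-align the digits in a '0'-padded buffer (stands in for the byte shift).
    char16_t padded[8] = { u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0' };
    std::memcpy(padded + (8 - len), text, len * sizeof(char16_t));

    // Eight UTF-16 characters -> eight 16-bit digit values 0..9.
    const __m128i chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(padded));
    const __m128i digits = _mm_sub_epi16(chars, _mm_set1_epi16('0'));

    // Adjacent digits -> four 32-bit values 0..99.
    const __m128i pairs = _mm_madd_epi16(digits, _mm_setr_epi16(10, 1, 10, 1, 10, 1, 10, 1));
    // Narrow to 16 bits; values are small, so signed saturation never triggers.
    const __m128i pairs16 = _mm_packs_epi32(pairs, pairs);

    // Adjacent pairs -> 32-bit values 0..9999 (high half, low half, repeated).
    const __m128i quads = _mm_madd_epi16(pairs16, _mm_setr_epi16(100, 1, 100, 1, 100, 1, 100, 1));
    const __m128i quads16 = _mm_packs_epi32(quads, quads);

    // Final step in the vector unit: high * 10000 + low in every 32-bit lane.
    const __m128i total = _mm_madd_epi16(
        quads16, _mm_setr_epi16(10000, 1, 10000, 1, 10000, 1, 10000, 1));

    // Single extraction of lane 0.
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(total));
}
```

The closing `_mm_cvtsi128_si32` reads the lowest lane, which is roughly what `ToScalar()` does on a `Vector128<int>`, so no `GetElement()` calls are needed.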
The C++ version is going to be way faster than what happens on my PC (.NET Core 3.1). The generated code is not good. They initialize constants like this:
They use stack memory instead of another vector register. It looks like the JIT developers forgot there are 16 vector registers there; the complete function only uses `xmm0`.