I'm experimenting with System.Numerics to multiply array elements. Is there a faster way of multiplying the elements of the resulting vector (accVector) together? Currently accVector has to be copied to an array whose elements are then multiplied together using LINQ.
private double VectorMultiplication(double[] array)
{
    int vectorSize = Vector<double>.Count;
    var accVector = Vector<double>.One;
    int i;
    // Multiply one SIMD-width chunk of the array into the accumulator per iteration.
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<double>(array, i);
        accVector = Vector.Multiply(accVector, v);
    }
    // Reduce the accumulator to a scalar by multiplying its lanes together.
    var tempArray = new double[Vector<double>.Count];
    accVector.CopyTo(tempArray);
    var result = tempArray.Aggregate(1d, (p, d) => p * d);
    // Handle the leftover elements that didn't fill a whole vector.
    for (; i < array.Length; i++)
    {
        result *= array[i];
    }
    return result;
}
Within System.Numerics, no. As mentioned by Peter in the comments, you would usually start by splitting the 256-bit vector into two 128-bit halves and multiplying them together, then use shuffles to reduce the remaining 128-bit vector. But System.Numerics offers no shuffles, and it does not even let you choose the size of the vector you're using.
The usual approach can be used with the System.Runtime.Intrinsics.X86 API, which requires .NET Core 3.0 or higher.
For example:
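Something along these lines (a sketch only, assuming AVX/SSE2 are available; HorizontalProduct is a hypothetical helper name, not a library function):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Reduce a 256-bit vector of doubles to the product of its 4 lanes.
static double HorizontalProduct(Vector256<double> v)
{
    // Multiply the upper 128-bit half into the lower half: {v0*v2, v1*v3}.
    Vector128<double> half = Sse2.Multiply(v.GetLower(), v.GetUpper());
    // Multiply the two remaining lanes to finish the reduction.
    return half.GetElement(0) * half.GetElement(1);
}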
That looks like it might be bad, leaving a mysterious GetElement up to the JIT engine to figure out, but the codegen is actually quite reasonable: GetElement(0) is implicit and GetElement(1) results in a vpshufd, which is fine. Copying xmm0 to xmm1 instead of using a non-destructive vpshufd is a bit mysterious, but not that bad; overall it is better than I normally expect of .NET. I tested this function non-inlined; usually it would be inlined and the loads would go away.

The main loop can be improved, because the throughput of multiplication is much better than its latency. Right now the multiplications are done one at a time (that is, one vector multiplication at a time), with a delay in between (5 cycles on Haswell, 4 on Broadwell and newer) to wait for the previous multiplication to finish, whereas an Intel Haswell, for example, could be starting two multiplications per cycle, which is 10 times as much. Realistically the improvement wouldn't be that big, but creating some opportunity for instruction-level parallelism helps.
For example (not tested):
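A rough sketch with two independent accumulators, so two multiplications can be in flight at once (the method name, the pointer-based loads, and the missing fallback for CPUs without AVX are assumptions for illustration; it reuses the HorizontalProduct helper above):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe double VectorMultiplicationFast(double[] array)
{
    var acc0 = Vector256.Create(1.0);
    var acc1 = Vector256.Create(1.0);
    int i = 0;
    fixed (double* p = array)
    {
        // Two independent multiplications per iteration, so the second one
        // does not have to wait for the latency of the first.
        for (; i + 8 <= array.Length; i += 8)
        {
            acc0 = Avx.Multiply(acc0, Avx.LoadVector256(p + i));
            acc1 = Avx.Multiply(acc1, Avx.LoadVector256(p + i + 4));
        }
    }
    // Combine the two accumulators, then reduce to a scalar.
    double result = HorizontalProduct(Avx.Multiply(acc0, acc1));
    // Scalar cleanup for the 0..7 leftover elements.
    for (; i < array.Length; i++)
        result *= array[i];
    return result;
}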
This makes the final scalar loop run for potentially 8 times as many iterations as it used to; that could be avoided by adding an extra single-vector-per-iteration loop before it.