Beginner here. Sorry.
Old method:
float[] Matrix = new float[dimension * dimension];
// Matrix is square
// 1D array is much faster than 2D/jagged array for this step
// Matrix will be multiplied by itself
float[] Result = new float[dimension * dimension];
_ = Parallel.For(0, dimension, i =>
{
float[] column = new float[dimension];
for (int j = 0; j < dimension; j++)
{
column[j] = Matrix[j * dimension + i];
}
for (int j = 0; j < dimension; j++)
{
float DotProd = 0;
for (int k = 0; k < dimension; k++)
{
DotProd += Matrix[j * dimension + k] * column[k];
}
Result[j * dimension + i] = DotProd;
}
}
It makes sense to me why making the "column" array beforehand improves performance: this array puts all of the necessary data in one location, which is then stored in the CPU cache. The performance improvement when I did this was enormous. Calculation time went from 67 seconds to 30 seconds.
So, seeing that it worked so well for "columns", I tried it with the "rows" as well. This was done by copying the 1D array into a Jagged array, like so:
float[] Matrix = new float[dimension * dimension];
// Matrix is square
// 1D array is much faster than 2D/jagged array when filling it with data
// Matrix will be multiplied by itself
float[] Result = new float[dimension * dimension];
float[][] JagMatrix = new float[dimension][];
for (int i = 0; i < dimension; i++)
{
JagMatrix[i] = new float[dimension];
Array.Copy(Matrix, dimension * i, JagMatrix[i], 0, dimension);
}
_ = Parallel.For(0, dimension, i =>
{
float[] column = new float[dimension];
for (int j = 0; j < dimension; j++)
{
column[j] = Matrix[j * dimension + i];
}
for (int j = 0; j < dimension; j++)
{
float DotProd = 0;
float[] row = JagMatrix[j];
for (int k = 0; k < dimension; k++)
{
DotProd += row[k] * column[k];
}
Result[j * dimension + i] = DotProd;
}
}
This is about 30% faster again, presumably for the same reason: it ensures that the "row" is stored in CPU cache.
However, I'm wondering if there's a way to get the "row" into cache without copying the matrix to a jagged array. Besides the (probably negligible) performance hit from setting it up, it doesn't seem like it should be necessary. Unlike a "column", the data for a "row" should already be in the one location (as indicated by the much smaller performance improvement). So, is there a way to get it into cache without creating the jagged array?
It can't be a jagged array from the start as it's faster to fill the matrix when it's a 1D array.
I've tried using pointers and spans, but both were slower.