I ran some benchmarks to compare the performance of doubles and floats, and I was very surprised to see that doubles are much faster than floats.
I saw some discussion about that, for example:
Is using double faster than float?
Are doubles faster than floats in c#?
Most of them said that double and float performance may end up similar, because of double-precision optimizations, etc. But I saw a 2x performance improvement when using doubles! How is that possible? What makes it worse is that I'm using a 32-bit machine, which is expected to perform better with floats according to some posts...
I used C# to measure it precisely, but a similar C++ implementation shows the same behavior.
Code I used to check it:
static void Main(string[] args)
{
    double[,] doubles = new double[64, 64];
    float[,] floats = new float[64, 64];

    System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();

    s.Restart();
    CalcDoubles(doubles);
    s.Stop();
    long doubleTime = s.ElapsedMilliseconds;

    s.Restart();
    CalcFloats(floats);
    s.Stop();
    long floatTime = s.ElapsedMilliseconds;

    Console.WriteLine("Doubles time: " + doubleTime + " ms");
    Console.WriteLine("Floats time: " + floatTime + " ms");
}

private static void CalcDoubles(double[,] arr)
{
    unsafe
    {
        fixed (double* p = arr)
        {
            for (int b = 0; b < 192 * 12; ++b)
            {
                for (int i = 0; i < 64; ++i)
                {
                    for (int j = 0; j < 64; ++j)
                    {
                        double* addr = (p + i * 64 + j);
                        double arrij = *addr;
                        arrij = arrij == 0 ? 1.0f / (i * j) : arrij * (double)i / j;
                        *addr = arrij;
                    }
                }
            }
        }
    }
}

private static void CalcFloats(float[,] arr)
{
    unsafe
    {
        fixed (float* p = arr)
        {
            for (int b = 0; b < 192 * 12; ++b)
            {
                for (int i = 0; i < 64; ++i)
                {
                    for (int j = 0; j < 64; ++j)
                    {
                        float* addr = (p + i * 64 + j);
                        float arrij = *addr;
                        arrij = arrij == 0 ? 1.0f / (i * j) : arrij * (float)i / j;
                        *addr = arrij;
                    }
                }
            }
        }
    }
}
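For completeness, here is a variation of the timing harness that warms both methods up on throwaway arrays before measuring, so JIT compilation time is excluded from the numbers (it assumes the same CalcDoubles/CalcFloats methods as above):

static void Main(string[] args)
{
    double[,] doubles = new double[64, 64];
    float[,] floats = new float[64, 64];

    // Warm-up on throwaway arrays so JIT compilation is not timed
    // and the measured arrays keep their original (all-zero) content.
    CalcDoubles(new double[64, 64]);
    CalcFloats(new float[64, 64]);

    var s = System.Diagnostics.Stopwatch.StartNew();
    CalcDoubles(doubles);
    s.Stop();
    Console.WriteLine("Doubles time: " + s.ElapsedMilliseconds + " ms");

    s.Restart();
    CalcFloats(floats);
    s.Stop();
    Console.WriteLine("Floats time: " + s.ElapsedMilliseconds + " ms");
}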
I'm using a very weak notebook: an Intel Atom N455 processor (dual core, 1.67 GHz, 32-bit) with 2 GB RAM.
It looks like the jitter's optimizer drops the ball here: it doesn't suppress a redundant store in the float case. The hot code is the
1.0f / (i * j)
calculation, since all array values are 0. Both the x86 and the x64 jitter emit superfluous instructions for it in the float version, essentially a redundant store of the result. The optimizer did manage to eliminate them in the double version, so that code runs faster.
The redundant stores are actually present in the IL generated by the C# compiler; it is the job of the optimizer to detect and remove them. Notably, both the x86 and the x64 jitter have this flaw, so it looks like a general oversight in the optimizer algorithm.
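As an experiment (not a guaranteed fix, since the stores are already in the IL and it is the jitter that has to remove them), you can rewrite the inner statement of CalcFloats so it writes through the pointer directly instead of going through a local, and compare the generated code:

// Inner loop body of CalcFloats, without the intermediate local.
// Whether this actually changes the machine code depends on the jitter version.
float* addr = p + i * 64 + j;
*addr = *addr == 0 ? 1.0f / (i * j) : *addr * (float)i / j;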
The x64 code is especially noteworthy for converting the float result to double and then back to float again, suggesting that the underlying problem is a data-type conversion it doesn't know how to suppress. You can also see it in the x86 code, where the redundant store actually performs a double-to-float conversion. Eliminating the conversion looks difficult in the x86 case, so this may well have leaked into the x64 jitter.
Do note that the x64 code runs significantly faster than the x86 code, so be sure to set the Platform target to AnyCPU for a simple win. At least part of that speed-up comes from the optimizer's smarts at hoisting the integer multiplication.
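If you do that, a quick runtime check tells you which jitter you actually got; keep in mind that on a 32-bit operating system (which an Atom N455 netbook most likely runs) AnyCPU still gives you a 32-bit process and therefore the x86 jitter:

// Reports whether the process runs 64-bit (x64 jitter) or 32-bit (x86 jitter).
Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
Console.WriteLine("64-bit OS: " + Environment.Is64BitOperatingSystem);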
And do make sure to test with realistic data; your measurement is fundamentally skewed by the arrays containing only zeros. The difference is much less pronounced with non-zero data in the elements, because that makes the division much more expensive.
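For example, filling the arrays with arbitrary non-zero values before timing (any non-zero pattern will do, this is just a sketch) makes the benchmark exercise the other branch (arrij * i / j) instead of always taking the 1.0f / (i * j) path:

// Fill both arrays with non-zero values so the benchmark no longer
// hits the 1.0f / (i * j) branch on every element.
var rng = new Random(12345);
for (int i = 0; i < 64; ++i)
{
    for (int j = 0; j < 64; ++j)
    {
        doubles[i, j] = rng.NextDouble() + 0.5;          // always non-zero
        floats[i, j] = (float)(rng.NextDouble() + 0.5);
    }
}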
Also note the bug in the double version: you should not use 1.0f there.
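That is, the conditional in CalcDoubles should use a double literal; as written, 1.0f / (i * j) performs the division in float precision and only then widens the result to double:

// Corrected inner statement for CalcDoubles: divide in double precision.
arrij = arrij == 0 ? 1.0 / (i * j) : arrij * (double)i / j;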