I'm having trouble understanding why my "concurrent" implementation of this http://www.codeproject.com/Tips/447938/High-performance-Csharp-byte-array-to-hex-string-t function has only ~20% performance gain.
For convenience here is the code from that site:
static readonly int[] toHexTable = new int[] {
3145776, 3211312, 3276848, 3342384, 3407920, 3473456, 3538992, 3604528, 3670064, 3735600,
4259888, 4325424, 4390960, 4456496, 4522032, 4587568, 3145777, 3211313, 3276849, 3342385,
3407921, 3473457, 3538993, 3604529, 3670065, 3735601, 4259889, 4325425, 4390961, 4456497,
4522033, 4587569, 3145778, 3211314, 3276850, 3342386, 3407922, 3473458, 3538994, 3604530,
3670066, 3735602, 4259890, 4325426, 4390962, 4456498, 4522034, 4587570, 3145779, 3211315,
3276851, 3342387, 3407923, 3473459, 3538995, 3604531, 3670067, 3735603, 4259891, 4325427,
4390963, 4456499, 4522035, 4587571, 3145780, 3211316, 3276852, 3342388, 3407924, 3473460,
3538996, 3604532, 3670068, 3735604, 4259892, 4325428, 4390964, 4456500, 4522036, 4587572,
3145781, 3211317, 3276853, 3342389, 3407925, 3473461, 3538997, 3604533, 3670069, 3735605,
4259893, 4325429, 4390965, 4456501, 4522037, 4587573, 3145782, 3211318, 3276854, 3342390,
3407926, 3473462, 3538998, 3604534, 3670070, 3735606, 4259894, 4325430, 4390966, 4456502,
4522038, 4587574, 3145783, 3211319, 3276855, 3342391, 3407927, 3473463, 3538999, 3604535,
3670071, 3735607, 4259895, 4325431, 4390967, 4456503, 4522039, 4587575, 3145784, 3211320,
3276856, 3342392, 3407928, 3473464, 3539000, 3604536, 3670072, 3735608, 4259896, 4325432,
4390968, 4456504, 4522040, 4587576, 3145785, 3211321, 3276857, 3342393, 3407929, 3473465,
3539001, 3604537, 3670073, 3735609, 4259897, 4325433, 4390969, 4456505, 4522041, 4587577,
3145793, 3211329, 3276865, 3342401, 3407937, 3473473, 3539009, 3604545, 3670081, 3735617,
4259905, 4325441, 4390977, 4456513, 4522049, 4587585, 3145794, 3211330, 3276866, 3342402,
3407938, 3473474, 3539010, 3604546, 3670082, 3735618, 4259906, 4325442, 4390978, 4456514,
4522050, 4587586, 3145795, 3211331, 3276867, 3342403, 3407939, 3473475, 3539011, 3604547,
3670083, 3735619, 4259907, 4325443, 4390979, 4456515, 4522051, 4587587, 3145796, 3211332,
3276868, 3342404, 3407940, 3473476, 3539012, 3604548, 3670084, 3735620, 4259908, 4325444,
4390980, 4456516, 4522052, 4587588, 3145797, 3211333, 3276869, 3342405, 3407941, 3473477,
3539013, 3604549, 3670085, 3735621, 4259909, 4325445, 4390981, 4456517, 4522053, 4587589,
3145798, 3211334, 3276870, 3342406, 3407942, 3473478, 3539014, 3604550, 3670086, 3735622,
4259910, 4325446, 4390982, 4456518, 4522054, 4587590
};
public static unsafe string ToHex1(byte[] source)
{
fixed (int* hexRef = toHexTable)
fixed (byte* sourceRef = source)
{
byte* s = sourceRef;
int resultLen = (source.Length << 1);
var result = new string(' ', resultLen);
fixed (char* resultRef = result)
{
int* pair = (int*)resultRef;
while (*pair != 0)
*pair++ = hexRef[*s++];
return result;
}
}
}
Here are my "improvements":
public static unsafe string ToHex1p(byte[] source)
{
var chunks = Environment.ProcessorCount;
var n = (int)Math.Ceiling(source.Length / (double)chunks);
int resultLen = (source.Length << 1);
var result = new string(' ', resultLen);
Parallel.For(0, chunks, k =>
{
var l = Math.Min(source.Length, (k + 1) * n);
fixed (char* resultRef = result) fixed (byte* sourceRef = source)
{
int from = n * k;
int to = (int)resultRef + (l << 2);
int* pair = (int*)resultRef + from;
byte* s = sourceRef + from;
while ((int)pair != to)
*pair++ = toHexTable[*s++];
}
});
return result;
}
Edit 1 This is how I time the functions:
var n = 0xff;
var s = new System.Diagnostics.Stopwatch();
var d = Enumerable.Repeat<byte>(0xce, (int)Math.Pow(2, 23)).ToArray();
s.Start();
for (var i = 0; i < n; ++i)
{
Binary.ToHex1(d);
}
Console.WriteLine(s.ElapsedMilliseconds / (double)n);
s.Restart();
for (var i = 0; i < n; ++i)
{
Binary.ToHex1p(d);
}
Console.WriteLine(s.ElapsedMilliseconds / (double)n);
After playing around with your example some, I conclude that the bulk of the disparity in timing you're seeing is due to GC overhead, with initialization overhead in both scenarios being high enough to make performance differences relatively unimportant even once the GC overhead is removed from the test.
When I switch the order of the tests, the parallel one winds up faster than the non-parallel one. This is the first sign that the test isn't fair. :)
When I changed the tests so that I called
GC.Collect()after each test, to ensure the GC was quiet during the subsequent one, the parallel version consistently came out ahead. But only barely so; the start-up time for each thread was in all cases over half the total duration of each parallel test.As part of my testing, I modified the code so that it tracked the actual time spent in each thread of the
For()version. Here, I found that the time spent in this code was about what I'd expect based on the non-parallel version (i.e. reasonably close to the time divided by the thread count).(Of course, one could obtain this information using a profiler too).
Here's the output for the tests I ran with
GC.Collect(). For the parallel test, this is also showing the start (relative to the overall test start time) and duration times for each thread.Running the non-parallel version first, parallel version second:
Running the parallel version first, non-parallel second:
Lessons learned:
One final note: another source of error in these kinds of tests is Intel's Hyperthreading. Or rather, if you test while using the Hyperthread-enabled CPU count, you'll get misleading results. For example, I tested this on my Intel i5-based laptop, which reports having 4 cores. But running four threads won't come close to a 4x speed-up over a non-parallel implementation, while running two threads will be close to a 2x speed-up (for the right problem). That's why even though my computer reports 4 CPUs, I tested with only 2 threads.
Here, there's so much other misleading overhead in this test that I don't think Hyperthreading makes a big difference. But it's something to watch out for.