I have a pretty simple function that demultiplexes data acquired from a board. The data comes in frames, each frame made up of one sample per signal, as a 1-dimensional array, and I need to convert it to jagged arrays, one per signal. Basically the following:
I'm working in C#, but I have a core function in C that does the work for a single signal:
// Extract one signal: copy every nSignals-th sample, starting at signalIndex.
// size_t keeps i * nSignals from overflowing a 32-bit int on large buffers.
void Functions::Demux(short* pMux, short* pDemux, int nSignals, int signalIndex, int nValuesPerSignal)
{
    short* pMuxStart = pMux + signalIndex;
    for (size_t i = 0; i < (size_t)nValuesPerSignal; i++)
        *pDemux++ = *(pMuxStart + i * nSignals);
}
and then I call it via C++/CLI (using pin_ptr<short>, so no copy) from C# with a parallel for:
Parallel.For(0, nSignals, (int i) =>
{
Core.Functions.Demux(muxed, demuxed[i], nSignals, i, nFramesPerSignal);
});
The muxed data comes from 16k signals (16-bit resolution), each signal sampled at 20k samples/s, which gives a data rate of 16k * 20k * 2 bytes = 640 MB/s. When running the code on a workstation with two Xeon E5-2620 v4 CPUs (16 cores total @ 2.1 GHz), demuxing takes about 115% of real time (for 10 s of data it takes 11.5 s).
I need to get down to at least half that time. Does anybody know of a way, perhaps with AVX technology, or better, of a high-performance library for this? Or maybe there is a way using the GPU? I would prefer not to upgrade the CPU hardware, because that would probably cost more.
Edit
Please consider that nSignals and nValuesPerSignal can change, and that the interleaved array must be split into nSignals separate arrays to be further handled in C#.
Edit: further tests
In the meantime, following the remark from Cody Gray, I tested with a single core:
// Test: one sequential pass over the whole buffer with a single core.
// Note that this reads pMux sequentially, filling one output array at a time.
void _Functions::_Demux(short* pMux, short** pDemux, int nSignals, int nValuesPerSignal)
{
    for (int i = 0; i < nSignals; i++)
    {
        for (int j = 0; j < nValuesPerSignal; j++)
            pDemux[i][j] = *pMux++;
    }
}
called from C++/CLI:
int nSignals = demuxValues->Length;
int nValuesPerSignal = demuxValues[0]->Length;
pin_ptr<short> pMux = &muxValues[0];
// Pin every output array so the GC can't move them while native code writes.
array<GCHandle>^ pins = gcnew array<GCHandle>(nSignals);
for (int i = 0; i < nSignals; i++)
    pins[i] = GCHandle::Alloc(demuxValues[i], GCHandleType::Pinned);
try
{
    array<short*>^ arrays = gcnew array<short*>(nSignals);
    for (int i = 0; i < nSignals; i++)
        arrays[i] = static_cast<short*>(pins[i].AddrOfPinnedObject().ToPointer());
    pin_ptr<short*> pDemux = &arrays[0];
    _Functions::_Demux(pMux, pDemux, nSignals, nValuesPerSignal);
}
finally
{
    for each (GCHandle pin in pins)
        pin.Free();
}
and I obtain a computing time of about 105% of real time, which is still too much, but it clearly shows that Parallel.For was not the right choice. From your replies I guess the only viable solution is SSE/AVX. I have never written code for that; could some of you point me in the right direction? I think we can assume that the processor will always support AVX2.
Edit: my initial code vs Matt Timmermans' solution
On my machine I compared my initial code (where I was simply using Parallel.For and calling a C function responsible for deinterlacing a single signal) with the code proposed by Matt Timmermans (still using Parallel.For, but in a cleverer way). See the results (in ms) against the number of tasks used in the Parallel.For (I have 32 threads):
N. Tasks    MyCode    MattCode
    4         1649         841
    8          997         740
   16          884         497
   32          810         290
So performance is much improved. However, I'll still do some tests on the AVX idea.
Good news! You don't need multiple cores, SIMD, or fancy packages to solve this problem. You probably don't even need to call out to C.
Your bottleneck is memory bandwidth, because you're using it inefficiently.
With that CPU, your memory is probably fast enough to demux > 3GB/s using one core, but each time you need a sample from RAM, the CPU fetches 64 bytes to fill a cache line, and you only use 2 of them. Those 64 bytes will hang around in cache for a while and some other threads will probably use some of them, but the access pattern is still really bad.
All you really have to do is make good use of those 64 bytes. There are many ways. For example:
1) Try a simple loop in C# (see the sketch after this list). Run through your input buffer from start to end, putting each sample you encounter where it belongs. This uses all 64 bytes of every cache line you read, and your 16K output channels are few enough that the blocks you're writing to will mostly stay cached. This will probably be fast enough.
2) Continue calling out to your C function, but process the input buffer in 2MB chunks, and don't bother with multiple cores. Each of those 2MB chunks is small enough to stay cached until you're done with it. This will probably be a bit faster than (1).
3) If the above aren't fast enough (it may be close), then you can go multithreaded. Use method (2), but do a parallel For over the chunks. That way each core can do a whole 2MB chunk, making good use of its cache, and the threads won't contend with each other. Use at most 4 threads, or you may start stressing your cache again. If you really need more than 4 threads, then divide the work more finely, into groups of 1024 channels within each 2MB block or so... but you won't need to do that.
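For concreteness, a minimal sketch of method (1) in plain C# could look like this (DemuxSimple is a hypothetical name; muxed and demuxed follow the question's variables, with muxed laid out frame by frame):

// Method (1): one sequential pass over the interleaved buffer.
// Every byte of each cache line read from muxed gets used.
static void DemuxSimple(short[] muxed, short[][] demuxed, int nSignals, int nValuesPerSignal)
{
    int k = 0;
    for (int j = 0; j < nValuesPerSignal; j++)   // for each frame
        for (int i = 0; i < nSignals; i++)       // one sample per signal per frame
            demuxed[i][j] = muxed[k++];
}

(As the EDIT below notes, managed-array indexing like this turned out to be slow in practice, which is why the test code uses unsafe pointers instead.)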
EDIT:
Oh, sorry -- option (1) is pretty difficult to implement in unsafe C#, because each fixed statement only fixes a few pointers, and using managed arrays is too slow. Option (2) is easy in unsafe C#, though, and still works fine. I wrote a test that processes one second of data.
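A minimal sketch of that chunked approach, under the stated assumptions (2 MB chunks, frame-interleaved input; DemuxChunk, DemuxChunked, and ChunkBytes are illustrative names, not the original test code):

const int ChunkBytes = 2 * 1024 * 1024;  // chunk size small enough to stay cached

// Demux frames [frame0, frame0 + frames) of the interleaved buffer.
static unsafe void DemuxChunk(short[] muxed, short[][] demuxed, int nSignals, int frame0, int frames)
{
    fixed (short* pMux = muxed)
    {
        short* chunk = pMux + (long)frame0 * nSignals;
        for (int i = 0; i < nSignals; i++)
        {
            fixed (short* pOut = demuxed[i])
            {
                short* src = chunk + i;          // signal i within this chunk
                short* dst = pOut + frame0;
                for (int j = 0; j < frames; j++)
                    dst[j] = src[(long)j * nSignals];
            }
        }
    }
}

// Method (2): single-threaded, one cache-friendly chunk at a time.
static void DemuxChunked(short[] muxed, short[][] demuxed, int nSignals, int nValuesPerSignal)
{
    int framesPerChunk = Math.Max(1, ChunkBytes / (nSignals * sizeof(short)));
    for (int frame0 = 0; frame0 < nValuesPerSignal; frame0 += framesPerChunk)
        DemuxChunk(muxed, demuxed, nSignals, frame0,
                   Math.Min(framesPerChunk, nValuesPerSignal - frame0));
}

The first signal's strided reads pull each cache line of the chunk into cache; the other 31 signals sharing those 64-byte lines then hit in cache instead of RAM.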
Hmmm... on my box that runs better than real time, but not twice as fast as real time. Your box looks to be a little faster than mine, so maybe it's OK. Let's try the parallel version (3); the Main function is the same.
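Reusing the DemuxChunk helper and ChunkBytes constant from the sketch above (again illustrative names, not the original test code), the parallel version might look like:

// Method (3): Parallel.For over chunks, capped at 4 threads as advised above.
// Requires: using System.Threading.Tasks;
static void DemuxChunkedParallel(short[] muxed, short[][] demuxed, int nSignals, int nValuesPerSignal)
{
    int framesPerChunk = Math.Max(1, ChunkBytes / (nSignals * sizeof(short)));
    int nChunks = (nValuesPerSignal + framesPerChunk - 1) / framesPerChunk;
    var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

    Parallel.For(0, nChunks, options, c =>
    {
        int frame0 = c * framesPerChunk;
        int frames = Math.Min(framesPerChunk, nValuesPerSignal - frame0);
        DemuxChunk(muxed, demuxed, nSignals, frame0, frames);
    });
}

Each Parallel.For iteration owns a whole chunk, so the cores don't fight over cache lines.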
That's better: the amount of data being read and written in this case is ~4.6 GB/s, which is a bit off my theoretical maximum of 6.4 GB/s, and I only have 4 real cores, so I might be able to get the time down a bit by calling out to C, but there's not a lot of room for improvement.