I'm porting some code to Windows and found threading to be extremely slow. The task takes 300 seconds on Windows (with two Xeon E5-2670 8-core 2.6 GHz CPUs = 16 cores) and 3.5 seconds on Linux (Xeon E5-1607, 4 cores, 3 GHz). I'm using VS2012 Express.
I've got 32 threads all calling EnterCriticalSection(), popping an 80-byte job off a std::queue, calling LeaveCriticalSection, and doing some work (250k jobs in total).
Before and after every critical section call I print the thread ID and current time.
- The wait time for a single thread's lock is ~160ms
- To pop the job off the stack takes ~3ms
- Calling leave takes ~3ms
- The job takes ~1ms
(The timings are roughly the same for Debug and Release; Debug takes a little longer. I'd love to be able to properly profile the code :P)
Commenting out the job call makes the whole process take 2 seconds (still slower than Linux).
I've tried both QueryPerformanceCounter and timeGetTime; both give approximately the same result.
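For reference, the QPC path looks roughly like this (an illustrative sketch, not my exact code; nowMs is a made-up helper name). Worth noting that timeGetTime can be ~5 ms granular by default unless timeBeginPeriod(1) is called, which matters at these scales.
#include <windows.h>
// Illustrative millisecond timer on QueryPerformanceCounter.
// (The static init isn't thread-safe pre-C++11; in real code,
// query the frequency once at startup instead.)
double nowMs()
{
    static LARGE_INTEGER freq = { 0 };
    if (freq.QuadPart == 0)
        QueryPerformanceFrequency(&freq);
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return t.QuadPart * 1000.0 / freq.QuadPart;
}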
AFAIK the job never makes any sync calls, but I can't explain the slowdown unless it does.
I have no idea why copying a job off the queue and calling pop takes so long. Another very confusing thing is why the unlock() call (LeaveCriticalSection) takes so long.
Can anyone speculate on why it's running so slowly?
I wouldn't have thought the difference in processor would give a 100x performance difference, but could it be related to the dual CPUs? (having to sync between separate sockets rather than between cores on a single die).
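One experiment that could confirm or rule that out (a hypothetical sketch, untested; the 0x00FF mask assumes logical processors 0-7 all sit on the first package): pin every worker to one socket and re-run, so the lock and queue cache lines never bounce between packages.
// Hypothetical test, run at the top of each worker thread:
// confine the thread to the first package so the critical section's
// cache line stays on one socket. The mask is an assumption about
// how this machine numbers its cores.
DWORD_PTR oneSocketMask = 0x00FF;
SetThreadAffinityMask(GetCurrentThread(), oneSocketMask);
If the time drops dramatically with all the threads crammed onto one socket, cross-socket traffic is at least part of the story.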
By the way, I'm aware of std::thread but want my library code to work with pre-C++11 compilers.
edit
// in a while (hasJobs) loop...
EVENT qwe1 = {"lock", timeGetTime(), id};   // about to request the lock
events.push_back(qwe1);
scene->jobMutex.lock();
EVENT qwe2 = {"getjob", timeGetTime(), id}; // lock acquired
events.push_back(qwe2);
hasJobs = !scene->jobs.empty();
if (hasJobs)
{
    job = scene->jobs.front();              // copy the job out
    scene->jobs.pop();
}
EVENT qwe3 = {"gotjob", timeGetTime(), id}; // job copied, about to unlock
events.push_back(qwe3);
scene->jobMutex.unlock();
EVENT qwe4 = {"unlock", timeGetTime(), id}; // lock released
events.push_back(qwe4);
if (hasJobs)
    scene->performJob(job);                 // the actual work
and the mutex class, with the Linux #ifdef stuff removed...
CRITICAL_SECTION mutex;
...
Mutex::Mutex()
{
    InitializeCriticalSection(&mutex);
}
Mutex::~Mutex()
{
    DeleteCriticalSection(&mutex);
}
void Mutex::lock()
{
    EnterCriticalSection(&mutex);
}
void Mutex::unlock()
{
    LeaveCriticalSection(&mutex);
}
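(A variant constructor I could also try, sketched here untested: InitializeCriticalSectionAndSpinCount makes the lock spin in user mode before falling back to a kernel wait, which is supposed to help under heavy contention. The spin value below is a guess; MSDN mentions roughly 4000 for the heap manager's lock.)
// Untested variant: spin briefly before blocking in the kernel.
// 4000 is the spin count MSDN cites for the heap lock; tune as needed.
Mutex::Mutex()
{
    InitializeCriticalSectionAndSpinCount(&mutex, 4000);
}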
It seems like your Windows threads are facing severe contention; they're essentially fully serialized. You have about 7 ms of total processing time inside your critical section and 32 threads. If all the threads are queued up on the lock, the last thread in the queue doesn't get to run until the 31 threads ahead of it have gone through, i.e. after roughly 31 × 7 ms ≈ 217 ms. That's not too far off your observed ~160 ms wait time.
So, if the threads have nothing to do other than enter the critical section, do their work, and leave the critical section, this is the behavior I would expect.
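If so, one mitigation to try (a sketch against the loop you posted, untested; I'm assuming your job type is called Job and BATCH is some small constant like 16) is to pop several jobs per lock acquisition, so the lock is taken ~250k/BATCH times instead of ~250k times:
// Sketch: amortize one lock acquisition over up to BATCH jobs.
std::vector<Job> batch;                      // needs <vector>
scene->jobMutex.lock();
while ((int)batch.size() < BATCH && !scene->jobs.empty())
{
    batch.push_back(scene->jobs.front());
    scene->jobs.pop();
}
scene->jobMutex.unlock();
hasJobs = !batch.empty();                    // loop again only if we got work
for (size_t i = 0; i < batch.size(); ++i)
    scene->performJob(batch[i]);
That cuts both the number of lock round-trips and the kernel wake-ups handed from one waiter to the next.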
Try to characterize the Linux profiling behavior as well, and see whether the two runs are really an apples-to-apples comparison.