Can SysInternals' Process Monitor log when a thread blocks waiting for an event?


I need to diagnose a server that is unable to reach peak performance. CPU usage drops to zero for around 500 ms and then spikes to 100% while the server works through the queued requests. This pattern repeats for a number of hours, after which operation becomes smooth again. (Operation had been smooth for years.)

This suggests to me that the worker threads are idling while waiting for an external event to occur. The application is complex and we haven't been able to pinpoint the culprit.

Can Process Monitor be configured to log every time a thread sleeps waiting for some event? If so, can each wait be tied to a particular stack trace?

If the above is possible, perhaps I could correlate the CPU drops with wait events and pinpoint the culprit.

I have successfully used WinDbg before to diagnose these kinds of problems. In this case, however, the wait is very brief and I'm not confident that I can make the debugger break exactly while the processor is idling.


There is 1 best solution below


WinDbg and ProcMon are not the right tools for this job. Install the Windows Performance Toolkit, which is part of the Windows 10 SDK, on your development machine.


Now xcopy the folder C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit to the server, open cmd.exe as administrator, run wpr.exe -start CPU && timeout -1 && wpr.exe -stop C:\Hang.etl, and then minimize the cmd window.

Once the hang has occurred, switch back to the cmd window and press a key to stop logging.

Move Hang.etl and the NGENPDB folder to the dev PC, open Hang.etl with Windows Performance Analyzer (WPA.exe), load the debug symbols, and start looking for the hang by adding the CPU Usage (Precise) graph to the analysis pane.

Make sure you see the columns NewProcess, NewThreadId, NewStack, ReadyingProcess, ReadyingThreadId, ReadyingStack, Waits(us). Click on Waits(us) to sort the longest waits to the top. Now look for long wait times with a small Count (that is, a few operations that each take a long time, rather than many short ones) and inspect the call stack for clues about what is happening.
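If the table gets too large to scan by eye, the same "long wait, small count" filter can be applied to an exported copy of it. The following is only a rough sketch in Python, assuming the table has been exported to CSV. The column headers (`Waits (us)`, `Count`, `New Thread Stack`), the thresholds, and the sample rows are all made up for illustration, so adjust them to whatever your export actually contains.

```python
import csv
import io

# Hypothetical column headers; change these to match the header row of your export.
WAIT_COL = "Waits (us)"
COUNT_COL = "Count"
STACK_COL = "New Thread Stack"


def find_long_waits(csv_text, min_wait_us=100_000, max_count=5):
    """Return (wait_us, count, stack) tuples for rows with a long wait
    and a small count, sorted with the longest wait first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Strip thousands separators in case the export contains them.
        wait_us = float(row[WAIT_COL].replace(",", ""))
        count = int(row[COUNT_COL].replace(",", ""))
        if wait_us >= min_wait_us and count <= max_count:
            rows.append((wait_us, count, row[STACK_COL]))
    rows.sort(key=lambda r: r[0], reverse=True)
    return rows


# Invented sample data, only to show the shape of the input.
SAMPLE = """\
New Thread Stack,Count,Waits (us)
app!Worker::WaitForLock,2,480000
app!Worker::ReadSocket,1500,900000
app!Logger::Flush,1,510000
app!Timer::Tick,3,200
"""

if __name__ == "__main__":
    for wait_us, count, stack in find_long_waits(SAMPLE):
        print(f"{wait_us:>12.0f} us  count={count:<4} {stack}")
```

With the invented sample, the socket read is dropped despite its large total because it is spread over 1500 waits, which leaves the two stacks that each blocked for roughly the 500 ms seen in the CPU drops.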