We sometimes see a "spike" of null reference exceptions. What I would like to do is to tell the server (via procdump or some mechanism) to "capture a dump, with the stack trace, whenever a null reference exception is seen to occur at a particular frequency for a given amount of time".
In other words, if null reference exceptions occur at a high rate (say, once per second) for a period of, say, 10 seconds, then I would want to get a dump file that has one of those exceptions "fully captured". By fully captured I mean with a full stack trace that identifies the method throwing the exception, and info that allows me to drill down to the offending line of code in the assembly code view within the dump (using WinDbg or a similar tool). Our environment is Windows servers.
Is that possible and, if so, how do I do that? And, are there ways to minimize the impact to the server's performance and still get the stack trace info I want?
We only have AppInsights for such exception spikes and, while it does indicate the method throwing the exception, it doesn't give line numbers. Unless the method is so small that it is clear from which line the exception is being thrown, it can be a pure guessing game as to which line is throwing, especially if the method is huge.
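To make the trigger condition concrete, here is a minimal sketch of the "high rate for a given amount of time" rule as a sliding-window check. It assumes the exception timestamps are already available as seconds in ascending order; the function name `first_spike` and its parameters are hypothetical and not part of ProcDump or any other tool.

```python
from collections import deque


def first_spike(timestamps, min_count=10, window_seconds=10.0):
    """Return the timestamp at which the spike rule first fires, or None.

    The rule fires once at least `min_count` exceptions fall inside a
    window of `window_seconds`. `timestamps` must be in ascending order.
    """
    recent = deque()
    for ts in timestamps:
        recent.append(ts)
        # Drop exceptions that are too old for the window ending at `ts`.
        while ts - recent[0] >= window_seconds:
            recent.popleft()
        if len(recent) >= min_count:
            return ts
    return None


if __name__ == "__main__":
    # One exception per second for ten seconds (hypothetical data).
    print(first_spike([float(i) for i in range(10)]))
```

The two thresholds correspond to the "once per second" and "10 seconds" figures mentioned above and would need tuning against real traffic.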
You already tagged the question ProcDump, so I assume you know the tool and it doesn't suit your purpose. Since someone implemented ProcDump, someone else (you?) could probably implement a similar tool and add the specific behavior you want. But that's a lot of effort.
To be clear: I don't know any tool which could perform the task that you want. And thus, you might not get any answer. But let me explain how I handle similar cases. Perhaps that fits you, too.
From the term NullReferenceException, I assume you're dealing with a .NET exception. Therefore this answer will consider .NET.
The approach I will suggest will use debugging in WinDbg. Attaching a debugger will always have a performance impact, because of the way exceptions are dispatched. IMHO, ProcDump also has this performance impact - maybe not as huge as WinDbg.
Consideration: it seems you have servers. If you have many servers and they do load balancing or something similar, you could probably set up one server so that it accepts fewer clients. On that server you could do the debugging. It's like A/B testing: some users will be debugged, others won't. This way, the majority of users will not notice a performance slowdown.
Procedure overview
We download all symbols in advance, so that any access to symbols will be fast and not download symbols from the Internet (which is slow).
We attach WinDbg (or cdb) to the affected process. Before we do that, let's have all commands available.
We load the .NET debugging extension (SOS).
We set up logging, because we don't want huge crash dumps for every exception. Taking a full memory dump is great for analysis, but writing GBs to disk may take a long time.
We set up exception handling to log the call stack for each NullReferenceException.
We output separators into the log file so that it can be split later and you can build statistics on how often each method throws a NullReferenceException.
Detach correctly
Test program
Before you do the following steps on the production machine, write a simple application that does nothing but throw a NullReferenceException. Use that to validate the procedure and make yourself familiar with it.
Downloading all symbols
This will be done once before production debugging. Everything else will be part of the production debugging.
ld *
Alternatively, you could also download all symbols for the whole system, but that's a bit overkill I'd say.
Loading the .NET extension
For the above demo program, you will not have enough time to attach a debugger. You could either insert a console readline or launch the executable in WinDbg and use
sxe ld clr;g
to wait until the CLR is loaded, so that the SOS commands work.
For .NET Framework, load the SOS extension using
.loadby sos clr
Try this with the minidump you have from the previous step.
For .NET Core, run
dotnet tool install -g dotnet-sos
and use .load with the full path to SOS.dll.
Setting up logging
.logopen /t /u NullReferences.log
Use a full path if WinDbg tells you that you don't have access.
/t appends the process ID and a timestamp to the log file name, and /u writes the file in Unicode.
Setting up exception handling
First let's ignore all exceptions:
The command is explained in detail in setting all exceptions, but note that we use sxi instead of sxd.
Now we can consider .NET exceptions. The easiest thing would be to set up exception handling for all types of .NET exceptions. You would use
sxe -c "!pe;!clrstack;g" clr
This will print the exception (!pe), print the .NET call stack (!clrstack) and immediately continue (g).
Why do we need !clrstack? Doesn't an exception come with a call stack? AFAIK not always. If the exception is caught and the call stack is never accessed programmatically, the exception object may not have call stack information. That's why I added !clrstack explicitly.
Maybe you could get rid of the !pe part, since NullReferenceExceptions tend to look similar. I doubt that I've ever seen one with an InnerException (which could be kind of interesting).
For specific .NET exceptions, we need a .NET-specific command from the SOS extension:
!soe -create System.NullReferenceException 1
This will use the pseudo-register $t1 as a boolean flag which we can then use. So the command is
sxe -c "!soe System.NullReferenceException 1; .if (@$t1==1){!pe;!clrstack};g" clr
Getting a split point
We extend the exception analysis command by adding .echo XXXXXSPLITXXXXX and .echo XXXXXSTACKXXXXX so that you're later able to process the file. We also switch to !clrstack -a, which additionally prints parameters and local variables.
So the command is
sxe -c "!soe System.NullReferenceException 1; .if (@$t1==1){.echo XXXXXSPLITXXXXX;!pe;.echo XXXXXSTACKXXXXX;!clrstack -a};g" clr
Detach and quit correctly
On a production system you want to detach before you quit. Use qd, which is basically .detach followed by q.
When debugging during development, you are probably used to simply quitting, which terminates the running program. Don't do that! Make it a habit to use qd in production debugging.
It is also a good habit to close the log file first, so that it is certainly written to disk. That makes it
.logclose;qd
Result
In the end you'll have a log file containing (example from the demo app):
To analyze it, you should write a small program (in Python maybe) which splits the file after each exception, tries to group the exceptions by call stack and builds statistics.
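A minimal sketch of such an analysis program follows. It relies only on the XXXXXSPLITXXXXX and XXXXXSTACKXXXXX marker lines; everything else is an assumption: the helper names are hypothetical, the log is assumed to be readable as UTF-16 (because of /u), and the address stripping is a crude guess at making otherwise identical stacks compare equal.

```python
import re
import sys
from collections import Counter

SPLIT = "XXXXXSPLITXXXXX"
STACK = "XXXXXSTACKXXXXX"

# Assumption: addresses and similar values show up as 0x-prefixed numbers
# or as bare hex strings of 8 to 16 digits. They differ between occurrences.
HEX = re.compile(r"\b(?:0x[0-9a-fA-F]+|[0-9a-fA-F]{8,16})\b")


def split_exceptions(log_text):
    """Return one {'exception': [...], 'stack': [...]} dict per logged exception."""
    records = []
    current = None
    section = "exception"
    for line in log_text.splitlines():
        marker = line.strip()
        # Only whole-line markers count, so the echoed sxe command line
        # (which contains both marker strings) is not mistaken for one.
        if marker == SPLIT:
            current = {"exception": [], "stack": []}
            records.append(current)
            section = "exception"
        elif marker == STACK and current is not None:
            section = "stack"
        elif current is not None:
            current[section].append(line)
    return records


def stack_key(stack_lines):
    """Normalize a stack so that only method names and layout are compared."""
    return "\n".join(HEX.sub("ADDR", l.strip()) for l in stack_lines if l.strip())


def count_by_stack(log_text):
    """Count how often each normalized call stack occurs in the log."""
    return Counter(stack_key(r["stack"]) for r in split_exceptions(log_text))


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-16") as f:
        for stack, count in count_by_stack(f.read()).most_common():
            print(f"{count} occurrence(s):\n{stack}\n")
```

Because !clrstack -a also prints parameters and locals, stacks that differ only in those values may still end up in separate groups; tightening `stack_key` is the place to adjust that.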
Getting quick
Ideally you want to put all commands needed for debugging into one line, so that the interruption for attaching to the process and getting everything done is as short as possible.
However, it's quite hard to get everything into one line. This has to do with WinDbg sometimes escaping special characters, sometimes not. Sometimes even spaces are relevant. And sometimes WinDbg just has bugs. Some of the issues are discussed here.
You could also try to put your commands into a script file and execute that with one of the $<, $><, $$<, $$><, $$>a< commands.
I'll leave that work to you, since writing all this already took too much time.
You certainly want to break in and continue within one second. This will be noticed by users, but it'll be treated as a network lag or something.
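As a starting point for the script-file approach, here is a sketch that assembles the commands from the steps above into one file. The file name, the log path and the helper names are hypothetical, the final g (to resume the process) is my assumption, and the "ignore all exceptions" step is left out because its command is not reproduced here. Test it against the demo program first, since quoting inside script files can behave differently from typing the commands.

```python
from pathlib import Path

# Copied from the "Getting a split point" step.
SXE_COMMAND = (
    'sxe -c "!soe System.NullReferenceException 1; '
    ".if (@$t1==1){.echo XXXXXSPLITXXXXX;!pe;.echo XXXXXSTACKXXXXX;!clrstack -a};"
    'g" clr'
)


def build_script(log_path):
    """Return the debugger commands, one per line, as a single string."""
    lines = [
        # .NET Framework only; for .NET Core use .load with the path to SOS.dll.
        ".loadby sos clr",
        f".logopen /t /u {log_path}",
        # The "ignore all exceptions" command would go here (not included).
        SXE_COMMAND,
        # Assumption: resume the process as soon as everything is set up.
        "g",
    ]
    return "\n".join(lines) + "\n"


def write_script(script_path, log_path):
    """Write the command file to disk and return its path."""
    Path(script_path).write_text(build_script(log_path), encoding="ascii")
    return script_path


if __name__ == "__main__":
    # Hypothetical log location; a full path avoids access problems.
    print(build_script(r"C:\debug\NullReferences.log"), end="")
```

After attaching, the file written by `write_script` could then be run with a single command such as $$>< followed by the file's path, which keeps the time the process is stopped short.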