Capture memory dump with stack for a given exception type seen to occur at a given frequency

We sometimes see a "spike" of null reference exceptions. What I would like to do is to tell the server (via procdump or some mechanism) to "capture a dump, with the stack trace, whenever a null reference exception is seen to occur at a particular frequency for a given amount of time".

In other words, if null reference exceptions occur at a high rate (say, once per second) for a period of, say, 10 seconds, then I would want to get a dump file that has one of those exceptions "fully captured". By fully captured I mean with a full stack trace that identifies the method throwing the exception, plus enough info to let me drill down to the offending line of code in the assembly code view within the dump (using WinDbg or a similar tool). Our environment is Windows servers.
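
To make the trigger condition concrete, here is a minimal Python sketch of what "at a given frequency for a given amount of time" could mean. The function name, the thresholds and the idea of feeding it exception timestamps in seconds are illustrative assumptions only, not part of any existing tool.

```python
from collections import deque


def spike_detected(timestamps, min_count=10, window_seconds=10.0):
    """Return True if some window of `window_seconds` holds at least
    `min_count` exceptions (hypothetical trigger condition)."""
    window = deque()
    for t in sorted(timestamps):
        window.append(t)
        # Drop events that are older than the window ending at t.
        while t - window[0] > window_seconds:
            window.popleft()
        if len(window) >= min_count:
            return True
    return False
```

With the defaults, ten exceptions spread over ten seconds count as a spike, while one exception every five seconds does not.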

Is that possible and, if so, how do I do that? And, are there ways to minimize the impact to the server's performance and still get the stack trace info I want?

We only have AppInsights for such exception spikes and, while it does indicate the method throwing the exception, it doesn't give line numbers. Unless the method is so small that it is clear from which line the exception is being thrown, it can be a pure guessing game as to which line is throwing, especially if the method is huge.

Answered by Thomas Weller

You already tagged the question ProcDump, so I assume you know the tool and it doesn't suit your purpose. Since someone was able to implement ProcDump, someone else (you?) could implement a similar tool and add the specific behavior you want. But that's a lot of effort.

To be clear: I don't know any tool which could perform the task that you want. And thus, you might not get any answer. But let me explain how I handle similar cases. Perhaps that fits you, too.

From the term NullReferenceException, I assume you're dealing with a .NET exception. Therefore this answer will consider .NET.

The approach I will suggest will use debugging in WinDbg. Attaching a debugger will always have a performance impact, because of the way exceptions are dispatched. IMHO, ProcDump also has this performance impact - maybe not as huge as WinDbg.

Consideration: it seems you have servers. If you have many servers and they do load balancing or something, you could probably set up one server in a way that it accepts fewer clients. On that server you could do the debugging. It's like A/B testing: some users will be debugged, others won't. This way, the majority of users will not notice a performance slowdown.

Procedure overview

  1. We download all symbols in advance, so that any access to symbols will be fast and not download symbols from the Internet (which is slow).

  2. We attach WinDbg (or cdb) to the affected process. Before we do that, let's have all commands prepared.

  3. We load the SOS extension, so that the debugger understands .NET.

  4. We set up logging, because we don't want huge crash dumps for every exception. Taking a full memory dump is great for analysis, but writing GBs to disk may take a long time.

  5. We set up exception handling to log the call stack for each NullReferenceException.

  6. We output separators into the log file so that it can be split later and you can build some statistics on which method has NullReferenceException how often.

  7. We detach correctly, so that the process keeps running.

Test program

Before you do the following steps on the production machine, write a simple application that does nothing but throw a NullReferenceException. Use that to validate the procedure and make yourself familiar with it.

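using System;

class Program
{
    static void Main()
    {
        // Throw and swallow three NullReferenceExceptions,
        // so the debugger sees three first-chance exceptions.
        for (int i = 0; i < 3; i++)
        {
            try { throw new NullReferenceException(); }
            catch (NullReferenceException) { }
        }
    }
}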

Downloading all symbols

This will be done once before production debugging. Everything else will be part of the production debugging.

  1. In order to download the symbols in advance, you need a minidump of your process. Taking a minidump will not have a huge performance impact. I have listed various options here, but just take a minidump, not a full dump. As a GUI tool I think Process Explorer is the simplest to use.
  2. Open the crash dump in WinDbg on the production machine.
  3. Set up your symbols correctly.
  4. Download all symbols by typing ld * (this loads the symbols for all modules).

Alternatively, you could also download all symbols for the whole system, but that's a bit overkill I'd say.

Loading the .NET extension

For the above demo program, you will not have enough time to attach a debugger. You could either insert a Console.ReadLine() or launch the executable in WinDbg and use sxe ld clr;g to run until the CLR is loaded, so that the SOS extension can be loaded.

For .NET Framework, load the SOS extension using .loadby sos clr. Try this with the minidump you have from the previous step.

For .NET Core run dotnet tool install -g dotnet-sos and use .load with the full path to SOS.dll.

Setting up logging

.logopen /t /u NullReferences.log

Use a full path if WinDbg tells you that you don't have access.

/t appends the process ID and the current date and time to the log file name, and /u writes the log in Unicode.

Setting up exception handling

First let's ignore all exceptions:

.foreach(exc {.echo "ct et cpr epr ld ud ser ibp iml out av asrt aph bpe bpec eh clr clrn cce cc dm dbce gp ii ip dz iov ch hc lsq isc 3c svh sse ssec sbo sov vs vcpp wkd rto rtt wob wos *"}) {.catch{sxi ${exc}}}

The command is explained in detail in setting all exceptions, but note that we use sxi instead of sxd.

Now we can consider .NET exceptions. The easiest thing would be to set up exception handling for all types of .NET exceptions. You would use sxe -c "!pe;!clrstack;g" clr. This will print the exception (!pe), print the .NET call stack (!clrstack) and immediately continue (g).

Why do we need !clrstack? Doesn't an exception come with a callstack? AFAIK not always. If the exception is caught and the callstack is never accessed programmatically, the exception object may not have callstack information. That's why I put !clrstack explicitly.

Maybe you could get rid of the !pe part, since NullReferenceExceptions tend to look similar. I doubt that I've ever seen one with an InnerException (which could be kind of interesting).

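For specific .NET exceptions, we need a .NET-specific command from the SOS extension: !soe (short for !StopOnException). Used on its own as !soe -create System.NullReferenceException 1, it sets up the break for that exception type by itself. Used without -create, as !soe System.NullReferenceException 1, it only acts as a predicate: it sets the pseudo register $t1 to 1 if the last exception thrown on the current thread is a NullReferenceException and to 0 otherwise, so we can use $t1 as a boolean flag. So the command is sxe -c "!soe System.NullReferenceException 1; .if (@$t1==1){!pe;!clrstack};g" clr.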

Getting a split point

We extend the exception analysis command by two markers, .echo XXXXXSPLITXXXXX and .echo XXXXXSTACKXXXXX, so that you're later able to split and process the file.

So the command is sxe -c "!soe System.NullReferenceException 1; .if (@$t1==1){.echo XXXXXSPLITXXXXX;!pe;.echo XXXXXSTACKXXXXX;!clrstack -a};g" clr. Note that it now uses !clrstack -a, which also prints the parameters and local variables of each frame.

Detach and quit correctly

On a production system you want to detach before you quit. Use qd, which is basically .detach and q.

When debugging during development, you are probably used to simply quitting, which terminates the running program. Don't do that! Make it a habit to use qd in production debugging.

As a good habit, close the log file first, so that it's certainly written to disk. That makes it .logclose;qd.

Result

In the end you'll have a log file containing (example from the demo app):

(59a0.1d5c): CLR exception - code e0434352 (first chance)
r$t1=0
r$t1=1
XXXXXSPLITXXXXX
Exception object: 02a76ec4
Exception type:   System.NullReferenceException
Message:          Object reference not set to an instance of an object.
InnerException:   <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 80004003
XXXXXSTACKXXXXX
OS Thread Id: 0x1d5c (0)
Child SP       IP Call Site
008ff160 758fe4f2 [HelperMethodFrame: 008ff160] 
008ff210 00d5089e ConsoleNetFramework.Program.Main() [B:\...\Program.cs @ 12]
    LOCALS:
        0x008ff21c = 0x00000001
        0x008ff218 = 0x00000001

008ff3ac 60aa0556 [GCFrame: 008ff3ac] 

To analyze it, you should write a small program (in Python maybe) which splits the file after each exception, groups the exceptions by call stack and builds some statistics.

Getting quick

Ideally you want to put all commands needed for debugging into one line, so that the interruption for attaching to the process and getting everything done is as short as possible.

However, it's quite hard to get everything into one line. This has to do with WinDbg sometimes escaping special characters, sometimes not. Sometimes even spaces are relevant. And sometimes WinDbg just has bugs. Some of the issues are discussed here.

You could also try to put your commands into a script file and execute that with one of the $<, $><, $$<, $$><, $$>a< commands.
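
As a starting point, here is a small Python sketch that only assembles the commands from this answer into such a script file, one command per line. The function names and the log path are hypothetical, and the trailing g (to resume the process once everything is set up) is my assumption. I have not checked which of the script commands copes with the quoting, so validate the generated file against the test program first.

```python
# Commands copied from the sections above; this code never talks to a debugger.
IGNORE_ALL = (
    '.foreach(exc {.echo "ct et cpr epr ld ud ser ibp iml out av asrt aph bpe '
    'bpec eh clr clrn cce cc dm dbce gp ii ip dz iov ch hc lsq isc 3c svh sse '
    'ssec sbo sov vs vcpp wkd rto rtt wob wos *"}) {.catch{sxi ${exc}}}'
)


def build_commands(log_path, exception_type="System.NullReferenceException"):
    """Return the debugger commands in the order they should be run."""
    handler = (
        'sxe -c "!soe ' + exception_type + ' 1; .if (@$t1==1)'
        '{.echo XXXXXSPLITXXXXX;!pe;.echo XXXXXSTACKXXXXX;!clrstack -a};g" clr'
    )
    return [
        ".loadby sos clr",                # .NET Framework; .NET Core needs .load <path>
        ".logopen /t /u " + log_path,     # start logging
        IGNORE_ALL,                       # ignore all exceptions first
        handler,                          # then log NullReferenceExceptions
        "g",                              # assumption: resume the process
    ]


def write_script(script_path, log_path):
    """Write the commands to a script file with Windows line endings."""
    with open(script_path, "w", encoding="ascii", newline="\r\n") as f:
        for command in build_commands(log_path):
            f.write(command + "\n")
```

For example, write_script("nre.txt", r"C:\temp\NullReferences.log") would produce a five-line file; both names are placeholders.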

I'll leave that work to you, since writing all this already took too much time.

You certainly want to break in and continue within one second. Users will notice the pause, but they will most likely attribute it to network lag or something similar.