I'm currently debugging a failure in PyTorch, a Python library with a C++ extension, so the Python code calls into C++ code.
The failure happens because floating-point exception (FPE) traps get enabled before a seemingly innocent std::exp call, which then causes a core dump. Strangely, when I reduce this to a minimal example that just enables the traps via feenableexcept and then calls std::exp with the same values, the crash/core dump does not occur. So I'm stuck with debugging the original application.
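A reduction along these lines might look like the following sketch. The function name exp_with_traps, the chosen trap set and the input value are assumptions for illustration, not the values from the real run. feenableexcept, fedisableexcept and fegetexcept are glibc extensions, so this assumes Linux with glibc.

```cpp
#include <cassert>
#include <cfenv>   // feenableexcept/fedisableexcept/fegetexcept: glibc extensions
#include <cmath>

// Hypothetical reduction: enable a few FP traps, call std::exp, then restore
// the previous trap mask. Trap set and input are placeholders.
double exp_with_traps(double x) {
    const int traps = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW;

    std::feclearexcept(FE_ALL_EXCEPT);           // drop stale sticky flags first
    const int previous = feenableexcept(traps);  // returns the previously enabled set
    assert(previous != -1);                      // -1 signals failure
    assert((fegetexcept() & traps) == traps);    // traps are now armed

    volatile double input = x;                   // keep the call from being constant-folded
    const double result = std::exp(input);       // would raise SIGFPE on a trapped exception

    fedisableexcept(FE_ALL_EXCEPT);              // put the mask back as it was
    feenableexcept(previous);
    return result;
}
```

With an ordinary finite argument such as 1.0, only FE_INEXACT can be raised, which is not in the trap set above, so the call returns normally.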
Using printf-debugging (of course the code doesn't break, i.e. the traps are not set, when compiled in debug mode), I narrowed it down to a throw c10::Error(...) statement. This class is derived from std::exception, so nothing unusual here. To translate that C++ exception into a Python exception, a catch(...){ /*set a bool*/; throw;}catch(c10::Error&){...} is entered. Nothing looks odd so far, and of course this also does not reproduce in a minimal setup doing the same.
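For illustration, a minimal setup of that throw/rethrow/catch pattern could be sketched as below. DemoError is a hypothetical stand-in for c10::Error, and TrapMasks and throw_and_translate are made-up names. The sketch records fegetexcept() before the throw and inside both handlers so the values can be compared.

```cpp
#include <cfenv>      // fegetexcept is a glibc extension
#include <exception>
#include <string>
#include <utility>

// Hypothetical stand-in for c10::Error: just a std::exception subclass.
class DemoError : public std::exception {
public:
    explicit DemoError(std::string msg) : msg_(std::move(msg)) {}
    const char* what() const noexcept override { return msg_.c_str(); }
private:
    std::string msg_;
};

// Trap masks observed at the three interesting points.
struct TrapMasks {
    int before_throw;
    int in_inner_catch;
    int in_outer_catch;
    bool saw_exception;
};

TrapMasks throw_and_translate() {
    TrapMasks m{fegetexcept(), -1, -1, false};
    try {
        try {
            throw DemoError("boom");
        } catch (...) {
            m.saw_exception = true;            // the "set a bool" step
            m.in_inner_catch = fegetexcept();
            throw;                             // rethrow to the typed handler
        }
    } catch (const DemoError&) {
        m.in_outer_catch = fegetexcept();      // where the translation would happen
    }
    return m;
}
```

In a setup without the problem, all three recorded masks should be equal. A difference between before_throw and in_inner_catch would point at the unwinding step in between.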
Using gdb with catch throw and catch catch, I got to the place where this exception is thrown and caught, did some single-stepping (step) followed by p fegetexcept(), and indeed:
90 in ../../../../libstdc++-v3/libsupc++/eh_throw.cc
(gdb) p fegetexcept()
$20 = 0
(gdb) s
Catchpoint 4 (exception caught), __cxxabiv1::__cxa_begin_catch (exc_obj_in=0x11c6da60) at ../../../../libstdc++-v3/libsupc++/eh_catch.cc:42
42 ../../../../libstdc++-v3/libsupc++/eh_catch.cc: No such file or directory.
(gdb) p fegetexcept()
$21 = 536870912
So right inside the throw the FPE traps are still not set, and right inside the catch they are. The line in eh_throw.cc is _Unwind_RaiseException (&header->exc.unwindHeader); which I can't step into.
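The number printed by p fegetexcept() is a bitmask of the platform's FE_* constants (536870912 is 0x20000000). A small helper like the following sketch could turn such a value into names. describe_fp_mask is a hypothetical name, and the helper only knows the five standard exception macros, so any other bits are reported numerically.

```cpp
#include <cfenv>
#include <string>

// Hypothetical helper: translate a trap/exception bitmask into FE_* names.
// Only the five standard macros are decoded; platform-specific bits are
// appended as a plain number.
std::string describe_fp_mask(int mask) {
    struct Flag { int bit; const char* name; };
    const Flag flags[] = {
        {FE_DIVBYZERO, "FE_DIVBYZERO"},
        {FE_INEXACT,   "FE_INEXACT"},
        {FE_INVALID,   "FE_INVALID"},
        {FE_OVERFLOW,  "FE_OVERFLOW"},
        {FE_UNDERFLOW, "FE_UNDERFLOW"},
    };

    std::string out;
    int known = 0;
    for (const Flag& f : flags) {
        known |= f.bit;
        if ((mask & f.bit) == f.bit) {
            if (!out.empty()) out += " | ";
            out += f.name;
        }
    }

    const int rest = mask & ~known;            // bits outside the standard set
    if (rest != 0) {
        if (!out.empty()) out += " | ";
        out += "unknown(" + std::to_string(rest) + ")";
    }
    return out.empty() ? "none" : out;
}
```

Logging describe_fp_mask(fegetexcept()) before the throw and inside the catch would show which trap gets switched on. On glibc for PowerPC, 0x20000000 should correspond to FE_INVALID, but that is an assumption worth checking against the platform's fenv.h.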
Also, the value of fegetexcept() differs between program invocations. Furthermore, the problem goes away if I do NOT build with GLOG, which I traced further to GLOG using libunwind.
However, I can't get any further than the point where libunwind calls setcontext, for which I only get assembly. At the line lfd fp29,(SIGCONTEXT_FP_REGS+(PT_R29*8))(r31) the value of fegetexcept() changes.
So this looks like an issue in libunwind. However, the issue also does not appear when I use clang 9.0.1 instead of GCC 8.3.0. So I'm at a loss here.
Does anyone have an idea what the issue could be, what else I can do, or whether there is a known bug? This is using glibc 2.17 and libunwind 1.4.0, in case that matters.