How can I get the JVM to exit quickly after a SIGSEGV crash?


We have a service that crashes frequently due to some issue with TensorFlow Java. That we can live with (K8s restarts it, lots of instances). The problem is that it takes several minutes for the JVM to die. Is there some way to force a quick exit on SIGSEGV in native code?

corrupted size vs. prev_size while consolidating
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe4f321a898, pid=1, tid=545
#
# JRE version: OpenJDK Runtime Environment Zulu21.28+85-CA (21.0+35) (build 21+35)
# Java VM: OpenJDK 64-Bit Server VM Zulu21.28+85-CA (21+35, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0x28898]  abort+0x178
#
# Core dump will be written. Default location: /data/core
#
# An error report file with more information is saved as:
# /data/hs_err_pid1.log

Some minutes later:

# [ timer expired, abort... ]
[thread 1037 also had an error]

apangin (best answer)

Add the following JVM options:

-XX:+SuppressFatalErrorMessage -XX:-CreateCoredumpOnCrash

This forces the JVM to terminate immediately on SIGSEGV without creating an error report or core dump. If you still want to see the fatal error message, replace -XX:+SuppressFatalErrorMessage with -XX:ErrorLogTimeout=1, which limits the time spent writing the error report to one second.
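One way to apply these flags without touching the launch command is the JAVA_TOOL_OPTIONS environment variable, which the JVM reads at startup. The following is only a sketch of a container entrypoint; the variable name CRASH_OPTS and the service.jar launch line are placeholders, not part of the original setup.

```shell
# Flags from the answer: skip the error report and the core dump on a fatal error.
CRASH_OPTS="-XX:+SuppressFatalErrorMessage -XX:-CreateCoredumpOnCrash"

# Alternative that keeps the fatal error message but caps report writing at 1 second:
# CRASH_OPTS="-XX:ErrorLogTimeout=1 -XX:-CreateCoredumpOnCrash"

# Append to any options already present in the environment.
JAVA_TOOL_OPTIONS="${JAVA_TOOL_OPTIONS:+$JAVA_TOOL_OPTIONS }$CRASH_OPTS"
export JAVA_TOOL_OPTIONS
echo "JAVA_TOOL_OPTIONS=$JAVA_TOOL_OPTIONS"

# Hypothetical launch command; replace service.jar with the real artifact.
# exec java -jar service.jar
```

Passing the same flags directly on the java command line works equally well; the environment variable is just convenient when the command is baked into an image.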

raner

I would suspect that this JVM is running with a pretty large heap (> 64 GB), and that it just takes a while to write out the core dump file for a process that uses so much memory:

# Core dump will be written. Default location: /data/core

During the several minutes that this takes, you might see the core dump file growing in the above location; checking for that would be an easy way to confirm this theory.
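A rough way to check this from a shell inside the container is to sample the disk usage of the core location while the process is dying. The helper below is a sketch: the name watch_core and its defaults are made up for illustration, and /data/core is taken from the log above (it may be a file or a directory depending on how the system is configured).

```shell
# watch_core PATH SAMPLES INTERVAL_SECONDS
# Prints the disk usage of PATH a fixed number of times, so growth is visible.
watch_core() {
  path="${1:-/data/core}"
  samples="${2:-10}"
  interval="${3:-5}"
  i=0
  while [ "$i" -lt "$samples" ]; do
    if [ -e "$path" ]; then
      # du -sk prints the total usage in KiB, followed by a tab and the path.
      size_kib=$(du -sk "$path" | cut -f1)
      echo "$(date +%T) $path ${size_kib} KiB"
    else
      echo "$(date +%T) $path not present yet"
    fi
    i=$((i + 1))
    sleep "$interval"
  done
}

# Example: ten samples of /data/core, five seconds apart.
# watch_core /data/core 10 5
```

If the reported size keeps climbing between samples, the delay is most likely the core dump being written.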

The remedy would be to disable the creation of core dump files. The details depend on your specific operating system, but core dumps can be disabled on pretty much any UNIX-like operating system. Additionally, there might be a filesystem-related bottleneck at that specific location that causes core dumps to be written more slowly than one would expect.
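On Linux, one common way to do this is to set the core file size limit to zero in the shell that starts the JVM, since child processes inherit the limit. This is a minimal sketch, assuming the service is started from a shell entrypoint and that core files are written directly to disk; the launch line is a placeholder.

```shell
# Set the core file size limit to 0 for this shell and every process it starts.
ulimit -c 0
echo "core file size limit: $(ulimit -c)"

# Hypothetical launch command; replace service.jar with the real artifact.
# exec java -jar service.jar
```

With the limit at zero, the JVM's crash banner should report that core dumps have been disabled instead of "Core dump will be written".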