How to reduce time taken on threads reaching Safepoint - Sync state

1.1k Views Asked by At

About the Issue:
During heavy IO in the VM, we faced JVM pause/slowness due to stopping threads taking more time. When looking on safepoint logs it showed Sync state takes the most time.

Safepoint log reference

We also tried printing Safepoint traces on timeout delay (-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=200) to know which threads is causing this issue but nothing seems to be suspicious. Also when setting timeout for safepoints, we are not getting timeout detected print when the time spent is in 'Sync' state.

Questions about this safepoint tracing:

  1. How does the safepoint timeout work?
  2. After logging the thread details, does the safepoint exists and all threads resume?
  3. Will that VM operation be carried out. What will happen if the vmop is GC.

Using Async-profiler:
Tried time-to-safepoint profiling using async-profiler and noticed VM Thread is taking more time on SafepointSynchronize::begin() method and C2 compiler threads is taking almost equal time as VM Thread.

Flame Graph

We doubt that C2 Compilers may be taking time to reach safepoint. Can someone help us in resolving this issue and to interpret this time-to-safepoint flamegraph. Thanks in advance.

1

There are 1 best solutions below

2
On

SafepointTimeout option affects nothing but logging, i.e. threads will not be interrupted, VM operation will run normally, etc.

SafepointTimeout does not always print timed out threads: a thread may already have reached the safepoint by the time printing occurs. Furthermore, SafepointTimeout may not even detect a timeout, if the entire process has been frozen by the Operating System.

For example, such 'freezes' many happen

  • when a process has exhausted its cpu quota in a cgroup (container);
  • when a system is low on physical memory, and direct reclaim occurs;
  • due to activity of another process (e.g. I observed long JVM pauses when atop utility inspected the system).

async-profiler indeed has a time-to-safepoint profiling option (--ttsp), though using it correctly may seem tricky. It works best in wall profiling mode with jfr output. In this configuration, async-profiler will sample all threads (both running and blocking) during safepoint synchronization, and record each individual event with a timestamp.

Such profile can be then analyzed with JDK Mission Control: choose the time interval around the long pause, and look at the stack traces of java threads in this interval.

Note that if the JVM process is 'frozen', async-profiler thread does not work either, i.e. you will not see collected samples during this period. Normally, in wall clock profiling mode, all threads are sampled evenly. But if you see a 'gap ' (missed events during some time interval), it apparently means the JVM process has not received CPU time. In this case, the reason of JVM pauses is not in the Java application, but rather in the operating system / environment.