CMS class unloading took much time

574 Views Asked by At

On large load was noticed large GC pause(400ms) for our application. During investigation it turns out, that pause happens on CMS Final Remark and class unloading phase took a lot more time, than other phases(10x-100X):

    (CMS Final Remark) 
        [YG occupancy: 142247 K (294912 K)]
2019-03-13T07:38:30.656-0700: 24252.576: 
        [Rescan (parallel) , 0.0216770 secs]
2019-03-13T07:38:30.677-0700: 24252.598: 
        [weak refs processing, 0.0028353 secs]      
2019-03-13T07:38:30.680-0700: 24252.601: 
        [class unloading, 0.3232543 secs]       
2019-03-13T07:38:31.004-0700: 24252.924: 
        [scrub symbol table, 0.0371301 secs]
2019-03-13T07:38:31.041-0700: 24252.961: 
        [scrub string table, 0.0126352 secs]
        [1 CMS-remark: 2062947K(4792320K)] 2205195K(5087232K), 0.3986822 secs]
[Times: user=0.63 sys=0.01, real=0.40 secs]

Total time for which application threads were stopped: 0.4156259 seconds, 
Stopping threads took: 0.0014133 seconds

This pause always happens in first second of performance test, the duration of pause varies from 300ms to 400+ms.

Unfortunately, I have no access to server(it's under maintenance) and have only logs for several test runs. But when the server would available, I want to be prepared for further investigation, but I have no idea of what causes such a behavior.

My first thought was about Linux Huge pages, but we don't use them.

After more time in logs, I found following:

Heap after GC invocations=7969 (full 511):
 par new generation   total 294912K, used 23686K [0x0000000687800000, 0x000000069b800000, 0x000000069b800000)
  eden space 262144K,   0% used [0x0000000687800000, 0x0000000687800000, 0x0000000697800000)
  from space 32768K,  72% used [0x0000000699800000, 0x000000069af219b8, 0x000000069b800000)
  to   space 32768K,   0% used [0x0000000697800000, 0x0000000697800000, 0x0000000699800000)
 concurrent mark-sweep generation total 4792320K, used 2062947K [0x000000069b800000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 282286K, capacity 297017K, committed 309256K, reserved 1320960K
  class space    used 33038K, capacity 36852K, committed 38872K, reserved 1048576K
}


Heap after GC invocations=7970 (full 511):
 par new generation   total 294912K, used 27099K [0x0000000687800000, 0x000000069b800000, 0x000000069b800000)
  eden space 262144K,   0% used [0x0000000687800000, 0x0000000687800000, 0x0000000697800000)
  from space 32768K,  82% used [0x0000000697800000, 0x0000000699276df0, 0x0000000699800000)
  to   space 32768K,   0% used [0x0000000699800000, 0x0000000699800000, 0x000000069b800000)
 concurrent mark-sweep generation total 4792320K, used 2066069K [0x000000069b800000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 282303K, capacity 297017K, committed 309256K, reserved 1320960K
  class space    used 33038K, capacity 36852K, committed 38872K, reserved 1048576K
}

Investigating GC pause happens between GC invocations 7969 and 7970. And the amount of used space in meta space is almost the same(it's actually increased)

So, It looks like it's not actually some stall classes which aren't used anymore(since no space was cleared) and it's not safe point reaching issue - since blocking of threads took small time(0.0014133).

How to investigate such case and what diagnostic information is required for proper preparedness.

Technical details

Centos5 + JDK8 + CMS GC with args:

-XX:+CMSClassUnloadingEnabled 
-XX:CMSIncrementalDutyCycleMin=10 
-XX:+CMSIncrementalPacing 
-XX:CMSInitiatingOccupancyFraction=50 
-XX:+CMSParallelRemarkEnabled 
-XX:+DisableExplicitGC 
-XX:InitialHeapSize=5242880000 
-XX:MaxHeapSize=5242880000 
-XX:MaxNewSize=335544320 
-XX:MaxTenuringThreshold=6 
-XX:NewSize=335544320 
-XX:OldPLABSize=16
-XX:+UseCompressedClassPointers 
-XX:+UseCompressedOops 
-XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC
0

There are 0 best solutions below