Apache Geode cluster goes down when deploying jar files


We have an Apache Geode cluster with 4 locators and 3 servers. When we deployed a new version of a jar file to this Geode cluster, the locators detected a timeout and dissolved the cluster, removing every server and locator from the distributed system.

Here is the log entry that we suspect points to the root cause.

[warn 2020/10/10 17:04:05.989 CST <ThreadsMonitor> tid=0x11] Thread 2129 (0x851) is stuck

[warn 2020/10/10 17:04:05.993 CST <ThreadsMonitor> tid=0x11] Thread <2129> (0x851) that was executed at <10 Oct 2020 17:02:53 CST> has been stuck for <72.253 seconds> and number of thread monitor iteration <1>
Thread Name <Function Execution Processor119> state <BLOCKED>
Waiting on <java.lang.Class@48cd319d>
Owned By <Removing shunned GemFire node 172.18.13.13(server_13.13:27524)<v24>:41001> with ID <2285>
Executor Group <FunctionExecutionPooledExecutor>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
org.apache.geode.internal.cache.CacheFactoryStatics.getAnyInstance(CacheFactoryStatics.java:85)
org.apache.geode.cache.CacheFactory.getAnyInstance(CacheFactory.java:396)
org.apache.geode.internal.DeployedJar.cleanUp(DeployedJar.java:233)
org.apache.geode.internal.JarDeployer.registerNewVersions(JarDeployer.java:377)
org.apache.geode.internal.JarDeployer.deploy(JarDeployer.java:414)
org.apache.geode.management.internal.cli.functions.DeployFunction.execute(DeployFunction.java:79)
org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:201)
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:372)
org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:436)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:475)
org.apache.geode.distributed.internal.ClusterOperationExecutors.doFunctionExecutionThread(ClusterOperationExecutors.java:393)
org.apache.geode.distributed.internal.ClusterOperationExecutors$$Lambda$72/738636051.invoke(Unknown Source)
org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$62/1490297742.run(Unknown Source)
java.lang.Thread.run(Thread.java:748)
Lock owner thread stack
java.util.Timer.purge(Timer.java:462)
org.apache.geode.internal.SystemTimer.timerPurge(SystemTimer.java:287)
org.apache.geode.internal.cache.ExpirationScheduler.forcePurge(ExpirationScheduler.java:46)
org.apache.geode.internal.cache.LocalRegion.cancelAllEntryExpiryTasks(LocalRegion.java:7937)
org.apache.geode.internal.cache.LocalRegion.recursiveDestroyRegion(LocalRegion.java:2592)
org.apache.geode.internal.cache.LocalRegion.basicDestroyRegion(LocalRegion.java:6177)
org.apache.geode.internal.cache.DistributedRegion.basicDestroyRegion(DistributedRegion.java:1822)
org.apache.geode.internal.cache.LocalRegion.handleCacheClose(LocalRegion.java:7249)
org.apache.geode.internal.cache.DistributedRegion.handleCacheClose(DistributedRegion.java:2676)
org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2205)
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1550)
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2545)
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2408)
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1247)
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2303)
org.apache.geode.distributed.internal.membership.adapter.GMSMembershipManager.requestMemberRemoval(GMSMembershipManager.java:1507)
org.apache.geode.distributed.internal.membership.adapter.GMSMembershipManager.lambda$addSurpriseMember$1(GMSMembershipManager.java:875)
org.apache.geode.distributed.internal.membership.adapter.GMSMembershipManager$$Lambda$366/738075788.run(Unknown Source)
java.lang.Thread.run(Thread.java:748)


[warn 2020/10/10 17:04:05.993 CST <ThreadsMonitor> tid=0x11] There is 1 stuck thread in this node
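For readers who want to reproduce the shape of this warning outside Geode: the report says one thread is BLOCKED on a `java.lang.Class` monitor that another thread holds. The sketch below is not Geode code. It uses only the standard `java.lang.management` API, and the class name `BlockedThreadReport`, the `SharedState` lock holder, and the thread names are hypothetical, chosen for illustration. It shows how the blocked thread, the lock, and the lock owner can be read from inside a JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class BlockedThreadReport {
    // Hypothetical class whose Class object serves as the contended monitor.
    static final class SharedState {}

    /** Returns a one-line description if the thread is BLOCKED, otherwise null. */
    static String describeBlocked(Thread t) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(t.getId());
        if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
            return null;
        }
        return "Thread Name <" + info.getThreadName() + "> state <BLOCKED>"
            + " Waiting on <" + info.getLockName() + ">"
            + " Owned By <" + info.getLockOwnerName() + ">";
    }

    /** Forces one thread to block on a class monitor and reports what the JVM sees. */
    static String demo() {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread owner = new Thread(() -> {
            synchronized (SharedState.class) {
                held.countDown(); // signal that the monitor is now held
                try {
                    release.await(); // keep holding it until the report is taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "lock-owner");
        Thread waiter = new Thread(() -> {
            synchronized (SharedState.class) {
                // Nothing to do; the point is the wait to enter this block.
            }
        }, "function-executor");

        String report = null;
        try {
            owner.start();
            held.await();
            waiter.start();
            // Poll briefly until the waiter has actually reached the BLOCKED state.
            for (int i = 0; i < 500 && report == null; i++) {
                report = describeBlocked(waiter);
                if (report == null) {
                    Thread.sleep(10);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            release.countDown(); // always let both threads finish
        }
        return report;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the incident log the two roles are the function execution thread running the deploy (blocked) and the thread that was closing the cache after the member was shunned (lock owner), so the stuck thread looks like a consequence of the member removal rather than its cause.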

After the cluster dissolved, the server processes were still running but no longer doing any work, so we had to kill every one of them. Because this was in production, we panicked and followed this procedure to recover:

  1. The locators still seemed to be working, but we restarted them one by one using gfsh.
  2. Because gfsh could not see the servers, we had to kill all of the server processes.
  3. We started one server using gfsh, but the process hung at: [info ----- CST <main> tid=0x1] Initializing region PdxTypes
  4. We started another server using gfsh; the first server then moved on and began loading data from persistent storage.
  5. The cluster recovered once we started the last server.

Can any expert advise on this incident?

We suspected that the jar deployment put extra load on the cluster, so that something was very busy (for example, memory GC). But 'stuck for <72.253 seconds>' is abnormal. We also suspected a jar incompatibility problem, but we tried many times in the test environment, and deploying a jar (even a slightly incompatible one) never caused the servers to hang.

Any expert advice is welcome. Thanks a lot in advance.


There is 1 best solution below


We had to schedule a maintenance window for this jar file upgrade.

After stopping all of the workload on this server, we started deploying the jar file.

This time, we found that the JVM GC pause was more than 20 seconds.

So we realized that this is a performance issue, and that we need to tune the JVM GC.

The takeaway from this incident is that deploying a jar file can cause significant GC load. Just FYI.
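If you want to check for the same symptom in your own environment, one rough way is to compare the JVM's cumulative GC counters before and after an operation. This is a minimal sketch using the standard `GarbageCollectorMXBean` API, not anything Geode-specific. The class name `GcSnapshot` and its methods are hypothetical. Note that `getCollectionTime()` is accumulated elapsed time across collections, not the length of a single pause, so GC logs remain the better source for exact pause durations.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcSnapshot {
    final long collections; // total collections across all collectors
    final long totalMillis; // approximate accumulated collection time in ms

    GcSnapshot(long collections, long totalMillis) {
        this.collections = collections;
        this.totalMillis = totalMillis;
    }

    /** Reads the current cumulative GC counters of this JVM. */
    static GcSnapshot take() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 when the collector does not report it
            long t = gc.getCollectionTime();  // -1 when the collector does not report it
            if (c > 0) {
                count += c;
            }
            if (t > 0) {
                millis += t;
            }
        }
        return new GcSnapshot(count, millis);
    }

    /** Returns the GC activity that happened between an earlier snapshot and this one. */
    GcSnapshot since(GcSnapshot earlier) {
        return new GcSnapshot(collections - earlier.collections,
                              totalMillis - earlier.totalMillis);
    }

    public static void main(String[] args) {
        GcSnapshot before = GcSnapshot.take();
        // Placeholder for the operation under observation, e.g. a jar deployment.
        System.gc();
        GcSnapshot delta = GcSnapshot.take().since(before);
        System.out.println("collections=" + delta.collections
            + " gcMillis=" + delta.totalMillis);
    }
}
```

A large `gcMillis` delta around a deployment would be consistent with what we saw, and would be a reason to look at the GC logs and heap settings before deploying under production load.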