I have a 2 node Ignite cluster on dev environment. Version 2.9.0 I have enabled persistence in my data region configs. When I start cluster it runs fine. Now when I shutdown node one by one and restart them, to check if persistence is working alright, I see below errors in one of the nodes.
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=ttl-cleanup-worker, threadName=ttl-cleanup-worker-#44%Gemini.dev%, blockedFor=3079s]
[15:04:59] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=ttl-cleanup-worker, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928019372]]]
Jan 05, 2023 3:05:08 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-0, threadName=sys-stripe-0-#1%Gemini.dev%, blockedFor=2490s]
[15:05:08] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-0, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928617333]]]
Jan 05, 2023 3:05:08 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-6, threadName=sys-stripe-6-#7%Gemini.dev%, blockedFor=2506s]
[15:05:08] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-6, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928602019]]]
Jan 05, 2023 3:05:17 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=db-checkpoint-thread, threadName=db-checkpoint-thread-#105%Gemini.dev%, blockedFor=3098s]
[15:05:17] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=db-checkpoint-thread, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928018434]]]
When these errors are shown some caches of cluster become unresponsive. I am attaching entire Ignite config at the end.
I also tried observing persistence related metrics, but I don't see any in JConsole in "Persistent Storage" section. As you can see in my config I have set metricsEnabled=true
in DR configs.
I took a threaddump too. Although I am unable to make out anything from it. There are many threads which are in WAITING(parking) state. Here is one snippet from it.
"db-checkpoint-thread-#105%Gemini.dev%" #165 prio=5 os_prio=0 cpu=86.94ms elapsed=2085.34s tid=0x00007f9a5c138000 nid=0x3d3b waiting on condition [0x00007f7eabf7c000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00000004018cc220> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
Some lines are cut to keep post under size limit
Below is my entire Ignite server config.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:util="http://www.springframework.org/schema/util"
xmlns="http://www.springframework.org/schema/beans"
xsi:schemaLocation="
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/util
http://www.springframework.org/schema/util/spring-util.xsd">
<bean id="propertyConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="systemPropertiesModeName" value="SYSTEM_PROPERTIES_MODE_FALLBACK"/>
<property name="searchSystemEnvironment" value="true"/>
</bean>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<!-- Set to true to enable distributed class loading for examples, default is false. -->
<property name="sslContextFactory">
<bean class="org.apache.ignite.ssl.SslContextFactory">
<property name="keyStoreFilePath" value="/home/sysSvcDevOps/ssl/ignite1.keystore.jks"/>
<property name="keyStorePassword" value="KeyStore443"/>
<property name="keyStoreType" value="jks"/>
<property name="trustStoreFilePath" value="/home/sysSvcDevOps/ssl/cacerts/java.cacerts.jks"/>
<property name="trustStorePassword" value="changeit"/>
<property name="trustStoreType" value="jks"/>
</bean>
</property>
<property name="igniteInstanceName" value=".dev"/>
<property name="consistentId" value="ignite1.dev"/>
<property name="workDirectory" value="/apps/Svc/dev/Ignite/IgniteData/persistentstore/work"/>
<property name="rebalanceThreadPoolSize" value="8"/>
<property name="publicThreadPoolSize" value="32"/>
<property name="systemThreadPoolSize" value="64"/>
<property name="queryThreadPoolSize" value="64"/>
<property name="failureDetectionTimeout" value="30000"/>
<property name="authenticationEnabled" value="true"/>
<property name="metricsUpdateFrequency" value="30000"/>
<property name="peerClassLoadingEnabled" value="false"/>
<property name="clientMode" value="false"/>
<!-- Enable task execution events for examples. -->
<property name="includeEventTypes">
<list>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STARTED"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STOPPED"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_PART_DATA_LOST"/>
<util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_NODES_LEFT"/>
</list>
</property>
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="walSegmentSize" value="1073741824"/>
<property name="walSegments" value="20"/>
<property name="maxWalArchiveSize" value="10737418240"/>
<property name="walCompactionEnabled" value="true"/>
<property name="walCompactionLevel" value="4"/>
<property name="checkpointFrequency" value="300000"/>
<property name="checkpointThreads" value="16"/>
<property name="checkpointReadLockTimeout" value="60000"/>
<property name="lockWaitTime" value="45000"/>
<property name="checkpointWriteOrder" value="RANDOM"/>
<property name="pageSize" value="4096"/>
<property name="writeThrottlingEnabled" value="true"/>
<!-- wal storage paths -->
<property name="walPath" value="/apps/Svc/dev/Ignite/IgniteData"/>
<property name="walArchivePath" value="/apps/Svc/dev/Ignite/IgniteDataArchive"/>
<property name="storagePath" value="/apps/Svc/dev/Ignite/IgniteData/archive"/>
<property name="dataRegionConfigurations">
<list>
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="name" value="dr.dev.referencedata"/>
<property name="persistenceEnabled" value="true"/>
<property name="initialSize" value="1073741824"/>
<property name="maxSize" value="4294969673"/>
<property name="checkpointPageBufferSize" value="1073741824"/>
</bean>
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="name" value="dr.dev.input"/>
<property name="persistenceEnabled" value="true"/>
<property name="metricsEnabled" value="true"/>
<property name="checkpointPageBufferSize" value="#{4 * 1024 * 1024 * 1024}"/>
<property name="initialSize" value="12884901888"/>
<property name="maxSize" value="81604378624"/>
<property name="pageEvictionMode" value="RANDOM_2_LRU"/>
</bean>
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="name" value="dr.dev.input.exception"/>
<property name="persistenceEnabled" value="true"/>
<property name="metricsEnabled" value="true"/>
<property name="checkpointPageBufferSize" value="#{4 * 1024 * 1024 * 1024}"/>
<property name="initialSize" value="4294967296"/>
<property name="maxSize" value="21474836480"/>
<property name="pageEvictionMode" value="RANDOM_2_LRU"/>
</bean>
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="name" value="dr.dev.output"/>
<property name="initialSize" value="1073741824"/>
<property name="persistenceEnabled" value="true"/>
<property name="metricsEnabled" value="true"/>
<property name="checkpointPageBufferSize" value="#{2 * 1024 * 1024 * 1024}"/>
<property name="maxSize" value="2147483648"/>
</bean>
</list>
</property>
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="name" value="default_region"/>
<property name="persistenceEnabled" value="true"/>
<property name="initialSize" value="268435456"/>
<property name="maxSize" value="268435456"/>
</bean>
</property>
</bean>
</property>
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi">
<property name="zkConnectionString" value="zk1.intranet.com:22001,zk2.intranet.com:22001"/>
<property name="zkRootPath" value="/ignite"/>
<property name="sessionTimeout" value="120000"/>
<property name="joinTimeout" value="10000"/>
</bean>
</property>
<property name="communicationSpi">
<bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
<property name="socketWriteTimeout" value="60000"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="referenceDataCacheTemplate*"/>
<property name="cacheMode" value="REPLICATED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.referencedata"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="statisticsEnabled" value="true"/>
<property name="sqlIndexMaxInlineSize" value="203"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="256"/>
</bean>
</property>
</bean>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="inputMetadataCacheTemplate*"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.input"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="statisticsEnabled" value="true"/>
<property name="readFromBackup" value="false"/>
<property name="sqlIndexMaxInlineSize" value="211"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="256"/>
<property name="affinityBackupFilter">
<bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
<constructor-arg>
<array value-type="java.lang.String">
<value>RACK_ID</value>
</array>
</constructor-arg>
</bean>
</property>
</bean>
</property>
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.ModifiedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="DAYS"/>
<constructor-arg value="5"/>
</bean>
</constructor-arg>
</bean>
</property>
</bean>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="inputReconCacheTemplate*"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.input"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="statisticsEnabled" value="true"/>
<property name="readFromBackup" value="false"/>
<property name="sqlIndexMaxInlineSize" value="211"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="256"/>
<property name="affinityBackupFilter">
<bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
<constructor-arg>
<array value-type="java.lang.String">
<value>RACK_ID</value>
</array>
</constructor-arg>
</bean>
</property>
</bean>
</property>
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="DAYS"/>
<constructor-arg value="4"/>
</bean>
</constructor-arg>
</bean>
</property>
</bean>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="inputExceptionsCacheTemplate*"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.input.exception"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="statisticsEnabled" value="true"/>
<property name="readFromBackup" value="false"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="256"/>
<property name="affinityBackupFilter">
<bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
<constructor-arg>
<array value-type="java.lang.String">
<value>RACK_ID</value>
</array>
</constructor-arg>
</bean>
</property>
</bean>
</property>
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="DAYS"/>
<constructor-arg value="15"/>
</bean>
</constructor-arg>
</bean>
</property>
</bean>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="outputDataCacheTemplate*"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.output"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="sqlSchema" value=""/>
<property name="statisticsEnabled" value="true"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="256"/>
<property name="affinityBackupFilter">
<bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
<constructor-arg>
<array value-type="java.lang.String">
<value>RACK_ID</value>
</array>
</constructor-arg>
</bean>
</property>
</bean>
</property>
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="DAYS"/>
<constructor-arg value="450"/>
</bean>
</constructor-arg>
</bean>
</property>
</bean>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="reconAuditDataCacheTemplate*"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.referencedata"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="sqlSchema" value=""/>
<property name="statisticsEnabled" value="true"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="256"/>
<property name="affinityBackupFilter">
<bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
<constructor-arg>
<array value-type="java.lang.String">
<value>RACK_ID</value>
</array>
</constructor-arg>
</bean>
</property>
</bean>
</property>
</bean>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="fileDataCacheTemplate*"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.input"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="statisticsEnabled" value="true"/>
<property name="queryParallelism" value="4"/>
<property name="eagerTtl" value="true"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="256"/>
<property name="affinityBackupFilter">
<bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
<constructor-arg>
<array value-type="java.lang.String">
<value>RACK_ID</value>
</array>
</constructor-arg>
</bean>
</property>
</bean>
</property>
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="DAYS"/>
<constructor-arg value="5"/>
</bean>
</constructor-arg>
</bean>
</property>
</bean>
<bean id="cache-template-bean" abstract="true"
class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="shortLivedReferenceDataTemplate*"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="backups" value="1"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="dataRegionName" value="dr.dev.input.exception"/>
<property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
<property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
<property name="statisticsEnabled" value="true"/>
<property name="managementEnabled" value="true"/>
<property name="affinity">
<bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
<property name="partitions" value="64"/>
<property name="affinityBackupFilter">
<bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
<constructor-arg>
<array value-type="java.lang.String">
<value>RACK_ID</value>
</array>
</constructor-arg>
</bean>
</property>
</bean>
</property>
<property name="expiryPolicyFactory">
<bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
<constructor-arg>
<bean class="javax.cache.expiry.Duration">
<constructor-arg value="DAYS"/>
<constructor-arg value="2"/>
</bean>
</constructor-arg>
</bean>
</property>
</bean>
</list>
</property>
<property name="sqlSchemas">
<list>
<value>dataInput</value>
</list>
</property>
</bean>
</beans>