Ignite system-critical thread blocked

I have a 2-node Ignite cluster (version 2.9.0) in a dev environment, with persistence enabled in my data region configs. When I start the cluster, it runs fine. But when I shut down the nodes one by one and restart them to check that persistence is working, I see the errors below on one of the nodes.

SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=ttl-cleanup-worker, threadName=ttl-cleanup-worker-#44%Gemini.dev%, blockedFor=3079s]

[15:04:59] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=ttl-cleanup-worker, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928019372]]]

Jan 05, 2023 3:05:08 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-0, threadName=sys-stripe-0-#1%Gemini.dev%, blockedFor=2490s]

[15:05:08] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-0, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928617333]]]

Jan 05, 2023 3:05:08 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-6, threadName=sys-stripe-6-#7%Gemini.dev%, blockedFor=2506s]

[15:05:08] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-6, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928602019]]]

Jan 05, 2023 3:05:17 PM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=db-checkpoint-thread, threadName=db-checkpoint-thread-#105%Gemini.dev%, blockedFor=3098s]

[15:05:17] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=db-checkpoint-thread, igniteInstanceName=Gemini.dev, finished=false, heartbeatTs=1672928018434]]]

When these errors appear, some of the cluster's caches become unresponsive. I am attaching the entire Ignite config at the end.

I also tried observing persistence-related metrics, but I don't see any in the "Persistent Storage" section in JConsole. As you can see in my config, I have set metricsEnabled=true in the data region (DR) configs.
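
For reference, DataStorageConfiguration has its own metricsEnabled flag for persistence-level metrics, separate from the per-region metricsEnabled flag on DataRegionConfiguration. The fragment below is a minimal sketch of where that flag would go, assuming the rest of the dataStorageConfiguration bean stays exactly as posted. Whether this explains the empty JConsole section on 2.9.0 is an assumption, not something confirmed here.

```xml
<property name="dataStorageConfiguration">
  <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
    <!-- Storage-level (persistence) metrics.
         This is a different flag from DataRegionConfiguration.metricsEnabled. -->
    <property name="metricsEnabled" value="true"/>

    <!-- ... existing WAL, checkpoint and data region settings unchanged ... -->
  </bean>
</property>
```

Note also that in the posted config the dr.dev.referencedata and default_region regions do not set the per-region flag, so region metrics would only be collected for the other three regions.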

I took a thread dump too, although I am unable to make anything out of it. Many threads are in the WAITING (parking) state. Here is one snippet from it.

"db-checkpoint-thread-#105%Gemini.dev%" #165 prio=5 os_prio=0 cpu=86.94ms elapsed=2085.34s tid=0x00007f9a5c138000 nid=0x3d3b waiting on condition  [0x00007f7eabf7c000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base/Native Method)
        - parking to wait for  <0x00000004018cc220> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base/ReentrantReadWriteLock.java:959)

Some lines are cut to keep the post under the size limit.

Below is my entire Ignite server config.

<?xml version="1.0" encoding="UTF-8"?>    
<beans xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xmlns="http://www.springframework.org/schema/beans"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/util
        http://www.springframework.org/schema/util/spring-util.xsd">

  <bean id="propertyConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="systemPropertiesModeName" value="SYSTEM_PROPERTIES_MODE_FALLBACK"/>
    <property name="searchSystemEnvironment" value="true"/>
  </bean>

  <bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- SSL/TLS settings (keystore and truststore) for this node. -->
    <property name="sslContextFactory">
      <bean class="org.apache.ignite.ssl.SslContextFactory">
          <property name="keyStoreFilePath" value="/home/sysSvcDevOps/ssl/ignite1.keystore.jks"/>
          <property name="keyStorePassword" value="KeyStore443"/>
          <property name="keyStoreType" value="jks"/>
          <property name="trustStoreFilePath" value="/home/sysSvcDevOps/ssl/cacerts/java.cacerts.jks"/>
          <property name="trustStorePassword" value="changeit"/>
          <property name="trustStoreType" value="jks"/>
      </bean>
    </property>
    <property name="igniteInstanceName" value=".dev"/>
    <property name="consistentId" value="ignite1.dev"/>
    <property name="workDirectory" value="/apps/Svc/dev/Ignite/IgniteData/persistentstore/work"/>

      <property name="rebalanceThreadPoolSize" value="8"/>
      <property name="publicThreadPoolSize" value="32"/>
      <property name="systemThreadPoolSize" value="64"/>
      <property name="queryThreadPoolSize" value="64"/>
      <property name="failureDetectionTimeout" value="30000"/>
      <property name="authenticationEnabled" value="true"/>
      <property name="metricsUpdateFrequency" value="30000"/>
      <property name="peerClassLoadingEnabled" value="false"/>
      <property name="clientMode" value="false"/>

    <!-- Enable selected cache lifecycle and rebalance events. -->
    <property name="includeEventTypes">
      <list>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STARTED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_STOPPED"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_REBALANCE_PART_DATA_LOST"/>
        <util:constant static-field="org.apache.ignite.events.EventType.EVT_CACHE_NODES_LEFT"/>
      </list>
    </property>

    <property name="dataStorageConfiguration">
      <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

        <property name="walSegmentSize" value="1073741824"/>
          <property name="walSegments" value="20"/>
          <property name="maxWalArchiveSize" value="10737418240"/>
          <property name="walCompactionEnabled" value="true"/>
          <property name="walCompactionLevel" value="4"/>
          <property name="checkpointFrequency" value="300000"/>
          <property name="checkpointThreads" value="16"/>
          <property name="checkpointReadLockTimeout" value="60000"/>
          <property name="lockWaitTime" value="45000"/>
          <property name="checkpointWriteOrder" value="RANDOM"/>
          <property name="pageSize" value="4096"/>
          <property name="writeThrottlingEnabled" value="true"/>

        <!-- wal storage paths -->
        <property name="walPath" value="/apps/Svc/dev/Ignite/IgniteData"/>
        <property name="walArchivePath" value="/apps/Svc/dev/Ignite/IgniteDataArchive"/>
        <property name="storagePath" value="/apps/Svc/dev/Ignite/IgniteData/archive"/>

        <property name="dataRegionConfigurations">
          <list>
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="dr.dev.referencedata"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="initialSize" value="1073741824"/>
                    <property name="maxSize" value="4294969673"/>
                    <property name="checkpointPageBufferSize" value="1073741824"/>
                </bean>
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="dr.dev.input"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="metricsEnabled" value="true"/>
                    <property name="checkpointPageBufferSize" value="#{4 * 1024 * 1024 * 1024}"/>
                    <property name="initialSize" value="12884901888"/>
                    <property name="maxSize" value="81604378624"/>
                    <property name="pageEvictionMode" value="RANDOM_2_LRU"/>
                </bean>
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="dr.dev.input.exception"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="metricsEnabled" value="true"/>
                    <property name="checkpointPageBufferSize" value="#{4 * 1024 * 1024 * 1024}"/>
                    <property name="initialSize" value="4294967296"/>
                    <property name="maxSize" value="21474836480"/>
                    <property name="pageEvictionMode" value="RANDOM_2_LRU"/>
                </bean>
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="dr.dev.output"/>
                    <property name="initialSize" value="1073741824"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="metricsEnabled" value="true"/>
                    <property name="checkpointPageBufferSize" value="#{2 * 1024 * 1024 * 1024}"/>
                    <property name="maxSize" value="2147483648"/>
                </bean>
          </list>
        </property>

        <property name="defaultDataRegionConfiguration">
          <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
              <property name="name" value="default_region"/>
              <property name="persistenceEnabled" value="true"/>
              <property name="initialSize" value="268435456"/>
              <property name="maxSize" value="268435456"/>

          </bean>
        </property>
      </bean>
    </property>

    <property name="discoverySpi">
      <bean class="org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi">
        <property name="zkConnectionString" value="zk1.intranet.com:22001,zk2.intranet.com:22001"/>
          <property name="zkRootPath" value="/ignite"/>
          <property name="sessionTimeout" value="120000"/>
          <property name="joinTimeout" value="10000"/>
      </bean>
    </property>

    <property name="communicationSpi">
      <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
          <property name="socketWriteTimeout" value="60000"/>
      </bean>
    </property>

    <property name="cacheConfiguration">
      <list>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="referenceDataCacheTemplate*"/>
              <property name="cacheMode" value="REPLICATED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.referencedata"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="statisticsEnabled" value="true"/>
              <property name="sqlIndexMaxInlineSize" value="203"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="256"/>


              </bean>
            </property>
          </bean>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="inputMetadataCacheTemplate*"/>
              <property name="cacheMode" value="PARTITIONED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.input"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="statisticsEnabled" value="true"/>
              <property name="readFromBackup" value="false"/>
              <property name="sqlIndexMaxInlineSize" value="211"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="256"/>

                <property name="affinityBackupFilter">
                          <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                            <constructor-arg>
                                                  <array value-type="java.lang.String">
                                                  <value>RACK_ID</value>
                                                     </array>
                             </constructor-arg>
                          </bean>
                </property>

              </bean>
            </property>
              <property name="expiryPolicyFactory">
                <bean class="javax.cache.expiry.ModifiedExpiryPolicy" factory-method="factoryOf">
                  <constructor-arg>
                    <bean class="javax.cache.expiry.Duration">
                      <constructor-arg value="DAYS"/>
                      <constructor-arg value="5"/>
                    </bean>
                  </constructor-arg>
                </bean>
              </property>
          </bean>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="inputReconCacheTemplate*"/>
              <property name="cacheMode" value="PARTITIONED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.input"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="statisticsEnabled" value="true"/>
              <property name="readFromBackup" value="false"/>
              <property name="sqlIndexMaxInlineSize" value="211"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="256"/>

                <property name="affinityBackupFilter">
                          <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                            <constructor-arg>
                                                  <array value-type="java.lang.String">
                                                  <value>RACK_ID</value>
                                                     </array>
                             </constructor-arg>
                          </bean>
                </property>

              </bean>
            </property>
              <property name="expiryPolicyFactory">
                <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
                  <constructor-arg>
                    <bean class="javax.cache.expiry.Duration">
                      <constructor-arg value="DAYS"/>
                      <constructor-arg value="4"/>
                    </bean>
                  </constructor-arg>
                </bean>
              </property>
          </bean>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="inputExceptionsCacheTemplate*"/>
              <property name="cacheMode" value="PARTITIONED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.input.exception"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="statisticsEnabled" value="true"/>
              <property name="readFromBackup" value="false"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="256"/>

                <property name="affinityBackupFilter">
                          <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                            <constructor-arg>
                                                  <array value-type="java.lang.String">
                                                  <value>RACK_ID</value>
                                                     </array>
                             </constructor-arg>
                          </bean>
                </property>

              </bean>
            </property>
              <property name="expiryPolicyFactory">
                <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
                  <constructor-arg>
                    <bean class="javax.cache.expiry.Duration">
                      <constructor-arg value="DAYS"/>
                      <constructor-arg value="15"/>
                    </bean>
                  </constructor-arg>
                </bean>
              </property>
          </bean>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="outputDataCacheTemplate*"/>
              <property name="cacheMode" value="PARTITIONED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.output"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="sqlSchema" value=""/>
              <property name="statisticsEnabled" value="true"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="256"/>

                <property name="affinityBackupFilter">
                          <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                            <constructor-arg>
                                                  <array value-type="java.lang.String">
                                                  <value>RACK_ID</value>
                                                     </array>
                             </constructor-arg>
                          </bean>
                </property>

              </bean>
            </property>
              <property name="expiryPolicyFactory">
                <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
                  <constructor-arg>
                    <bean class="javax.cache.expiry.Duration">
                      <constructor-arg value="DAYS"/>
                      <constructor-arg value="450"/>
                    </bean>
                  </constructor-arg>
                </bean>
              </property>
          </bean>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="reconAuditDataCacheTemplate*"/>
              <property name="cacheMode" value="PARTITIONED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.referencedata"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="sqlSchema" value=""/>
              <property name="statisticsEnabled" value="true"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="256"/>

                <property name="affinityBackupFilter">
                          <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                            <constructor-arg>
                                                  <array value-type="java.lang.String">
                                                  <value>RACK_ID</value>
                                                     </array>
                             </constructor-arg>
                          </bean>
                </property>

              </bean>
            </property>
          </bean>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="fileDataCacheTemplate*"/>
              <property name="cacheMode" value="PARTITIONED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.input"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="statisticsEnabled" value="true"/>
              <property name="queryParallelism" value="4"/>
              <property name="eagerTtl" value="true"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="256"/>

                <property name="affinityBackupFilter">
                          <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                            <constructor-arg>
                                                  <array value-type="java.lang.String">
                                                  <value>RACK_ID</value>
                                                     </array>
                             </constructor-arg>
                          </bean>
                </property>

              </bean>
            </property>
              <property name="expiryPolicyFactory">
                <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
                  <constructor-arg>
                    <bean class="javax.cache.expiry.Duration">
                      <constructor-arg value="DAYS"/>
                      <constructor-arg value="5"/>
                    </bean>
                  </constructor-arg>
                </bean>
              </property>
          </bean>
          <bean id="cache-template-bean" abstract="true"
                class="org.apache.ignite.configuration.CacheConfiguration">
              <property name="name" value="shortLivedReferenceDataTemplate*"/>
              <property name="cacheMode" value="PARTITIONED"/>
              <property name="backups" value="1"/>
              <property name="atomicityMode" value="ATOMIC"/>
              <property name="dataRegionName" value="dr.dev.input.exception"/>
              <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
              <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
              <property name="statisticsEnabled" value="true"/>
              <property name="managementEnabled" value="true"/>
            <property name="affinity">
              <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
                  <property name="partitions" value="64"/>

                <property name="affinityBackupFilter">
                          <bean class="org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter">
                            <constructor-arg>
                                                  <array value-type="java.lang.String">
                                                  <value>RACK_ID</value>
                                                     </array>
                             </constructor-arg>
                          </bean>
                </property>

              </bean>
            </property>
              <property name="expiryPolicyFactory">
                <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
                  <constructor-arg>
                    <bean class="javax.cache.expiry.Duration">
                      <constructor-arg value="DAYS"/>
                      <constructor-arg value="2"/>
                    </bean>
                  </constructor-arg>
                </bean>
              </property>
          </bean>

      </list>
    </property>

    <property name="sqlSchemas">
      <list>
        <value>dataInput</value>
      </list>
    </property>

  </bean>
</beans>
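
One detail in the config above may be worth checking. The checkpointPageBufferSize values are SpEL expressions over plain integer literals, and SpEL evaluates those with 32-bit int arithmetic. As a result, `4 * 1024 * 1024 * 1024` wraps around to 0 and `2 * 1024 * 1024 * 1024` wraps to a negative number. Below is a minimal sketch of the same property written with a long literal (the `L` suffix), using the dr.dev.input region as the example. Whether this is related to the blocked checkpoint thread is an assumption, not a confirmed diagnosis.

```xml
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
  <property name="name" value="dr.dev.input"/>
  <property name="persistenceEnabled" value="true"/>
  <!-- 4L forces 64-bit arithmetic, so the expression evaluates to
       4294967296 (4 GiB) instead of an int that has wrapped around to 0. -->
  <property name="checkpointPageBufferSize" value="#{4L * 1024 * 1024 * 1024}"/>
  <!-- ... other properties of this region unchanged ... -->
</bean>
```

Writing the value as a plain number, as the initialSize and maxSize properties in the same config already do, avoids the expression altogether.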