We are using a continuous query to transfer data across all clients nodes. However we have a scaling grid so we often run into issue where data node keeps trying to connect to client to send the data from continuous query which has already scaled down. This brings system to a halt as PME operation cannot acquire a lock, so topology doesn't get updated.
In order to resolve this, I want to use parameter TxTimeoutOnPartitionMapExchange
which will allow PME to proceed.
However in order to utilise this parameter, do i need to change atomicityMode
of my caches to transactional? If yes then, will the process of data node trying to send data from continuous query count as a transaction?
In summary, I am trying to work out if TxTimeoutOnPartitionMapExchange
parameter help in my situation with continuous query and what will be the steps to enable this parameter.
EDIT: Stacktrace of issue I am trying to solve:
- Continuous keeps trying to reserve the client and i believe it holds global lock here which blocks cache updates and checkpointing
:
Deadlock: false
Completed: 1999706
Thread [name="sys-stripe-6-#7%pv-ib-valuation%", id=42, state=WAITING, blockCnt=52537, waitCnt=734400]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3229)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:2100)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:2365)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1964)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1935)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1917)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1324)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1261)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:1059)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.access$600(CacheContinuousQueryHandler.java:90)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler$2.onEntryUpdated(CacheContinuousQueryHandler.java:459)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:447)
at o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2495)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2657)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2118)
- This starts coming up after reserveclient call is struck as it is unable to acquire a lock
:
>>> Possible starvation in striped pool.
Thread name: sys-stripe-4-#5%pv-ib-valuation%
Queue: []
Deadlock: false
Completed: 6328076
Thread [name="sys-stripe-4-#5%pv-ib-valuation%", id=40, state=WAITING, blockCnt=111790, waitCnt=2018248]
Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@66d8e343, ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1663)
at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2715)
at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2679)
at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1051)
at o.a.i.i.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:243)
at o.a.i.i.processors.cache.GridCacheUtils.unwindEvicts(GridCacheUtils.java:873)
at o.a.i.i.processors.cache.GridCacheIoManager.onMessageProcessed(GridCacheIoManager.java:1189)
So overall my analysis so far is that if a client is gone then continuous query keeps trying to connect holding a lock which blocks everything.
- Sample page locks dump. Its a similar page link dump everytime and all threads just seem to be waiting and not locked
:
Page locks dump:
Thread=[name=checkpoint-runner-#94%pv-ib-valuation%, id=162], state=WAITING
Locked pages = []
Locked pages log: name=checkpoint-runner-#94%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=checkpoint-runner-#95%pv-ib-valuation%, id=163], state=WAITING
Locked pages = []
Locked pages log: name=checkpoint-runner-#95%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=checkpoint-runner-#96%pv-ib-valuation%, id=164], state=WAITING
Locked pages = []
Locked pages log: name=checkpoint-runner-#96%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=checkpoint-runner-#97%pv-ib-valuation%, id=165], state=WAITING
Locked pages = []
Locked pages log: name=checkpoint-runner-#97%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-0-#15%pv-ib-valuation%, id=50], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-0-#15%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-1-#16%pv-ib-valuation%, id=51], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-1-#16%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-10-#25%pv-ib-valuation%, id=60], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-10-#25%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-11-#26%pv-ib-valuation%, id=61], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-11-#26%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-12-#27%pv-ib-valuation%, id=62], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-12-#27%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-13-#28%pv-ib-valuation%, id=63], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-13-#28%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-14-#29%pv-ib-valuation%, id=64], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-14-#29%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-15-#30%pv-ib-valuation%, id=65], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-15-#30%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-2-#17%pv-ib-valuation%, id=52], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-2-#17%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-3-#18%pv-ib-valuation%, id=53], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-3-#18%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-4-#19%pv-ib-valuation%, id=54], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-4-#19%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-5-#20%pv-ib-valuation%, id=55], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-5-#20%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-6-#21%pv-ib-valuation%, id=56], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-6-#21%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-7-#22%pv-ib-valuation%, id=57], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-7-#22%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-8-#23%pv-ib-valuation%, id=58], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-8-#23%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=data-streamer-stripe-9-#24%pv-ib-valuation%, id=59], state=WAITING
Locked pages = []
Locked pages log: name=data-streamer-stripe-9-#24%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=db-checkpoint-thread-#93%pv-ib-valuation%, id=161], state=TIMED_WAITING
Locked pages = []
Locked pages log: name=db-checkpoint-thread-#93%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=dms-writer-thread-#77%pv-ib-valuation%, id=145], state=WAITING
Locked pages = []
Locked pages log: name=dms-writer-thread-#77%pv-ib-valuation% time=(1674196038673, 2023-01-20 06:27:18.673)
Thread=[name=exchange-worker-#71%pv-ib-valuation%, id=139], state=TIMED_WAITING
Locked pages = []
Locked pages log: name=exchange-worker-#71%pv-ib-valuation% time=(1674196038673, 2023-01-20 06:27:18.673)
Thread=[name=lock-cleanup-0, id=278], state=WAITING
Locked pages = []
Locked pages log: name=lock-cleanup-0 time=(1674196038673, 2023-01-20 06:27:18.673)
Thread=[name=lock-cleanup-scheduled-0, id=171], state=WAITING
Locked pages = []
Locked pages log: name=lock-cleanup-scheduled-0 time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=main, id=1], state=WAITING
Locked pages = []
Locked pages log: name=main time=(1674196038673, 2023-01-20 06:27:18.673)
Thread=[name=query-#5729%pv-ib-valuation%, id=6455], state=WAITING
Locked pages = []
Locked pages log: name=query-#5729%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=query-#5730%pv-ib-valuation%, id=6456], state=WAITING
Locked pages = []
Locked pages log: name=query-#5730%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=query-#5735%pv-ib-valuation%, id=6461], state=WAITING
Locked pages = []
Locked pages log: name=query-#5735%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=query-#5736%pv-ib-valuation%, id=6462], state=WAITING
Locked pages = []
Locked pages log: name=query-#5736%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-0-#1%pv-ib-valuation%, id=36], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-0-#1%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-1-#2%pv-ib-valuation%, id=37], state=RUNNABLE
Locked pages = []
Locked pages log: name=sys-stripe-1-#2%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-10-#11%pv-ib-valuation%, id=46], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-10-#11%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-11-#12%pv-ib-valuation%, id=47], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-11-#12%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-12-#13%pv-ib-valuation%, id=48], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-12-#13%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-13-#14%pv-ib-valuation%, id=49], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-13-#14%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-2-#3%pv-ib-valuation%, id=38], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-2-#3%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-3-#4%pv-ib-valuation%, id=39], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-3-#4%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-4-#5%pv-ib-valuation%, id=40], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-4-#5%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-5-#6%pv-ib-valuation%, id=41], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-5-#6%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-6-#7%pv-ib-valuation%, id=42], state=RUNNABLE
Locked pages = []
Locked pages log: name=sys-stripe-6-#7%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-7-#8%pv-ib-valuation%, id=43], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-7-#8%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-8-#9%pv-ib-valuation%, id=44], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-8-#9%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=sys-stripe-9-#10%pv-ib-valuation%, id=45], state=WAITING
Locked pages = []
Locked pages log: name=sys-stripe-9-#10%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
Thread=[name=ttl-cleanup-worker-#62%pv-ib-valuation%, id=127], state=TIMED_WAITING
Locked pages = []
Locked pages log: name=ttl-cleanup-worker-#62%pv-ib-valuation% time=(1674196038674, 2023-01-20 06:27:18.674)
TxTimeoutOnPartitionMapExchange
is about rolling back active transactions to unlock a PME process. It won't magically unlock every PME that could be stuck due to different reasons.For sure, it's worth having this setting configured in any case. To enable it, you need to adjust your server nodes' configuration and set this property to some value, like 30 secs. Here is an example of XML changes.
Speaking of the original CQ issue with client disconnects, I'd expect Ignite to handle that automatically with no problems. In other words, I don't think the issue of a hung PME is caused by a continuous query itself, but rather by something else, like yes, active TXs without a timeout.
You don't need to change atomicyMode of your caches. Transactions can't be applied to a non-transactional cache (atomic).