Environment:
- Apache Ignite 2.4 running on Amazon Linux. VM is 16CPUs/122GB ram. There is plenty of room there.
- 5 nodes, 12GB each
cacheMode = PARTITIONED
backups = 0
OnheapCacheEnabled = true
atomicityMode = ATOMIC
rebalacneMode = SYNC
rebalanceBatchSize = 1MB
copyOnread = false
rebalanceThrottle = 0
rebalanceThreadPoolSize = 4
Basically we have a process that populates the cache on startup and then receives periodic updates from Kafka, propagating them to the cache.
The number of elements in the cache is more or less stable over time (there is just a little fluctuation since we have a mixture of create, update and delete events), but what we have noticed is that the distribution of data across the different nodes is very uneven, with one of the nodes having at least double the number of keys (and memory utilization) as the others. Over time, that node either runs out of memory, or starts doing very long GCs and loses contact with the rest of the cluster.
My expectation was that Ignite would balance the data across the different nodes, but reality shows something completely different. Am I missing something here? Why do we see this imbalance and how do we fix it?
Thanks in advance.
Bottom line, although our hash function had good distribution, the default affinity function was not yielding a good distribution of keys (and, consequently, memory) across the nodes in the cluster. We replaced it with a very naive one (
partition # % # of nodes
), and that improved the distribution quite a bit (less than 2% variance).This not a generic solution; it works for us because our entire cluster is in one VM and we don't use replication. For massive clusters crossing VM boundaries and replication, keeping the replicated data in separate servers is mandatory, and the naive approach won't cut it.