We are currently facing a problem with a WildFly/JGroups cluster in a Kubernetes environment. We have a varying number of WildFly (30.0.0) nodes that need to communicate with each other and form a cluster for ArtemisMQ JMS message handling. We use dns.DNS_PING for discovery in the cluster and TCP as the transport protocol for JGroups.
We use the following WildFly CLI commands to set up the JGroups cluster:
`echo "Kubernetes interface and bindings"
/interface=kubernetes:add(nic=eth0)
/interface=private:add(inet-address="${jboss.bind.address.private:127.0.0.1}")
/interface=dns:add(site-local-address=true)
/socket-binding-group=standard-sockets/socket-binding=jgroups-tcp:add(interface=dns, port=7800)
/socket-binding-group=standard-sockets/socket-binding=jgroups-tcp-fd:add(interface=dns, port=57800)
/socket-binding-group=standard-sockets/socket-binding=http:write-attribute(name=interface,value=dns)
/socket-binding-group=standard-sockets/socket-binding=https:write-attribute(name=interface,value=dns)

echo "JGroups"
/extension=org.jboss.as.clustering.jgroups:add()
/subsystem=jgroups:add()
#/subsystem=jgroups:write-attribute(name=default-stack,value=tcp)

echo "TCP stack"
batch
/subsystem=jgroups/stack=tcp:add()
#/subsystem=jgroups/stack=tcp:add
/subsystem=jgroups/stack=tcp/transport=TCP:add(socket-binding=jgroups-tcp)
/subsystem=jgroups/stack=tcp/protocol=MERGE3:add
/subsystem=jgroups/stack=tcp/protocol=FD_SOCK:add(socket-binding=jgroups-tcp-fd)
/subsystem=jgroups/stack=tcp/protocol=VERIFY_SUSPECT:add
/subsystem=jgroups/stack=tcp/protocol=pbcast.NAKACK2:add
/subsystem=jgroups/stack=tcp/protocol=UNICAST3:add
/subsystem=jgroups/stack=tcp/protocol=pbcast.STABLE:add
/subsystem=jgroups/stack=tcp/protocol=pbcast.GMS:add
/subsystem=jgroups/stack=tcp/protocol=MFC:add
/subsystem=jgroups/stack=tcp/protocol=FRAG3:add
run-batch

echo "JGroups Channel"
/subsystem=jgroups/channel=ee:add(stack=tcp)
/subsystem=jgroups/channel=ee:write-attribute(name=stack,value=tcp)
#/subsystem=jgroups/channel=ee:write-attribute(name=cluster,value=kubernetes)
/subsystem=jgroups:write-attribute(name=default-channel,value=ee)

echo "DNS_PING Protocol"
/subsystem=jgroups/stack=tcp/protocol=dns.DNS_PING:add(add-index=0,properties={dns_query="_ping._tcp.avaloq-wb-sync-manager-ping.namespace001.svc.cluster.local.",dns_record_type=SRV})`
The DNS_PING query points to a Kubernetes service that exposes the nodes we want to have in the cluster.
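To see which addresses that query hands back at a given moment (for example, whether it still lists pods that no longer exist), a small lookup along the following lines could be run from inside a pod. This is only a sketch: the class name `SrvLookup` and its helpers are made up for illustration, it uses the JDK's JNDI DNS provider with the platform's default resolver, and it is not necessarily the exact code path DNS_PING takes.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical diagnostic helper, not part of WildFly or JGroups.
public class SrvLookup {

    /** Turns one SRV record value ("priority weight port target") into "target:port". */
    static String toHostPort(String srvRecord) {
        String[] parts = srvRecord.trim().split("\\s+");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an SRV record: " + srvRecord);
        }
        String target = parts[3];
        if (target.endsWith(".")) {
            // Drop the trailing dot of the fully qualified name.
            target = target.substring(0, target.length() - 1);
        }
        return target + ":" + Integer.parseInt(parts[2]);
    }

    /** Resolves the SRV records for the given query using the default DNS servers. */
    static List<String> resolve(String query) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        env.put("java.naming.provider.url", "dns:");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attribute srv = ctx.getAttributes(query, new String[] {"SRV"}).get("SRV");
            List<String> members = new ArrayList<>();
            if (srv != null) {
                NamingEnumeration<?> values = srv.getAll();
                while (values.hasMore()) {
                    members.add(toHostPort(values.next().toString()));
                }
            }
            return members;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        // Defaults to the dns_query from the configuration above.
        String query = args.length > 0
                ? args[0]
                : "_ping._tcp.avaloq-wb-sync-manager-ping.namespace001.svc.cluster.local.";
        for (String member : resolve(query)) {
            System.out.println(member);
        }
    }
}
```

Comparing this output with the pods that are actually running would show whether discovery requests are being sent to addresses that cannot answer.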
Now, on a production deployment, DNS_PING is creating a massive number of threads. We also see that one thread blocks the others while it hangs in the native "PlainSocketImpl.socketConnect" method. We have sock_conn_timeout set to 300 milliseconds for JGroups, so a wait like this should not really happen.
Eventually WildFly cannot start any more threads (OS-level threads can no longer be created). We are still unsure what exactly causes this problem, but we assume the file descriptor limit might be reached. At that point we have around 4000 threads, of which approximately 75% are DNS_PING related.
The hanging thread looks like this:
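One way to check from inside the pod whether a plain TCP connect with a 300 ms timeout really returns that quickly is a standalone probe like the one below. It is a sketch under assumptions: `ConnectProbe` is a hypothetical helper, it only exercises `java.net.Socket.connect(SocketAddress, int)`, and it says nothing about whether JGroups passes sock_conn_timeout on this particular code path.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical diagnostic helper, not part of WildFly or JGroups.
public class ConnectProbe {

    /** Maps a connect failure to a short label. */
    static String classify(IOException e) {
        return (e instanceof SocketTimeoutException) ? "timeout" : "failed";
    }

    /** Attempts a TCP connect with the given timeout and reports the outcome. */
    static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Note: resolving the host name happens in the InetSocketAddress
            // constructor and is NOT covered by the connect timeout.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "connected";
        } catch (IOException e) {
            return classify(e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: ConnectProbe <host> <port> [timeoutMillis]");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        // 300 ms mirrors the sock_conn_timeout mentioned above.
        int timeoutMillis = args.length > 2 ? Integer.parseInt(args[2]) : 300;

        long start = System.nanoTime();
        String outcome = probe(host, port, timeoutMillis);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(outcome + " after " + elapsedMillis + " ms");
    }
}
```

Running it against an address on port 7800 that is suspected to be stale would show whether the connect attempt itself can exceed the timeout in this environment.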
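To put a number on how many threads belong to which group, a thread dump in the `"thread-name" => "..."` format used in this post can be summarised with a small program such as the following. It is a rough sketch: the class name `ThreadNameHistogram` and the grouping rule (cut the name at the first `-<number>` that is followed by a comma or the end of the name) are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical diagnostic helper, not part of WildFly or JGroups.
public class ThreadNameHistogram {

    // Matches lines such as: "thread-name" => "Timer temp thread-20460,ee,node-0",
    private static final Pattern NAME =
            Pattern.compile("\"thread-name\" => \"([^\"]*)\"");

    // A numeric counter followed by either a comma-separated suffix or the end of the name.
    private static final Pattern NUMERIC_SUFFIX = Pattern.compile("-\\d+(,.*)?$");

    /** Reduces a thread name to its group, e.g. "Timer temp thread-1,ee,x" to "Timer temp thread". */
    static String group(String threadName) {
        return NUMERIC_SUFFIX.matcher(threadName).replaceFirst("");
    }

    /** Counts threads per group across all lines of a dump. */
    static Map<String, Long> histogram(List<String> dumpLines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : dumpLines) {
            Matcher m = NAME.matcher(line);
            if (m.find()) {
                counts.merge(group(m.group(1)), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Reads the dump from standard input and prints "count<TAB>group".
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            List<String> lines = in.lines().collect(Collectors.toList());
            histogram(lines).forEach((name, count) -> System.out.println(count + "\t" + name));
        }
    }
}
```

Fed with a saved dump (for example `java ThreadNameHistogram < dump.txt`, where `dump.txt` is a placeholder file name), it prints one line per thread group, which makes it easy to track how the "Timer temp thread" count grows over time.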
{
"thread-id" => 109424945L,
"thread-name" => "Timer temp thread-20460,ee,avaloq-wb-sync-manager-0",
"thread-state" => "RUNNABLE",
"blocked-time" => -1L,
"blocked-count" => 1L,
"waited-time" => -1L,
"waited-count" => 1L,
"lock-info" => undefined,
"lock-name" => undefined,
"lock-owner-id" => -1L,
"lock-owner-name" => undefined,
"stack-trace" => [
{
"file-name" => "PlainSocketImpl.java",
"line-number" => -2,
"class-name" => "java.net.PlainSocketImpl",
"method-name" => "socketConnect",
"native-method" => true
},
{
"file-name" => "AbstractPlainSocketImpl.java",
"line-number" => 412,
"class-name" => "java.net.AbstractPlainSocketImpl",
"method-name" => "doConnect",
"native-method" => false
},
{
"file-name" => "AbstractPlainSocketImpl.java",
"line-number" => 255,
"class-name" => "java.net.AbstractPlainSocketImpl",
"method-name" => "connectToAddress",
"native-method" => false
},
{
"file-name" => "AbstractPlainSocketImpl.java",
"line-number" => 237,
"class-name" => "java.net.AbstractPlainSocketImpl",
"method-name" => "connect",
"native-method" => false
},
{
"file-name" => "SocksSocketImpl.java",
"line-number" => 392,
"class-name" => "java.net.SocksSocketImpl",
"method-name" => "connect",
"native-method" => false
},
{
"file-name" => "Socket.java",
"line-number" => 609,
"class-name" => "java.net.Socket",
"method-name" => "connect",
"native-method" => false
},
{
"file-name" => "Util.java",
"line-number" => 461,
"class-name" => "org.jgroups.util.Util",
"method-name" => "connect",
"native-method" => false
},
{
"file-name" => "TcpConnection.java",
"line-number" => 96,
"class-name" => "org.jgroups.blocks.cs.TcpConnection",
"method-name" => "connect",
"native-method" => false
},
{
"file-name" => "TcpConnection.java",
"line-number" => 88,
"class-name" => "org.jgroups.blocks.cs.TcpConnection",
"method-name" => "connect",
"native-method" => false
},
{
"file-name" => "BaseServer.java",
"line-number" => 295,
"class-name" => "org.jgroups.blocks.cs.BaseServer",
"method-name" => "getConnection",
"native-method" => false
},
{
"file-name" => "BaseServer.java",
"line-number" => 208,
"class-name" => "org.jgroups.blocks.cs.BaseServer",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "TCP.java",
"line-number" => 91,
"class-name" => "org.jgroups.protocols.TCP",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "BasicTCP.java",
"line-number" => 146,
"class-name" => "org.jgroups.protocols.BasicTCP",
"method-name" => "sendUnicast",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1638,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "sendToSingleMember",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1632,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "doSend",
"native-method" => false
},
{
"file-name" => "NoBundler.java",
"line-number" => 38,
"class-name" => "org.jgroups.protocols.NoBundler",
"method-name" => "sendSingleMessage",
"native-method" => false
},
{
"file-name" => "NoBundler.java",
"line-number" => 30,
"class-name" => "org.jgroups.protocols.NoBundler",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1620,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1353,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "_send",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1262,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "down",
"native-method" => false
},
{
"file-name" => "DNS_PING.java",
"line-number" => 189,
"class-name" => "org.jgroups.protocols.dns.DNS_PING",
"method-name" => "sendDiscoveryRequest",
"native-method" => false
},
{
"file-name" => "DNS_PING.java",
"line-number" => 182,
"class-name" => "org.jgroups.protocols.dns.DNS_PING",
"method-name" => "findMembers",
"native-method" => false
},
{
"file-name" => "Discovery.java",
"line-number" => 217,
"class-name" => "org.jgroups.protocols.Discovery",
"method-name" => "invokeFindMembers",
"native-method" => false
},
{
"file-name" => "Discovery.java",
"line-number" => 228,
"class-name" => "org.jgroups.protocols.Discovery",
"method-name" => "lambda$findMembers$0",
"native-method" => false
},
{
"file-name" => undefined,
"line-number" => -1,
"class-name" => "org.jgroups.protocols.Discovery$$Lambda$968/0x0000000840b0bc40",
"method-name" => "run",
"native-method" => false
},
{
"file-name" => "TimeScheduler3.java",
"line-number" => 324,
"class-name" => "org.jgroups.util.TimeScheduler3$Task",
"method-name" => "run",
"native-method" => false
},
{
"file-name" => "ContextReferenceExecutor.java",
"line-number" => 49,
"class-name" => "org.jboss.as.clustering.context.ContextReferenceExecutor",
"method-name" => "execute",
"native-method" => false
},
{
"file-name" => "ContextualExecutor.java",
"line-number" => 70,
"class-name" => "org.jboss.as.clustering.context.ContextualExecutor$1",
"method-name" => "run",
"native-method" => false
},
{
"file-name" => "Thread.java",
"line-number" => 829,
"class-name" => "java.lang.Thread",
"method-name" => "run",
"native-method" => false
}
],
"suspended" => false,
"in-native" => false,
"locked-monitors" => [{
"class-name" => "java.net.SocksSocketImpl",
"identity-hash-code" => 139076230,
"locked-stack-depth" => 1,
"locked-stack-frame" => {
"file-name" => "AbstractPlainSocketImpl.java",
"line-number" => 412,
"class-name" => "java.net.AbstractPlainSocketImpl",
"method-name" => "doConnect",
"native-method" => false
}
}],
"locked-synchronizers" => [{
"class-name" => "java.util.concurrent.locks.ReentrantLock$FairSync",
"identity-hash-code" => 740591308
}]
},
And a typical waiting thread:
{
"thread-id" => 109424946L,
"thread-name" => "Timer temp thread-20461,ee,avaloq-wb-sync-manager-0",
"thread-state" => "WAITING",
"blocked-time" => -1L,
"blocked-count" => 1L,
"waited-time" => -1L,
"waited-count" => 1L,
"lock-info" => {
"class-name" => "java.util.concurrent.locks.ReentrantLock$FairSync",
"identity-hash-code" => 740591308
},
"lock-name" => "java.util.concurrent.locks.ReentrantLock$FairSync@2c2486cc",
"lock-owner-id" => 109424945L,
"lock-owner-name" => "Timer temp thread-20460,ee,avaloq-wb-sync-manager-0",
"stack-trace" => [
{
"file-name" => "Unsafe.java",
"line-number" => -2,
"class-name" => "jdk.internal.misc.Unsafe",
"method-name" => "park",
"native-method" => true
},
{
"file-name" => "LockSupport.java",
"line-number" => 194,
"class-name" => "java.util.concurrent.locks.LockSupport",
"method-name" => "park",
"native-method" => false
},
{
"file-name" => "AbstractQueuedSynchronizer.java",
"line-number" => 885,
"class-name" => "java.util.concurrent.locks.AbstractQueuedSynchronizer",
"method-name" => "parkAndCheckInterrupt",
"native-method" => false
},
{
"file-name" => "AbstractQueuedSynchronizer.java",
"line-number" => 943,
"class-name" => "java.util.concurrent.locks.AbstractQueuedSynchronizer",
"method-name" => "doAcquireInterruptibly",
"native-method" => false
},
{
"file-name" => "AbstractQueuedSynchronizer.java",
"line-number" => 1263,
"class-name" => "java.util.concurrent.locks.AbstractQueuedSynchronizer",
"method-name" => "acquireInterruptibly",
"native-method" => false
},
{
"file-name" => "ReentrantLock.java",
"line-number" => 317,
"class-name" => "java.util.concurrent.locks.ReentrantLock",
"method-name" => "lockInterruptibly",
"native-method" => false
},
{
"file-name" => "BaseServer.java",
"line-number" => 277,
"class-name" => "org.jgroups.blocks.cs.BaseServer",
"method-name" => "getConnection",
"native-method" => false
},
{
"file-name" => "BaseServer.java",
"line-number" => 208,
"class-name" => "org.jgroups.blocks.cs.BaseServer",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "TCP.java",
"line-number" => 91,
"class-name" => "org.jgroups.protocols.TCP",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "BasicTCP.java",
"line-number" => 146,
"class-name" => "org.jgroups.protocols.BasicTCP",
"method-name" => "sendUnicast",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1638,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "sendToSingleMember",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1632,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "doSend",
"native-method" => false
},
{
"file-name" => "NoBundler.java",
"line-number" => 38,
"class-name" => "org.jgroups.protocols.NoBundler",
"method-name" => "sendSingleMessage",
"native-method" => false
},
{
"file-name" => "NoBundler.java",
"line-number" => 30,
"class-name" => "org.jgroups.protocols.NoBundler",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1620,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "send",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1353,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "_send",
"native-method" => false
},
{
"file-name" => "TP.java",
"line-number" => 1262,
"class-name" => "org.jgroups.protocols.TP",
"method-name" => "down",
"native-method" => false
},
{
"file-name" => "DNS_PING.java",
"line-number" => 189,
"class-name" => "org.jgroups.protocols.dns.DNS_PING",
"method-name" => "sendDiscoveryRequest",
"native-method" => false
},
{
"file-name" => "DNS_PING.java",
"line-number" => 182,
"class-name" => "org.jgroups.protocols.dns.DNS_PING",
"method-name" => "findMembers",
"native-method" => false
},
{
"file-name" => "Discovery.java",
"line-number" => 217,
"class-name" => "org.jgroups.protocols.Discovery",
"method-name" => "invokeFindMembers",
"native-method" => false
},
{
"file-name" => "Discovery.java",
"line-number" => 228,
"class-name" => "org.jgroups.protocols.Discovery",
"method-name" => "lambda$findMembers$0",
"native-method" => false
},
{
"file-name" => undefined,
"line-number" => -1,
"class-name" => "org.jgroups.protocols.Discovery$$Lambda$968/0x0000000840b0bc40",
"method-name" => "run",
"native-method" => false
},
{
"file-name" => "TimeScheduler3.java",
"line-number" => 324,
"class-name" => "org.jgroups.util.TimeScheduler3$Task",
"method-name" => "run",
"native-method" => false
},
{
"file-name" => "ContextReferenceExecutor.java",
"line-number" => 49,
"class-name" => "org.jboss.as.clustering.context.ContextReferenceExecutor",
"method-name" => "execute",
"native-method" => false
},
{
"file-name" => "ContextualExecutor.java",
"line-number" => 70,
"class-name" => "org.jboss.as.clustering.context.ContextualExecutor$1",
"method-name" => "run",
"native-method" => false
},
{
"file-name" => "Thread.java",
"line-number" => 829,
"class-name" => "java.lang.Thread",
"method-name" => "run",
"native-method" => false
}
],
"suspended" => false,
"in-native" => false,
"locked-monitors" => [],
"locked-synchronizers" => []
},
Has anyone experienced a similar problem?