Tactics to investigate performance degradation of a Java-based web app over time

I am working on an enterprise Java application that already contains a lot of tools/frameworks, such as Struts, JAX-RS and Spring MVC. It contains UIs and REST endpoints bundled together in a .war file. The project is evolving and we are getting rid of the older tools, aiming to stick with only Spring MVC/WebFlux.

The application performs searches over millions of XML/JSON records, and the search engine was recently switched from MarkLogic to Elasticsearch.

What we have noticed is that in production, under fairly light usage (up to 1.7k rpm on 2-4 application nodes), response times on some of the endpoints increase over time. Elasticsearch has room to grow and does not show any signs of heavy load. So currently we have to restart/replace production instances roughly once every week or two, when the average response time goes over 3 seconds instead of the regular 200-300 ms.

I tried to get CPU and heap flame graphs using async-profiler, but the load profile changes with every measurement because we have a bunch of features available, so I cannot really compare how the graphs change over time.

Can you advise me on some tactics/approaches for finding the right place in the code?

There is 1 answer below.


Found the issue. It is related to HTTP connection pooling (which in turn caused Tomcat threads to pile up).

What we noticed is that over time the number of active Tomcat threads was growing together with response times (chart: usage of the Tomcat thread pool over time). On the chart you can also see that the server was restarted on May 9th.
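
The active-thread numbers for a chart like this can be read from Tomcat's connector thread pool MBean over JMX. A minimal sketch of reading them (it has to run inside the Tomcat JVM or over a remote JMX connection; the MBean name "http-nio-8080" is an assumption that depends on the connector configuration and Tomcat version):

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Sketch: read Tomcat's connector thread pool stats via JMX.
    public final class TomcatThreadPoolProbe {

        public static String snapshot() throws Exception {
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            // The pool name depends on the connector (assumed here to be "http-nio-8080").
            ObjectName pool = new ObjectName("Catalina:type=ThreadPool,name=\"http-nio-8080\"");
            Object busy = mbs.getAttribute(pool, "currentThreadsBusy");
            Object total = mbs.getAttribute(pool, "currentThreadCount");
            Object max = mbs.getAttribute(pool, "maxThreads");
            return "busy=" + busy + " total=" + total + " max=" + max;
        }
    }

A metrics exporter polling these attributes periodically is enough to produce a chart like the one described above.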

I was able to get a heap dump before the server was restarted, and after some digging I found an interesting repeated piece in the thread dump:

Thread xxx
  at sun.misc.Unsafe.park(ZJ)V (Native Method)
  at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V (LockSupport.java:175)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()V (AbstractQueuedSynchronizer.java:2039)
  at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(Ljava/lang/Object;Ljava/lang/Object;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/Future;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:377)
  at org.apache.http.pool.AbstractConnPool.access$200(Lorg/apache/http/pool/AbstractConnPool;Ljava/lang/Object;Ljava/lang/Object;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/Future;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:67)
  at org.apache.http.pool.AbstractConnPool$2.get(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:243)
  at org.apache.http.pool.AbstractConnPool$2.get(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; (AbstractConnPool.java:191)
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(Ljava/util/concurrent/Future;JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/HttpClientConnection; (PoolingHttpClientConnectionManager.java:282)
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/HttpClientConnection; (PoolingHttpClientConnectionManager.java:269)
  at org.apache.http.impl.execchain.MainClientExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (MainClientExec.java:191)
  at org.apache.http.impl.execchain.ProtocolExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (ProtocolExec.java:185)
  at org.apache.http.impl.execchain.RetryExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (RetryExec.java:89)
  at org.apache.http.impl.execchain.RedirectExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (RedirectExec.java:111)
  at org.apache.http.impl.client.InternalHttpClient.doExecute(Lorg/apache/http/HttpHost;Lorg/apache/http/HttpRequest;Lorg/apache/http/protocol/HttpContext;)Lorg/apache/http/client/methods/CloseableHttpResponse; (InternalHttpClient.java:185)
  at org.apache.http.impl.client.CloseableHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;Lorg/apache/http/protocol/HttpContext;)Lorg/apache/http/client/methods/CloseableHttpResponse; (CloseableHttpClient.java:83)
  at org.apache.http.impl.client.CloseableHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse; (CloseableHttpClient.java:108)
  at io.searchbox.client.http.JestHttpClient.executeRequest(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse; (JestHttpClient.java:136)
  at io.searchbox.client.http.JestHttpClient.execute(Lio/searchbox/action/Action;Lorg/apache/http/client/config/RequestConfig;)Lio/searchbox/client/JestResult; (JestHttpClient.java:70)
  at io.searchbox.client.http.JestHttpClient.execute(Lio/searchbox/action/Action;)Lio/searchbox/client/JestResult; (JestHttpClient.java:63)
...

In our case we are using the Jest library to talk to Elasticsearch. Internally it uses Apache HttpClient and Apache HttpAsyncClient.

As you can see from the thread dump, this thread was waiting for an available connection in the HTTP client's connection pool, and there were more threads with exactly the same stack.
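
A quick way to confirm this pattern without digging through full dumps is to count the live threads whose stack contains the blocking pool call from the trace above. A minimal sketch using ThreadMXBean:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.Arrays;

    public final class PoolWaitCounter {

        // Counts live threads currently blocked in the Apache HttpClient connection pool,
        // i.e. threads whose stack contains AbstractConnPool.getPoolEntryBlocking.
        public static long countThreadsWaitingForConnection() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            ThreadInfo[] infos = threads.dumpAllThreads(false, false);
            return Arrays.stream(infos)
                    .filter(info -> info != null)
                    .filter(info -> Arrays.stream(info.getStackTrace())
                            .anyMatch(frame ->
                                    "org.apache.http.pool.AbstractConnPool".equals(frame.getClassName())
                                            && "getPoolEntryBlocking".equals(frame.getMethodName())))
                    .count();
        }
    }

Exposing a counter like this as a metric or health check would make the leak visible long before average response times reach 3 seconds.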

What I also discovered is that maxTotal (the maximum total number of connections) was set to 20 and defaultMaxPerRoute (the maximum number of connections per route) to 2:

By default the pool allows only 20 concurrent connections in total and two concurrent connections per a unique route. The two connection limit is due to the requirements of the HTTP specification. However, in practical terms this can often be too restrictive.

See the Connection pools description.
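
For context, these two limits live on Apache HttpClient's pooling connection manager. A standalone sketch showing the knobs with their default values (this is not how Jest wires the client internally, since Jest builds the HttpClient itself):

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    public final class PooledClientExample {

        public static CloseableHttpClient newClient() {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(20);           // maxTotal default: 20 connections in total
            cm.setDefaultMaxPerRoute(2);  // defaultMaxPerRoute default: 2 connections per route
            return HttpClients.custom()
                    .setConnectionManager(cm)
                    .build();
        }
    }

If all requests go to a single Elasticsearch endpoint, that is effectively a single route, so the per-route limit of 2 becomes the real ceiling, which matches the thread pile-up above.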

So the fix I applied was to increase those values to 50 and 40 respectively. I would still prefer to have these parameters unbounded and grow with usage, but for now we are sticking with these values.
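
Since Jest builds the underlying Apache HttpClient itself, the sizing is passed through its HttpClientConfig builder. Roughly what the fix looks like (a sketch; the URL is a placeholder and builder method names may differ slightly between Jest versions):

    import io.searchbox.client.JestClient;
    import io.searchbox.client.JestClientFactory;
    import io.searchbox.client.config.HttpClientConfig;

    public final class JestClientProvider {

        public static JestClient newClient() {
            JestClientFactory factory = new JestClientFactory();
            factory.setHttpClientConfig(new HttpClientConfig.Builder("http://elasticsearch:9200") // placeholder URL
                    .multiThreaded(true)
                    .maxTotalConnection(50)                // was 20
                    .defaultMaxTotalConnectionPerRoute(40) // was 2
                    .build());
            return factory.getObject();
        }
    }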