We still need the old WebSphere-based Kudos Boards from ISW. Users reported that AJAX operations like setting due dates or responsible people in a card are very slow, and some even abort before finishing. So far I can't see a pattern for why this occurs; it doesn't seem related to a specific user or client.
How can this be troubleshot? I'd like to take a step back and first measure how long it takes to process requests, then generate a list of all requests that took longer than a specific time to process, let's say more than one second. That way we know how many users are affected and whether they share something, like being on a slow network.
Collecting request duration logs from IHS
The Apache module mod_log_config allows certain properties to be logged. Interesting here is %T, the time taken to serve the request. Using the %{UNIT}T format, it can be logged in milliseconds instead of whole seconds, which is better for analyzing later. So I extended the common log format in conf/httpd.conf to write the processing time first:
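A minimal sketch of what such a LogFormat could look like, assuming the stock common log format as a base; the format name and exact field order are assumptions, not the original configuration (%{ms}T requires Apache 2.4.13 or later):

```apache
# conf/httpd.conf: prepend the request duration in milliseconds to the common log fields
LogFormat "%{ms}T %h %l %u %t \"%r\" %>s %b" common_with_duration
CustomLog logs/access_log common_with_duration
```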
Another good idea is to add the User-Agent too, which gives some information about the type of client:
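Again only a sketch, appending the User-Agent request header to the same assumed format string:

```apache
LogFormat "%{ms}T %h %l %u %t \"%r\" %>s %b \"%{User-Agent}i\"" common_with_duration
```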
First check the configuration file to make sure we didn't break anything:
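On IHS this is the usual apachectl syntax check; the install path below is an assumption and may differ on your system:

```sh
/opt/IBM/HTTPServer/bin/apachectl -t
# prints "Syntax OK" if the configuration parses cleanly
```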
Now we can do a graceful restart on production environments. This way, the webserver finishes all currently processed requests before applying our new config. In contrast to a hard restart, most users shouldn't even notice that we restarted their webserver.
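With the same assumed install path:

```sh
/opt/IBM/HTTPServer/bin/apachectl -k graceful
```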
All new requests are logged with the duration in ms like this:
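An illustrative line in the format sketched above might look like this; all values are made up and the request path is hypothetical:

```
1274 10.20.30.40 - - [12/Mar/2024:10:15:23 +0100] "POST /api/cards/update HTTP/1.1" 200 312 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
```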
Analyzing the measured data
We now have several hundred MB of logs from a single day. A filter is required to get useful information out of them. When looking at the logs, we need to do two things to find the slow requests: drop requests that are expected to take long anyway, and keep only those whose processing time exceeds our threshold.
I wrote the following command for this purpose:
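The original one-liner isn't reproduced here; the following is a sketch reconstructed from the explanation below, so the file names and the exact exclude pattern are assumptions:

```sh
log=slow_requests.log
# 1) drop requests we expect to be slow or uninteresting (push keep-alive, mp4 downloads)
# 2) keep only lines whose duration field has 4 or more digits, i.e. >= 1000 ms
# 3) sort by duration, slowest first, and save the result for review with `less $log`
egrep -v '/push|\.mp4' access_log | egrep '^[0-9]{4,}' | sort -rnk 1 > $log
wc -l $log
```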
The first egrep call filters out all requests which are not interesting here:

- /push is from the long keep-alive connections; Kudos does this as well. I also excluded mp4 movies because those files are naturally large. When downloading a 500 MB movie takes some time, it's nothing we need to worry about. Sometimes I also saw them with GET parameters like movie.mp4?logDownload=true&downloadType=view, so those were filtered as well.
- egrep '^[0-9]{4,}' sorts out requests whose processing time has fewer than 4 digits, which basically means we keep everything >= 1000 ms. Since I'm looking for heavy performance issues, this seems a good point to start.
- sort -rnk orders our log lines by duration descending, so we have the slowest entries at the beginning.
- wc -l $log is just to know how many slow entries we collected this time. Now they can be viewed with less $log.

Our large access log with over 2 million entries was reduced to about 2,000.