We recently upgraded one of our web servers from PHP 5.3 (Debian Squeeze package, using libmysqlclient and APC) to PHP 5.4 (Debian Wheezy, Dotdeb package, using mysqlnd, OPcache and APCu). After working fine for almost one day, we got "MySQL server has gone away" errors for every request. All other servers with the same load, which still run PHP 5.3 with libmysqlclient against the same MySQL server, had no problems at all. On all servers we use:
max_execution_time = 60
default_socket_timeout = 60
On our PHP 5.3 servers we did not change any MySQL/my.cnf timeouts. We know about problems with read_timeout (MySQL), wait_timeout (MySQL), default_socket_timeout (PHP) and max_execution_time (PHP), but only in the context of batch scripts with long-running queries. Our web servers usually respond in about 300 ms, so those timeouts should not be an issue here.
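For completeness, a quick way to check which timeout actually applies is to dump both sides from within a request; the following is only a sketch, and the connection parameters are placeholders:

<?php
// Sketch: print the PHP-side timeouts and the MySQL session timeouts that can
// close a connection. Host, credentials and database name are placeholders.
$mysqli = new mysqli('db.example.com', 'app_user', 'secret', 'app_db');

echo 'max_execution_time: ', ini_get('max_execution_time'), "\n";
echo 'default_socket_timeout: ', ini_get('default_socket_timeout'), "\n";

$result = $mysqli->query(
    "SHOW SESSION VARIABLES
     WHERE Variable_name IN ('wait_timeout', 'net_read_timeout', 'net_write_timeout')"
);
while ($row = $result->fetch_assoc()) {
    echo $row['Variable_name'], ': ', $row['Value'], "\n";
}
$result->free();
$mysqli->close();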
It became really strange when we removed the server from load balancing, so there was no load anymore, but we still had 180 busy Apache processes. Even an apache2ctl graceful did not change anything; even hours later, apache2ctl status said:
Apache Server Status for localhost
Server Version: Apache/2.2.22 (Debian)
Server Built: Jun 16 2014 03:51:14
__________________________________________________________________
Current Time: Tuesday, 22-Jul-2014 10:17:44 CEST
Restart Time: Monday, 21-Jul-2014 18:43:37 CEST
Parent Server Generation: 26
Server uptime: 15 hours 34 minutes 6 seconds
Total accesses: 596973 - Total Traffic: 1.6 GB
CPU Usage: u6288.72 s463.96 cu.01 cs0 - 12% CPU load
10.7 requests/sec - 30.8 kB/second - 2962 B/request
176 requests currently being processed, 99 idle workers
GGGGGG_GGGGGGGGG_GG_GGGGGGGGGGGGGGGGGGGG_GGGGGG_GGGGGGGG_GGGGGGG
GGGGGGGGGG_G_GGGGGGGG_G_GG__GGGGGG_GGGGG_GGG___GG_GGGGGGGG_G_GGG
GGGGGGGGGGGG_G_GG__GG_GGG_GGGGGGGGG__GGG_GGG_G_G_GG_G_GGGGGGGGGG
GGG_GGG_GG_GGG_GG_G_GGG_______________.___._W___________________
____.___________.______.........................................
................................................................
[remaining scoreboard rows omitted - all "." (open slots with no current process)]
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process
Only an apache2ctl restart solved the issue, and everything worked fine again. The MySQL error is the only "useful" error message we have found so far.
Could it be an issue with mysqlnd, OPcache or APCu in combination with PHP 5.4.30? Are there any known problems which could result in the behavior we have experienced? Or do you have an idea how to debug the "MySQL server has gone away" issue?
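One thing that would give more context is logging the mysqli error code and the elapsed time around failing queries (client error 2006 is "MySQL server has gone away"); the wrapper below is only a sketch with made-up names:

<?php
// Sketch: wrap query execution so a failure is logged together with the mysqli
// error code and how long the call took. run_query() is a made-up helper.
function run_query(mysqli $mysqli, $sql)
{
    $start = microtime(true);
    $result = $mysqli->query($sql);
    if ($result === false) {
        error_log(sprintf(
            'Query failed after %.1fs: errno=%d error="%s" sql=%s',
            microtime(true) - $start,
            $mysqli->errno,   // 2006 = MySQL server has gone away
            $mysqli->error,
            $sql
        ));
    }
    return $result;
}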
We probably found out why we get the "MySQL server has gone away" error: on the MySQL server we configured a wait_timeout of 30 seconds, which is less than the 60-second max_execution_time. So under certain conditions something seems to take more than 30 seconds while we are reading a result set from MySQL, and the server closes the connection while we are still trying to fetch data from it. That leads us to the next questions:
1. What function consumes so much time while we are in a loop reading a result set from MySQL?
2. Why does apache2ctl graceful not restart the Apache processes, even though max_execution_time should abort the scripts after 60 seconds?
I think the answer to both questions is a bug in APCu: if I look at the hanging Apache children with strace, they are blocked in FUTEX_WAIT, and if I attach gdb to such a process, it seems to hang in pthread_rwlock_wrlock().
, what I get is:OK, pthread_rwlock is used for locking in APCu, and problems with locking mechanism is a really good explanation for what we see here, in that we definitely have code which reads/writes to APCu inside of loops through MySQL result-sets, and if there is a problem with locking (which already has been a problem for us in the past with APC as well) it could take >30 and <60 seconds so the MySQL error is what we see. And after that situation something with APCu goes really wrong, so the php script can't be aborted by
max_execution_time
and not be restarted byapache2ctl graceful
anymore.In APCu issue tracker I could find very similar issues: https://github.com/krakjoe/apcu/issues/19
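To make the failure mode concrete, here is a minimal sketch of the kind of pattern we mean (not our actual code; connection parameters, table and key names are made up): per-row APCu reads/writes inside a loop over an unbuffered MySQL result set. If the APCu calls block on the lock for longer than wait_timeout, the server drops the connection while we are still fetching rows.

<?php
// Sketch with made-up names: every row of an unbuffered result set triggers APCu
// calls. If apcu_fetch()/apcu_store() stall on APCu's rwlock for longer than
// MySQL's wait_timeout (30s in our case), the server closes the connection and
// the next fetch fails with "MySQL server has gone away".
$mysqli = new mysqli('db.example.com', 'app_user', 'secret', 'app_db');

// Unbuffered query: rows are pulled from the server one by one inside the loop.
$result = $mysqli->query('SELECT id, payload FROM items', MYSQLI_USE_RESULT);

while ($row = $result->fetch_assoc()) {
    $key = 'item_' . $row['id'];

    // Per-row cache access: each call has to take APCu's lock.
    $cached = apcu_fetch($key, $hit);
    if (!$hit) {
        apcu_store($key, $row['payload'], 300); // 5 minute TTL
    }
}
$result->free();
$mysqli->close();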
But we found another hint: the crash always happens when there are about 70k keys in APCu, and it does not depend on apc.shm_size. We also found out that our APCu monitoring script produces "PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 78 bytes)" errors when calling apcu_cache_info() at line 47, at the same time as we see the crash. So we have to look into why the script consumes so much memory; as far as I remember it reads all the entry data to calculate memory fragmentation, perhaps we should remove that part...
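If the full entry list is really what exhausts the memory, the limited mode of the info functions might already be enough for monitoring; the following is just a sketch of that idea (it assumes the apcu_cache_info($limited)/apcu_sma_info($limited) signatures and the usual key names, which can differ between APCu versions), and it means giving up the fragmentation calculation:

<?php
// Sketch: with the $limited flag set, the info functions omit the big per-entry
// and per-block lists that blow up memory on caches with tens of thousands of keys.
$cache = apcu_cache_info(true); // summary only, no cache_list
$sma   = apcu_sma_info(true);   // summary only, no block_lists

printf(
    "hits: %d, misses: %d, cache mem: %d bytes, shm free: %d of %d bytes\n",
    $cache['num_hits'],
    $cache['num_misses'],
    $cache['mem_size'],
    $sma['avail_mem'],
    $sma['seg_size']
);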
But we had a lot of problems with APC in the past; we switched to APCu/OPcache only because we got segfaults with the latest APC and PHP 5.4.30, and the issue mentioned above has been open for a year now. We are happy to see recent activity on yac; perhaps a lockless cache is the more stable option. If we can't fix this by removing the problematic part from our monitoring script, we will switch to local memcached instances. It will be slower, but we know it's very stable.
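For reference, the local memcached fallback we have in mind would look roughly like this (server address, key and helper function are placeholders):

<?php
// Rough sketch of the planned fallback: one memcached instance per web server,
// accessed through the pecl "memcached" extension. Names below are placeholders.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$value = $mc->get('some_key');
if ($mc->getResultCode() === Memcached::RES_NOTFOUND) {
    $value = compute_expensive_value();  // hypothetical helper
    $mc->set('some_key', $value, 300);   // 5 minute TTL
}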