fwrite becomes slow after long uptime


Recently a production server that had been up for 50+ days began exhibiting slow fwrite() times. Sporadically, a single fwrite() of typically 300 to 2400 bytes would take 50 to 300 msec to complete. We spent a few days investigating, collecting stats, and trying a number of things. Finally, after we rebooted the system, the problem went away and the server is running normally again. Here are some notes:

-the system is a 16-core Xeon 2660 with one HDD and one SSD, running Ubuntu 12.04 with kernel 3.2.0-49-generic. The HDD is about 88% full and the SSD 75%. fstat() reports an optimal block size of 4096 for the HDD

-the application software running on the system consists of two different executables that start, run, and exit repeatedly. Each run lasts anywhere from a minute to several hours and continuously writes numerous wav files of various sizes

-both the HDD and the SSD exhibited the issue. Writes to a ramdisk were OK
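For reference, here is a minimal sketch of how per-call write latency could be collected. It times the fwrite() call (the copy into the stdio buffer) separately from fflush() (the write() system call into the page cache) so that a stall can be attributed to one stage or the other. The helper name `timed_write` and the `warn_ms` threshold are assumptions made for illustration, not taken from our application code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC readings. */
static double elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

/* Hypothetical helper: writes len bytes to fp, then flushes the stdio
 * buffer, timing both stages. Logs to stderr when the total reaches
 * warn_ms. Returns the total time in msec, or -1.0 on an I/O error. */
double timed_write(FILE *fp, const void *buf, size_t len, double warn_ms)
{
    struct timespec t0, t1, t2;

    assert(fp != NULL && buf != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fwrite(buf, 1, len, fp) != len)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (fflush(fp) != 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t2);

    double total = elapsed_ms(&t0, &t2);
    if (total >= warn_ms) {
        fprintf(stderr, "slow write: %zu bytes, fwrite %.3f ms, fflush %.3f ms\n",
                len, elapsed_ms(&t0, &t1), elapsed_ms(&t1, &t2));
    }
    return total;
}
```

A call such as `timed_write(fp, samples, nbytes, 50.0)` would log only the writes that hit the 50 msec range described above. On glibc versions older than 2.17, clock_gettime() requires linking with -lrt.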

My question: is there any known issue where the Linux I/O interface can reach a point, over time, where a single flush or other I/O operation takes 50 or even 300+ msec to complete?

We tried defragmenting both drives, setvbuf() variations, and non-blocking file descriptors (fcntl), without any change. After the reboot we see the same wav file extent counts as before, typically ranging from 1 to 10 depending on file size. The only hint seemed to be that we could occasionally catch a thread briefly showing a long I/O wait time or sitting in the "uninterruptible sleep" state. For that we used htop (turning on Detailed CPU Usage) and this command:

     for x in `seq 1 1 100`; do ps -eo state,tid,pid,cmd | grep "^D"; echo "----"; sleep 0.25; done

which would (occasionally) show something like "flush-252:0".
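The "flush-252:0" name looks like one of the kernel's per-device writeback threads, with the device's major:minor number in the name. Assuming that is what it is, one thing that could be logged next to the slow writes is how much dirty data is waiting for writeback at that moment. The sketch below reads a single field from /proc/meminfo. The helper name `meminfo_kb` is hypothetical. "Dirty" and "Writeback" are standard /proc/meminfo fields.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the value (in kB) of one /proc/meminfo
 * field such as "Dirty" or "Writeback", or -1 if it cannot be read. */
long meminfo_kb(const char *key)
{
    assert(key != NULL);

    FILE *fp = fopen("/proc/meminfo", "r");
    if (fp == NULL)
        return -1;

    char line[256];
    long value = -1;
    size_t klen = strlen(key);

    while (fgets(line, sizeof line, fp) != NULL) {
        /* Require the ':' so that "Writeback" does not match "WritebackTmp". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(fp);
    return value;
}
```

Logging `meminfo_kb("Dirty")` and `meminfo_kb("Writeback")` whenever a write exceeds the threshold would show whether the stalls coincide with a large writeback backlog.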

We looked through this thread on slow fwrites along with many other discussions, but did not find anything that helped other than the usual "probably if you reboot it will go away". That is of course good advice, but it doesn't prevent the next occurrence.

After the reboot, we went on a hunt for any left-over file handles not being closed by those two apps before terminating, and did find one case. My understanding is that this should not have an effect.
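A minimal sketch of how such a check could be done from inside a process on Linux, by listing /proc/self/fd just before exit. The helper name `count_open_fds` is hypothetical and this is not the exact code we used.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: prints every descriptor the calling process still
 * has open (excluding the one used for the listing itself) and returns
 * how many there are, or -1 if /proc/self/fd cannot be opened. */
int count_open_fds(void)
{
    DIR *d = opendir("/proc/self/fd");
    if (d == NULL)
        return -1;

    int self = dirfd(d);   /* descriptor backing this directory stream */
    int count = 0;
    struct dirent *e;

    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;      /* skip "." and ".." */

        int fd = atoi(e->d_name);
        if (fd == self)
            continue;

        char link[64];
        char target[4096];
        snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
        ssize_t n = readlink(link, target, sizeof target - 1);
        if (n < 0)
            n = 0;         /* target unknown, still count the descriptor */
        target[n] = '\0';

        fprintf(stderr, "fd %d -> %s\n", fd, target);
        count++;
    }
    closedir(d);
    return count;
}
```

Calling this at the end of main() and comparing the result against the count taken at startup would flag any file left open.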
