Write IO breakups on linux?

302 Views Asked by At

My application is using O_DIRECT for flushing 2MB worth of data directly to a 3-way-stripe storage (mounted as an lvm volume)..

I am getting a very pathetic write speed on this storage. The iostat shows that the large request size is being broken into smaller ones.

avgrq-sz is <20... There aren't much read on that drive.

It takes around 2 seconds to flush down 2MB worth of contiguous memory blocks (using mlock to assure that), sector aligned (using posix_memalign), whereas tests with dd and iozone rate the storage capable of > 20Mbps of write speed.

I would appreciate any clues on how to investigate this issue further.

PS: If this is not the right forum for this query, I would appreciate indicators to a one that could be helpful.

Thanks.

1

There are 1 best solutions below

0
On

Write IO breakups on linux?

The disk itself may have a maximum request size, there is a tradeoff being block size and latency (the bigger the request being sent to the disk the longer it will likely take to to be consumed) and there can be constraints on how much vectored I/O a driver can consume in a single request. Given all the above, the kernel is going to "break up" single requests that are too large when submitting further down the stack.

I would appreciate any clues on how to investigate this issue further.

Unfortunately it's hard to say why the avgrq-sz is so small (if its in sectors that about 10KBytes per I/O) without seeing the code actually submitting the I/O (maybe your program is submitting 10KByte buffers?). We also don't know if iozone and dd were using O_DIRECT during the questioners test. If they weren't then their I/O would have been going into the write back cache and then streamed out later and the kernel can do that in a more optimal fashion.

Note: Using O_DIRECT is NOT a go faster stripe. In the right circumstances O_DIRECT can lower overhead BUT writing O_DIRECTly to do disk increases the pressure on you to submit I/O in parallel (e.g. via AIO/io_uring or via multiple processes/threads) if you want to reach the highest possible throughput because you have robbed the kernel of its best way of creating parallel submission to the device for you.