When should I use REQ_OP_FLUSH in my kernel blockdev driver, and what is the expected behavior of the hardware that receives the REQ_OP_FLUSH (or equivalent SCSI cmd)?
In the Linux kernel, when a struct bio flagged as REQ_OP_FLUSH is passed to a RAID controller volume in writeback mode, is the RAID controller supposed to flush its dirty caches?
It seems to me that this is the purpose of REQ_OP_FLUSH but that is at odds with wanting to be fast with writeback: If the cache is battery-backed, shouldn't the controller ignore the flush?
In ext4's super.c, the ext4_sync_fs() function skips its call to blkdev_issue_flush() when barriers are disabled via the barrier=0 mount option. This seems to imply that RAID controllers will flush their caches when they are told to... but does RAID firmware ever break the rules?
- Is the flush behavior dependent on the firmware implementation and manufacturer?
- Where is the SAS/SCSI specification on the subject?
- Other considerations?
Christoph Hellwig on the linux-block mailing list said:
Keith Busch at kernel.org:
If this sounds backwards, then consider this using a RAID controller cache as an example:
1. A RAID controller with a non-volatile "writeback" cache (from the controller's perspective, ie, with battery) is a "write through" device as far as the kernel is concerned because the controller will return the write as complete as soon as it is in the persistent cache.
2. A RAID controller with a volatile "writeback" cache (from the controller's perspective, ie, without battery) is a "write back" device as far as the kernel is concerned because the controller will return the write as complete as soon as it is in the cache, but the cache is not persistent! So in that case flush/FUA is necessary.
[ Reference: https://lore.kernel.org/all/[email protected]/ ]
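The mapping between the controller's view and the kernel's view can be written down as a small truth table. The following is only an illustrative userspace sketch: the function names `kernel_cache_mode` and `kernel_sends_flush` are made up for this example and are not kernel APIs; the two strings are the values the kernel reports in queue/write_cache.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not a kernel API): given how the controller is
 * configured, return the mode the kernel should see in queue/write_cache. */
const char *kernel_cache_mode(bool controller_writeback, bool cache_nonvolatile)
{
    /* Only a volatile write-back cache can lose writes that were already
     * acknowledged, so only that case is "write back" to the kernel. */
    if (controller_writeback && !cache_nonvolatile)
        return "write back";
    return "write through";
}

/* Hypothetical helper: flush/FUA only matters when the kernel sees
 * the device as "write back". */
bool kernel_sends_flush(bool controller_writeback, bool cache_nonvolatile)
{
    return strcmp(kernel_cache_mode(controller_writeback, cache_nonvolatile),
                  "write back") == 0;
}
```

For example, `kernel_cache_mode(true, true)` (writeback with battery) yields "write through", while `kernel_cache_mode(true, false)` (writeback without battery) yields "write back".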
From personal experience, not all RAID controllers will properly set queue/write_cache as indicated by Keith above. If you know your array has a non-volatile cache running in write-back mode, then check /sys/block/<dev>/queue/write_cache to make sure it is in "write through" so flushes will be dropped, and fix it if it isn't in the proper mode.
These settings below might seem backwards, but if they do, then re-read #1 and #2 above because these are correct:
If you have a non-volatile cache (ie, with BBU), write "write through" to queue/write_cache.
If you have a volatile cache (ie, without BBU), write "write back" to queue/write_cache.
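If you want to check this from a program rather than by hand, something along these lines should work. It is a minimal userspace sketch, assuming the attribute file contains a single line with either "write through" or "write back"; the helper names `read_write_cache` and `write_cache_ok` are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a write_cache attribute file into buf, with the
 * trailing newline stripped. Returns 0 on success, -1 on any error. */
int read_write_cache(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Return true if the reported mode matches what the cache type calls for:
 * non-volatile cache -> "write through", volatile cache -> "write back". */
bool write_cache_ok(const char *path, bool cache_nonvolatile)
{
    char mode[32];

    if (read_write_cache(path, mode, sizeof mode) != 0)
        return false;
    return strcmp(mode, cache_nonvolatile ? "write through" : "write back") == 0;
}
```

A call would look like `write_cache_ok("/sys/block/sda/queue/write_cache", true)`, where `sda` is just a placeholder for your RAID volume.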
So the answer to the question about when to flag REQ_OP_FLUSH in your kernel code is this: whenever you think your code should commit to disk. Since the block layer can re-order any bio request, (1) submit your write IO and wait for its completion, then (2) submit a flush and wait for its completion, and then you are guaranteed to have the IO from #1 on disk.
However, if the device being written has queue/write_cache in "write through" mode, then the flush will complete immediately and it's up to your controller to do its job and keep the non-volatile cache intact, even after a power loss (BBU, supercap, flashcache, etc).
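The write-then-flush ordering described above has a rough userspace analog in write() followed by fsync(). The sketch below is not kernel code and the function name `write_then_flush` is made up for illustration. Inside the kernel the flush step would typically be blkdev_issue_flush(), whose signature has changed between kernel versions, so check your tree before copying anything.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to path, then ask the kernel to make them durable.
 * Returns 0 on success, -1 on error. */
int write_then_flush(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Step 1: issue the write and wait until all of it has been accepted. */
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Step 2: flush, and wait for the flush itself to complete. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Whether the fsync() actually reaches the device as a cache flush depends on the filesystem and its mount options (barrier=0 on ext4, for instance) and on what queue/write_cache reports.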