Let's say we have a custom backup service that follows the rsync snapshot approach suggested by Mike Rubel. To rotate the backups, this cp command is used:

```
cp -al source target
```
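For reference, the full Rubel-style rotation step looks roughly like this (the snapshot directory names are illustrative):

```
# Drop the oldest snapshot, shift the others down one slot...
rm -rf backup.3
mv backup.2 backup.3
mv backup.1 backup.2
# ...then hard-link the newest snapshot into place (the slow step),
cp -al backup.0 backup.1
# and finally bring backup.0 up to date against the live data.
rsync -a --delete /source/ backup.0/
```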
With this setup, I'm trying to rotate a 35GB directory that contains a lot of small files (~5KB-200KB), i.e. a very large directory tree. The problem is that it takes at least five hours. That seems like a lot to me, especially given the `-l` option, which creates hardlinks and so shouldn't copy any file data at all.
Is that behaviour normal with SATA disks? Could the `-al` flag combination be adding extra overhead to the cp command that results in this delay?
Thanks!
If the files are all around two gigabytes in size, I would think this is very slow. If the files are all around 200 bytes in size, I would think this is fast. Well, I don't actually know how small the files have to be before I would think that this speed is fast, but if they are all pretty tiny, your drive will be spending most of its time seeking, reading metadata, writing metadata, committing journals, and so forth.
But it sounds frustrating, either way.
A few ideas spring instantly to mind:
- You could turn off `atime` updates for the specific filesystem in question, if you don't use `atime` for anything. (Add the `noatime` option to that filesystem's line in your `fstab(5)` file; see `mount(8)`.) This would prevent a huge amount of very small scattered writes all over the 'reading' side of your copy operation. It might knock off some small percentage of the time. 5%? 10%? Maybe more? The plus side is that it takes only a few seconds to run `mount -o remount,noatime` and find out. :) (A minimal sketch follows this list.)
- You could use hardlinks instead of copies. (`cp(1)` mentions a `-l` command line option to use links -- I must sheepishly admit I've never tried it; I've always made my links with `ln(1)`, but doing so for hundreds of thousands of files sounds unfun. So try `-l` with `cp(1)` and report back. :) The benefits of using hardlinks are (a) saved disk space and (b) saved disk bandwidth -- only the metadata is read and written, which could be thousands of times faster. It might not be the tool you want, though; it really depends on how your applications modify data while the backup operation is running. (A quick inode check to verify the links is sketched below.)
- You could find a smarter replacement for the whole thing. `rsync` is an excellent tool, but not supremely brilliant. `git(1)` may be a smarter tool for your task; without making a full copy first, this might go much, much faster. (One common `rsync`-based refinement is also sketched below.)
- You could use some clever block device tricks: for example, LVM snapshots, to allow your backup operation to proceed in parallel with normal use, and then remove the snapshot when the backup is done. This ought to be significantly faster if there isn't much churn in your data; if there is a lot of churn, it might be only slightly better. Either way, it would let your rsync start almost immediately rather than on the other side of a five-hour window. (See the LVM sketch after this list.)
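A minimal sketch of the `noatime` experiment from the first idea; the mount point `/backup` is just a placeholder:

```
# Remount with noatime; stops the per-read metadata writes immediately.
mount -o remount,noatime /backup

# To make it permanent, add noatime to the options column of the
# filesystem's entry in /etc/fstab, for example:
#   /dev/sdb1  /backup  ext4  defaults,noatime  0  2
```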
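To verify that `cp -al` really produced hardlinks (second idea), compare inode numbers and link counts with GNU `stat`; the file name here is hypothetical:

```
cp -al source target
# Hardlinked files share an inode (%i) and their link count (%h) rises;
# if both lines show the same %i, no file data was duplicated.
stat -c '%i %h %n' source/somefile target/somefile
```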
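On the third idea, one widely used refinement (my suggestion, not something spelled out above) is to let `rsync` do the hardlinking itself with `--link-dest`, collapsing the `cp -al` pass and the transfer into a single step; the `/backups` paths are assumptions:

```
# Unchanged files become hardlinks into the previous snapshot instead of
# being re-copied; only changed files cost time and space.
rsync -a --delete --link-dest=/backups/backup.1 /source/ /backups/backup.0/
```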
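And a rough outline of the LVM snapshot idea; the volume group and volume names (`vg0`, `data`) and the 5G snapshot size are assumptions:

```
# Freeze a point-in-time view of the live volume; the size only needs to
# cover blocks that change while the backup runs, not a full copy.
lvcreate --size 5G --snapshot --name data-snap /dev/vg0/data
mount -o ro /dev/vg0/data-snap /mnt/snap

# Back up from the unchanging snapshot while the real volume stays in use.
rsync -a /mnt/snap/ /backups/backup.0/

umount /mnt/snap
lvremove -f /dev/vg0/data-snap
```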