Overhead of -a flag in cp command


Let's say we have a custom backup service that follows the rsync snapshot approach suggested by Mike Rubel. To rotate the backups, this cp command is used:

cp -al source target
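
For context, Mike Rubel's rotation scheme looks roughly like the sketch below. The directory names (backup.0 through backup.2, /data) are illustrative assumptions, not taken from the actual setup; the cp -al step is the one in question:

rm -rf backup.2                      # drop the oldest snapshot
mv backup.1 backup.2                 # shift the older snapshot down
cp -al backup.0 backup.1             # hard-link-copy the newest snapshot
rsync -a --delete /data/ backup.0/   # refresh the newest snapshot in place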

With this setup, I'm trying to rotate a 35 GB directory containing a lot of small files (~5 KB-200 KB), i.e. a very large directory tree. The problem is that it takes at least five hours, which seems like a lot to me, especially since the -l option is being used.

Is that behaviour normal with SATA disks? Could the -al flag combination be adding extra overhead to the cp command and causing this delay?

Thanks!


BEST ANSWER

If the files are all around two gigabytes in size, I would think this is very slow. If the files are all around 200 bytes in size, I would think this is fast. Well, I don't actually know how small the files have to be before I would think that this speed is fast, but if they are all pretty tiny, your drive will be spending most of its time seeking, reading metadata, writing metadata, committing journals, and so forth.

But it sounds frustrating, either way.

A few ideas spring instantly to mind:

  • You could turn off atime updates for the specific filesystem in question, if you don't use atime for anything. (Add the noatime mount(8) option to your fstab(5) file.) This would prevent a huge number of very small scattered writes all over the 'reading' side of your copy operation. This might knock off some small percentage of time. 5%? 10%? Maybe more? The plus side is that it takes only a few seconds to run mount -o remount,noatime and find out; see the example after this list. :)

  • You could use hardlinks instead of copies. (cp(1) mentions a -l command line option to use links -- I must sheepishly admit I've never tried it; I've always made my links with ln(1), but doing so for hundreds of thousands of files sounds unfun. So try -l with cp(1) and report back. :) The benefits of using hardlinks are (a) saved disk space and (b) saved disk bandwidth -- only the metadata is read/written, which could be thousands of times faster. It might not be the tool you want, though; it really depends upon how your applications modify data while the backup operation is running.

  • You could find a smarter replacement for the whole thing. rsync is an excellent tool, but not supremely brilliant. git(1) may be a smarter tool for your task. Since it doesn't need to make a full copy first, this might go much, much faster.

  • You could use some clever block device tricks: for example, LVM snapshots, to allow your backup operation to proceed in parallel with normal use, and remove the snapshot when the backup is done. This ought to be significantly faster if there isn't much churn in your data; if there is a lot of churn, it might be only slightly better. But it would let your rsync start almost immediately rather than on the other side of a five-hour window. (A rough sketch follows below.)
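
As a quick sketch of the noatime idea (the device and mount point below are made-up examples, not from the question):

mount -o remount,noatime /backup     # try it right away on the filesystem holding the tree

# to make it permanent, the matching fstab(5) entry would look something like:
/dev/sdb1  /backup  ext4  defaults,noatime  0  2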
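
And a rough sketch of the LVM snapshot idea (the volume group, logical volume, and mount point names are assumptions for illustration only):

lvcreate --size 5G --snapshot --name data_snap /dev/vg0/data   # reserve space for changes made during the backup
mount -o ro /dev/vg0/data_snap /mnt/snap                       # mount the frozen view read-only
rsync -a /mnt/snap/ /backup/backup.0/                          # back up from the snapshot while the live data keeps changing
umount /mnt/snap                                               # clean up once the backup is done
lvremove -f /dev/vg0/data_snap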