Judith Retief
2006-Sep-29 10:25 UTC
rsyncing many files and hard links: optimisation suggestions?
I suspect the standard optimisation - breaking up the rsync into smaller batches - is not going to work for us. This is our situation:

We rsync two directories in /spool to a backup. They are large: almost 2 million files in the first directory, with about 4 million hard links in the second one linking to them. The files don't change much; a few thousand are added daily, with about 3 hard links created for each. Another few hundred might be deleted or changed. We use:

    rsync -a -H --delete

This worked very well at first, but since reaching 1 million files performance has dropped dramatically.

Knowing about the memory problems when rsyncing lots of files, my first option was to break the rsync down into batches. I don't think this will work, though:

- Firstly: rsync only uses 30% of the memory, and no swap is used. So memory isn't the issue.
- Secondly: I think the hard links won't be created correctly. This is why. If I have hard links like so:

    /spool/foo/real-file
    /spool/bar/bar1/real-file -> /spool/foo/real-file
    /spool/bar/bar2/real-file -> /spool/foo/real-file

then

    rsync -H user@host:/spool/foo /spool
    rsync -H user@host:/spool/bar /spool

will result in _two_ copies of real-file on the client. And if the 'bar' rsync is split into two rsync batches:

    rsync -H user@host:/spool/bar/bar1 /spool/bar
    rsync -H user@host:/spool/bar/bar2 /spool/bar

I'm going to have three copies of real-file, rather than one copy and two hard links, isn't it?

When I do an strace on rsync on the client, it's almost invariably busy with lstat'ing the local drive. I guess this is the receiver building up its file list? And when the file list contains lots of hard links, does it have to sort all the files in one huge list?

If the problem is the actual disk access, then I can't think of anything to do. If it is the sorting, then cutting down the batch sizes should help, at the expense of having copies of some files rather than hard links. Or am I missing a major point here?
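One way around the split-batch problem, sketched here on the assumption that everything else under /spool can either be copied along or kept out with --exclude rules, is to put both directories into a single transfer rooted at /spool, since rsync only hard-links together files that appear in the same run:

    # single invocation covering foo and bar, so -H can hard-link across them
    rsync -a -H --delete user@host:/spool/ /spool/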
Jamie Lokier
2006-Sep-29 11:32 UTC
rsyncing many files and hard links: optimisation suggestions?
Judith Retief wrote:
> If the problem is the actual disk access, then I can't think of anything to
> do. If it is the sorting, then cutting down the batch sizes should help, at
> the expense of having copies of some files rather than hard links.

You can tell whether it's the disk accesses or the sorting/kernel by looking at the CPU usage while it's running (e.g. using "top").

If it's 100% CPU in user space, there may be scope for optimising rsync's code. If it's 100% CPU in the kernel, there's not a lot you can do - you might be able to use a different strategy for checking for changed files than just scanning them all, though. If it's not close to 100% total, it's disk I/O, in which case you might be able to preload the disk's inode tables into cache to reduce head-seek time. A program called "treescan" helps with that.

-- Jamie
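A rough sketch of that check on the receiving machine, assuming a Linux host with the usual procps tools (treescan itself is not shown here):

    # while rsync runs: high "us" = rsync's own code, high "sy" = kernel time,
    # high "wa" with a mostly idle CPU = waiting on the disk
    vmstat 5

    # crude stand-in for treescan: lstat every entry once so the inodes and
    # dentries are already cached when rsync starts
    ls -lR /spool > /dev/null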
Wayne Davison
2006-Sep-29 19:39 UTC
rsyncing many files and hard links: optimisation suggestions?
On Fri, Sep 29, 2006 at 12:10:51PM +0200, Judith Retief wrote:
> I'm going to have three copies of real-file, rather than one copy and
> two hard links, isn't it?

Yes, rsync can only hard-link together the files that are in a single transfer. What version of rsync are you running? Hard-linking got a lot more efficient in modern versions (it got a lot better in 2.6.1, and there have been some minor improvements since then).

> When I do an strace on rsync on the client, it's almost invariably
> busy with lstat'ing the local drive. I guess this is the receiver
> building up its file list?

If the client is the receiver, then it is either doing its --delete pass, or checking whether files are up to date. You'll save some disk I/O by switching from --delete to --delete-during (so that a directory gets checked for sending and deleting at about the same time).

..wayne..
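A quick sketch of both suggestions, reusing the single-transfer form from above (the --delete-during mode assumes a reasonably recent 2.6.x rsync on the client):

    rsync --version    # hard-link handling improved a lot from 2.6.1 onwards

    # delete in each directory as it is processed, instead of a separate pass
    rsync -a -H --delete-during user@host:/spool/ /spool/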