Lachlan Cranswick
2001-Nov-20 22:45 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
Is there any chance this can be added into the distribution? It sounds really nifty.

Another suggestion, unless I have misread the following: would it be useful to have a command option in rsync to generate the file list by doing the "find" and outputting it in a standard format? (This would make it less OS-specific or kludgy.)

Cheers,
Lachlan.

At 16:06 19/11/01 -0500, you wrote:
>I have attached a patch that adds 4 options to rsync that have helped
>me to speed up my mirroring. I hope this is useful to someone else,
>but I fear that my relative inexperience with rsync has caused me to
>miss a way to do what I want without having to patch the code. So please
>let me know if I'm all wet.
>
>Here's my story: I have a large filesystem (around 20 gigabytes of data)
>that I'm mirroring over a T1 link to a backup site. Each night,
>about 600 megabytes of data needs to be transferred to the backup site.
>Much of this data has been appended to the end of various existing files,
>so a tool like rsync that sends partial updates instead of the whole
>file is appropriate.
>
>Normally, one could just use rsync with the --recursive and --delete features
>to do this. However, this takes a lot more time than necessary, basically
>because rsync spends a lot of time walking through the directory tree
>(which contains over 300,000 files).
>
>One can speed this up by caching a listing of the directory tree. I maintain
>an additional state file at the backup site that contains a listing
>of the state of the tree after the last backup operation. This is essentially
>equivalent to saving the output of "find . -ls" in a file.
>
>Then, the next night, one generates the updated directory tree for the source
>file system and does a diff with the directory listing on the backup file
>system to find out what has changed. This seems to be much faster than
>using rsync's recursive and delete features.
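The cached-listing approach described in the quoted message can be sketched roughly as follows. This is only an illustration, not the script the patch author uses: the function names `snapshot` and `diff_snapshots` are hypothetical, and the choice to compare size and modification time is an assumption standing in for whatever fields of "find . -ls" the real state file keeps.

```python
import os
import stat


def snapshot(root):
    """Walk the tree once and record a small state tuple per entry.

    Keys are paths relative to root. Directories are stored without size
    or mtime so that they are not reported as changed just because their
    contents changed. Everything that is not a directory (including
    symlinks) is treated as a plain file in this sketch.
    """
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            rel = os.path.relpath(full, root)
            if stat.S_ISDIR(st.st_mode):
                state[rel] = ("d", 0, 0)
            else:
                state[rel] = ("f", st.st_size, int(st.st_mtime))
    return state


def diff_snapshots(old, new):
    """Compare last night's state with tonight's.

    Returns three sorted lists: added, changed, removed.
    """
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    removed = sorted(p for p in old if p not in new)
    return added, changed, removed
```

The added and changed lists are what would be handed to the transfer tool; the removed list is what a separate cleanup script would delete at the backup site. How the previous snapshot is stored between runs is left open here.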
>
>I have my own script and programs to delete any files that have been removed,
>and then I just need to update the files that have been added or changed.
>One could use cpio for this, but it's too slow when only partial files
>have changed.
>
>So I added the following options to rsync:
>
>  --source-list       SRC arg will be a (local) file name containing
>                      a list of files, or - to read file names from stdin
>  --null              used with --source-list to indicate that the
>                      file names will be separated by null (zero) bytes
>                      instead of linefeed characters; useful with
>                      gfind -print0
>  --send-dirs         send directory entries even though not in
>                      recursive mode
>  --no-implicit-dirs  do not send implicit directories (parents of
>                      the file being sent)
>
>The --source-list option allows me to supply an explicit list of filenames
>to transport without using the --recursive feature and without playing
>around with include and exclude files. I'm not really clear on whether
>the include and exclude files could have gotten me the same place, but it
>seems to me that they work hand-in-hand with the --recursive feature that
>I don't want to use.
>
>The --null flag allows me to handle files with embedded linefeeds. This
>is in the style of gnu find's -print0 operator.
>
>The --send-dirs overcomes a problem where rsync refuses to send directories
>unless it's in recursive mode. One needs this to make sure that even
>empty directories get mirrored.
>
>And the --no-implicit-dirs option turns off the default behavior in which
>all the parent directories of a file are transmitted before sending the
>file. That default behavior is very inefficient in my scenario where I
>am taking the responsibility for sending those directories myself.
>
>So, the patch is attached. If you think it's an abomination, please let
>me know what the better solution is. If you would like some elaboration
>on how this stuff really works, please let me know.
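To illustrate why the quoted message wants a null-separated list, here is a small sketch of the two list formats. It does not reproduce the patch's parsing code, which is not shown in this thread; `encode_list` and `decode_list` are hypothetical helper names, and the byte layout (one separator after every name) is an assumption modelled on what find's -print0 emits.

```python
import os


def encode_list(paths, null=False):
    """Serialize file names, each followed by a separator byte.

    With null=True the separator is a zero byte (as with find -print0);
    otherwise it is a linefeed.
    """
    sep = b"\0" if null else b"\n"
    return b"".join(os.fsencode(p) + sep for p in paths)


def decode_list(data, null=False):
    """Split a serialized list back into file names."""
    sep = b"\0" if null else b"\n"
    parts = data.split(sep)
    # The trailing separator leaves one empty element at the end.
    if parts and parts[-1] == b"":
        parts.pop()
    return [os.fsdecode(p) for p in parts]


names = ["a.txt", "odd\nname.txt"]

# Zero bytes cannot occur inside a file name, so the round trip is exact.
safe = decode_list(encode_list(names, null=True), null=True)

# With linefeeds as the separator, the second name is split into two entries.
broken = decode_list(encode_list(names))
```

In this sketch `safe` equals the original two names, while `broken` contains three entries because the embedded linefeed is indistinguishable from a separator.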
>
>Cheers,
>Andy
>
>Attachment Converted: C:\Eudora\Attach\rsync-2.4.6-srclist.patch
>-----------------------
Lachlan M. D. Cranswick
Collaborative Computational Project No 14 (CCP14)
    for Single Crystal and Powder Diffraction
Birkbeck University of London and Daresbury Laboratory
Postal Address: CCP14 - School of Crystallography,
    Birkbeck College, Malet Street, Bloomsbury, WC1E 7HX, London, UK
Tel: (+44) 020 7631 6849   Fax: (+44) 020 7631 6803
E-mail: l.m.d.cranswick@dl.ac.uk
WWW: http://www.ccp14.ac.uk/
Dave Dykstra
2001-Nov-21 08:55 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
On Tue, Nov 20, 2001 at 11:45:44AM +0000, Lachlan Cranswick wrote:
>
> Is there any chance this can be added into the distribution as it sounds
> really nifty.

I exchanged some off-list email with the patch author and, besides the fact that it adds too many options, I object to it because it only supports copying from the local side to remote, not also from remote to local. His option is essentially the same as the --files-from option that was discussed last January. See the thread in the archives beginning at
    http://lists.samba.org/pipermail/rsync/2001-January/003368.html

In summary, he can do pretty much what he wants by making an --include-from list that lists all the parent directories of the files he wants plus all the files he wants, and ending it with an --exclude '*'. Before rsync 2.4.0, however, I had an optimization (which I put in when I officially maintained rsync) that would directly read the included files in that situation rather than recurse through all the directories. The author of rsync, Andrew Tridgell, took that optimization out in 2.4.0 because he thought it was confusing that the optimization didn't require explicitly listing the parent directories like an --exclude '*' otherwise does, and I couldn't prove that recursing through the directories had a significant performance impact.

Later, people argued that a new option --files-from would be worth doing just for convenience even if not for performance, but I said I still wanted people to do some performance testing before I'd implement it. I wanted people to run version 2.3.2 on their systems and compare the time difference between running with and without my optimization, which you can force by simply putting a single wildcard in one included filename. I still want to write a --files-from option sometime, and I'm still waiting for somebody who has an application that could use it to do some performance measurements with rsync 2.3.2.
I agree that --files-from has value on its own without performance implications, but somebody has to want it badly enough to put in a little effort if they'd like me to implement it.

- Dave Dykstra
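The --include-from workaround described above (list every parent directory plus every wanted file, then exclude everything else) could be generated along these lines. This is a sketch under assumptions rather than a tested recipe: `include_patterns` is a hypothetical helper, the leading "/" (anchor at the top of the transfer) and trailing "/" (match directories only) follow my reading of the rsync man page's pattern rules, and file names containing wildcard characters such as * ? [ are not escaped.

```python
def include_patterns(files):
    """Build include patterns for the given relative file paths.

    Each parent directory is listed once, before anything inside it,
    followed by the file itself. Paths use "/" as the separator and are
    relative to the top of the transfer.
    """
    seen = set()
    patterns = []

    def add(pattern):
        if pattern not in seen:
            seen.add(pattern)
            patterns.append(pattern)

    for path in files:
        parts = path.strip("/").split("/")
        # Parent directories, outermost first: /a/, /a/b/, ...
        for i in range(1, len(parts)):
            add("/" + "/".join(parts[:i]) + "/")
        # The file itself, anchored the same way.
        add("/" + "/".join(parts))
    return patterns


wanted = ["a/b/c.txt", "a/d.txt"]
lines = include_patterns(wanted)
```

Written one per line to a file, such a list would be used as something like `rsync -a --include-from=LIST --exclude='*' SRC/ DEST/`. As the message above notes, from 2.4.0 onward rsync still recurses through the directories in this mode, so this gives the selection behaviour but not necessarily the speedup.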