Hi,

In my situation I'm using rsync to back up a server with (currently)
about 570,000 files. These are all little files, and maybe 0.1% of them
change or new ones are added in any 15-minute period.

I've split the main tree up so rsync can run on sub-subdirectories of
the main tree, and it does each of these sub-subdirectories
sequentially. I would have liked to run some of these in parallel, but
that seems to increase I/O on the main server too much.

Today I tried the following:

For all sub-subdirectories:
a) Fork a "du -s subsubdirectory" on the destination subsubdirectory
b) Run rsync on the subsubdirectory
c) Repeat until done

It seems to have improved the time it takes by about 25-30%. It looks
like the du can run ahead of the rsync, so that while rsync is building
its file list, the du is warming up the file cache on the destination.
Then when rsync looks to see what it needs to do on the destination, it
can do so more efficiently.

Looks like a keeper so far. Any other suggestions? (I was thinking of a
previous suggestion of setting /proc/sys/vm/vfs_cache_pressure to a low
value.)

Thanks,
Mike
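A minimal sketch of the loop described above, in plain sh. The paths,
the two-level glob, and the rsync options are illustrative assumptions,
not Mike's actual setup, and the destination is assumed to be locally
mounted (a remote destination would need the du run over ssh):

    #!/bin/sh
    # Fork "du -s" on the destination copy of each sub-subdirectory so
    # its metadata is in cache while rsync builds its file list.
    SRC=/data      # hypothetical source tree, split two levels deep
    DEST=/backup   # hypothetical destination tree, locally mounted

    for dir in "$SRC"/*/*/ ; do
        sub=${dir#"$SRC/"}                     # e.g. "a/b/"
        du -s "$DEST/$sub" >/dev/null 2>&1 &   # fork: warm dest cache
        rsync -a "$SRC/$sub" "$DEST/$sub"      # sync the same subtree
        wait                                   # reap du before looping
    done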
Darryl Dixon - Winterhouse Consulting
2009-Oct-16 03:13 UTC
Nice little performance improvement
> Hi,
>
> In my situation I'm using rsync to back up a server with (currently)
> about 570,000 files. These are all little files, and maybe 0.1% of
> them change or new ones are added in any 15-minute period.

Hi Mike,

We have three filesystems that between them have approx 22 million
files, and around 10-20,000 new or changed files every business day.
In order to expeditiously move these new files offsite, we use a
modified version of pyinotify to log all added/altered files across the
entire filesystem(s) and then every five minutes feed the list to rsync
with the --files-from option. This works very effectively and quickly.

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com
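A minimal sketch of how the rsync half of such a pipeline can look,
assuming a (hypothetical) watcher that appends source-relative paths to
/var/log/changed-files; the paths, host name, and error handling below
are illustrative assumptions, not Darryl's actual configuration:

    #!/bin/sh
    # Run every five minutes (e.g. from cron): rotate the change log
    # written by the inotify watcher and hand it to rsync.
    LOG=/var/log/changed-files   # hypothetical log, one path per line,
                                 # relative to /data
    BATCH=/tmp/changed.$$

    mv "$LOG" "$BATCH" || exit 0   # nothing logged since the last run
    : > "$LOG"                     # fresh log (assumes the watcher
                                   # reopens the file for each write)
    sort -u "$BATCH" -o "$BATCH"   # a file touched twice syncs once
    if rsync -a --files-from="$BATCH" /data/ backup:/data/; then
        rm -f "$BATCH"
    else
        cat "$BATCH" >> "$LOG"     # transfer failed: requeue the batch
    fi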
On Thu, 2009-10-15 at 19:07 -0700, Mike Connell wrote:
> Today I tried the following:
>
> For all sub-subdirectories:
> a) Fork a "du -s subsubdirectory" on the destination subsubdirectory
> b) Run rsync on the subsubdirectory
> c) Repeat until done
>
> It seems to have improved the time it takes by about 25-30%. It looks
> like the du can run ahead of the rsync, so that while rsync is
> building its file list, the du is warming up the file cache on the
> destination. Then when rsync looks to see what it needs to do on the
> destination, it can do so more efficiently.

Interesting. If you're not using incremental recursion (the default in
rsync >= 3.0.0), I can see that the "du" would help by forcing the
destination I/O to overlap the file-list building in time. But with
incremental recursion, the "du" shouldn't be necessary, because rsync
actually overlaps the checking of destination files with the file-list
building on the source.

--
Matt
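One way to test Matt's explanation: incremental recursion needs rsync
3.0.0 or later on both ends, and the 3.x manual documents
--no-inc-recursive (alias --no-i-r) for forcing the old whole-tree
behaviour, so the two modes can be timed against each other. The host
and paths here are placeholders:

    # Check what each end is running (both must be >= 3.0.0):
    rsync --version | head -n1
    ssh backuphost rsync --version | head -n1

    # Time one sub-subdirectory with incremental recursion disabled,
    # then with the default, to see whether the du trick still helps:
    time rsync -a --no-inc-recursive /data/a/b/ backuphost:/backup/a/b/
    time rsync -a                    /data/a/b/ backuphost:/backup/a/b/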
Hi,

> In order to expeditiously move these new files offsite, we use a
> modified version of pyinotify to log all added/altered files across
> the entire filesystem(s) and then every five minutes feed the list to
> rsync with the --files-from option. This works very effectively and
> quickly.

Interesting... How do you tell rsync to delete files that were deleted
from the source, or is that not part of your use case?

Thanks,
Mike