Dave Dykstra
2001-Nov-29 08:36 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
Rsync list: Alberto and I have done a couple more exchanges by private
email, and we found that he wasn't turning on my include/exclude
optimization in his test because he had an "exclude" directive in
rsyncd.conf. He has now removed that and run the test again. His very
interesting results are below with my comments.

Note that his case is rather pathological: he's got over a million files
in only 400 directories, so he must have an average of over 2500 files
per directory, which are very large directories. He's got about 65% of
the files explicitly listed in his --include-from file.

On Wed, Nov 28, 2001 at 03:18:24PM -0500, Alberto Accomazzi wrote:
...
> Both machines are SUN Ultra 80s (2x450 UII, 1GB RAM), on a rather busy LAN,
> so take results with a grain of salt:
>
> # synchronization of 1.1M files in 400 directories (135,000 to be updated):
>
> > date ; rsync-2.3.2 -avvzn rsync://adsfore.harvard.edu/test/ . ; date
> Wed Nov 28 10:01:15 EST 2001
> receiving file list at Wed Nov 28 10:01:17 2001
> done receiving file list at Wed Nov 28 10:27:17 2001
> [ ...list of approx 135,000 files to be updated or 2.4 MB... ]
> wrote 539699 bytes  read 17803469 bytes  11046.77 bytes/sec
> total size is 1137025227  speedup is 61.99
> Wed Nov 28 10:28:55 EST 2001
>
> # synchronization of 722,941 files (13MB) in bib.list from the same directory
>
> > date ; rsync-2.3.2 -avvzn --include-from bib.list --exclude '*' rsync://adsfore.harvard.edu/test/. . ; date
> Wed Nov 28 10:53:03 EST 2001
> sending exclude list at Wed Nov 28 11:56:22 2001
> done sending exclude list at Wed Nov 28 14:13:48 2001
> receiving file list at Wed Nov 28 14:13:48 2001
> (using include-only optimization) done receiving file list at Wed Nov 28 14:18:59 2001
> [ ...list of approx 3,200 files to be updated or 58 KB... ]
> wrote 16640660 bytes  read 10001913 bytes  2143.15 bytes/sec
> total size is 755673302  speedup is 28.36
> Wed Nov 28 14:20:15 EST 2001

Note the difference in total bytes written; presumably it was the exclude
list.

> The astonishing thing here is the time spent by the client in fiddling and
> sending the exclude list!  Just over 1hr to create the list in memory and
> more than 2hrs to send it.  When I trussed the process during the exclude
> list sending time this is what I saw for every file:
>
> [...]
> write(3, " + ", 2)                              = 2
> poll(0xFFBED858, 1, 60000)                      = 1
> write(3, " J 9 0 / J 9 0 - 0 5 8 6".., 17)     = 17
> poll(0xFFBED850, 1, 60000)                      = 1
> write(3, "13\0\0\0", 4)                         = 4
> [...]
>
> so it looks like sending the exclude list is quite inefficient and therefore
> --files-from should definitely not use this code to do the same thing.

Indeed, it definitely should be doing buffering! It looks like there's a
function io_start_buffering() that should be called. I don't know why it
isn't called until later, though, and there may be a good reason. It's
getting called in send_file_list() and in do_recv(), both of which are
called from client_run() in main.c after send_exclude_list(). Could you
play with calling it before send_exclude_list()? You're sending the list
from the receiver to the sender, so you're in the second half of
client_run(), the do_recv() part. You may possibly need to call io_flush()
more often, although I don't think so. Could that be the reason why
io_start_buffering() wasn't turned on earlier? Looks like buffering can be
disabled with io_end_buffering() if you need to.

> Also I'm sure that the 1hr spent building the exclude list can be
> greatly reduced by just slurping in the file list in memory.

Yes, I think that 1hr can be completely bypassed by reading the
--files-from file directly inside the send_exclude_list() function and
bypassing all the work done by make_exclude_list() to generate the
in-memory representation of the exclude patterns. I sure am glad you ran
this test, because otherwise I probably wouldn't have thought of doing
that.

Hmm, wait, the remote side would still be building the in-memory exclude
pattern representations. I guess that needs a short-circuit too.

> I guess the good news is how quickly the results came back from the server,
> which is where your optimization kicks in.

Yes, that only took 5 minutes!

> I've started one last test that won't trigger the optimization out of
> curiosity, although these numbers clearly show that most of the gain can
> be had by bypassing the include/exclude dance on the client side.

I expect the 5 minutes part will go up significantly and the rest will
stay the same. I'd like to know by how much.

- Dave Dykstra
Eric Whiting
2001-Nov-29 16:14 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
Dave Dykstra wrote:
> Note that his case is rather pathological because he's got over a million
> files in only 400 directories, so he must have an average of over 2500
> files per directory, which are very large directories. He's got about 65%
> of the files explicitly listed in his --include-from file.

I have over a million files I rsync to about a dozen locations every day.
I'm pretty sure I have more dirs, but not tons more. Most of the locations
are remote offices, but when I rsync over a local 100Mbit segment it still
takes about 2 hours just to verify the files/dirs on both sides in a
no-data-change situation. In other words, I'm interested in these different
optimizations as well. My clients and servers are a Solaris/Linux mix. I
don't have the 2GB of RAM on all boxes to support a single rsync, so it
gets broken down into a for loop across some top-level dirs.

eric
Alberto Accomazzi
2001-Nov-30 03:02 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
Here are some more results from my tests towards implementing --files-from.

I have modified (actually "hacked" is more appropriate here -- read on) the
source of rsync-2.3.2 to implement the command line option --files-from, in
an effort to test how well this feature would work in a real-case scenario
with lots of files to be transferred. I used version 2.3.2 because it
includes Dave's optimization on the server side, which sends the files
right back without attempting regular expression matching as is done with
include/exclude patterns.

This is what my modifications do:

- add a --files-from=FILE option, to read a list of files to be transferred
- modify send_exclude_list() in exclude.c to send the list of files to be
  transferred in addition to the regular exclude list, and fake an
  --exclude '*' just to turn on the optimization on the server side
- turn on buffering in send_exclude_list() before the list is sent

These are the new numbers from my latest test run with this patched version
of rsync-2.3.2; you should compare them to the ones I reported previously
below. The list I'm sending contains 722,941 files (13MB):

> date; rsync-2.3.2 -avvzn --files-from /tmp/bib.list \
    rsync://adsfore.harvard.edu/test/. .; date
Wed Nov 28 21:50:20 EST 2001
sending files-from list at Wed Nov 28 21:50:20 2001
done sending files-from list at Wed Nov 28 23:30:16 2001
receiving file list at Wed Nov 28 23:30:16 2001
(using include-only optimization) done receiving file list at Wed Nov 28 23:35:03 2001
[ ...list of over 3,000 files to be updated... ]
total: matches=0  tag_hits=0  false_alarms=0  data=0
wrote 16640660 bytes  read 10001913 bytes  4204.62 bytes/sec
total size is 755673302  speedup is 28.36
Wed Nov 28 23:35:55 EST 2001

These numbers show that reading the filenames this way, rather than using
the code in place to deal with the include/exclude list, cuts the startup
time down to 0 (from 1hr). The actual sending of the filenames is down from
2h 15m to 1h 40m.
The reason this isn't better is that turning buffering on only helps the
client, while the server still has to do unbuffered reads because of the
way the list is sent across. As far as I can tell there is no way to get
around the buffering problem without a protocol change or a different
approach to sending this list.

Given the data above, I think implementing --files-from this way would be
the wrong way to go, for a number of reasons:

- it's a hack to treat the list of files as an include list, and it
  prevents the correct use and implementation of other includes/excludes;
  I still think the two options should be orthogonal, so that saying:

      rsync --files-from=foo.list rsync://server/module .

  would be equivalent to:

      cat foo.list | xargs -n 1 -i rsync rsync://server/module/{} .

  except that in the first case we can do the transfer with one rsync call

- my patch currently implements this in a very inefficient way, given that
  the file list is sent across uncompressed and unbuffered; as the numbers
  show, this is a killer for applications that, as in my case, need to send
  large lists across

- the option only works on the client side, while it may be desirable to
  have the same option on the server side (just like we have for
  include/excludes), so that people could say --files-from=foo in the
  invocation of the remote rsync command, or even put a directive
  "files from = foo" in rsyncd.conf

I don't really understand the guts of rsync well enough to come up with the
right patch, but I hope that my ramblings will help move the discussion
forward. From what I see there is enough interest to have the option in
rsync, and I hope that we can get there, but right now I feel that a
half-baked job is not going to cut it, at least for the people like me who
have large file lists to move around.
If anybody is interested in taking a look at the patch, you can get it from
http://ads.harvard.edu/~alberto/rsync/rsync-2.3.2-files-from.patch

-- Alberto

In message <20011128153633.B2821@lucent.com>, Dave Dykstra writes:
> [...]

****************************************************************************
Alberto Accomazzi                        mailto:aaccomazzi@cfa.harvard.edu
NASA Astrophysics Data System                   http://adsabs.harvard.edu
Harvard-Smithsonian Center for Astrophysics     http://cfawww.harvard.edu
60 Garden Street, MS 83, Cambridge, MA 02138 USA
****************************************************************************