Dave Dykstra
2001-Nov-29 08:36 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
Rsync list: Alberto and I have done a couple more exchanges by private
email, and we found that he wasn't turning on my include/exclude
optimization in his test because he had an "exclude" directive in
rsyncd.conf. He has now removed that and run the test again. His very
interesting results are below with my comments.

Note that his case is rather pathological: he's got over a million files
in only 400 directories, so he must have an average of over 2500 files
per directory, which are very large directories. He's got about 65% of
the files explicitly listed in his --include-from file.

On Wed, Nov 28, 2001 at 03:18:24PM -0500, Alberto Accomazzi wrote:
...
> Both machines are SUN Ultra 80s (2x450 UII, 1GB RAM), on a rather busy LAN,
> so take results with a grain of salt:
>
> # synchronization of 1.1M files in 400 directories (135,000 to be updated):
>
> > date ; rsync-2.3.2 -avvzn rsync://adsfore.harvard.edu/test/ . ; date
> Wed Nov 28 10:01:15 EST 2001
> receiving file list at Wed Nov 28 10:01:17 2001
> done receiving file list at Wed Nov 28 10:27:17 2001
> [ ...list of approx 135,000 files to be updated or 2.4 MB... ]
> wrote 539699 bytes  read 17803469 bytes  11046.77 bytes/sec
> total size is 1137025227  speedup is 61.99
> Wed Nov 28 10:28:55 EST 2001
>
> # synchronization of 722,941 files (13MB) in bib.list from the same directory
>
> > date ; rsync-2.3.2 -avvzn --include-from bib.list --exclude '*' rsync://adsfore.harvard.edu/test/. . ; date
> Wed Nov 28 10:53:03 EST 2001
> sending exclude list at Wed Nov 28 11:56:22 2001
> done sending exclude list at Wed Nov 28 14:13:48 2001
> receiving file list at Wed Nov 28 14:13:48 2001
> (using include-only optimization) done receiving file list at Wed Nov 28 14:18:59 2001
> [ ...list of approx 3,200 files to be updated or 58 KB... ]
> wrote 16640660 bytes  read 10001913 bytes  2143.15 bytes/sec
> total size is 755673302  speedup is 28.36
> Wed Nov 28 14:20:15 EST 2001

Note the difference in total bytes written; presumably it was the exclude
list.

> The astonishing thing here is the time spent by the client in fiddling and
> sending the exclude list!  Just over 1hr to create the list in memory and
> more than 2hrs to send it.  When I trussed the process during the exclude
> list sending time this is what I saw for every file:
>
> [...]
> write(3, " + ", 2)                              = 2
> poll(0xFFBED858, 1, 60000)                      = 1
> write(3, " J 9 0 / J 9 0 - 0 5 8 6".., 17)     = 17
> poll(0xFFBED850, 1, 60000)                      = 1
> write(3, "13\0\0\0", 4)                         = 4
> [...]
>
> so it looks like sending the exclude list is quite inefficient and therefore
> --files-from should definitely not use this code to do the same thing.

Indeed, it definitely should be doing buffering! It looks like there's a
function io_start_buffering() that should be called. I don't know why it
isn't called until later, though, and there may be a good reason. It's
getting called in send_file_list() and in do_recv(), both of which are
called from client_run() in main.c after send_exclude_list(). Could you
play with calling it before send_exclude_list()? You're sending the list
from the receiver to the sender, so you're in the second half of
client_run(), the do_recv() part. You may possibly need to call io_flush()
more often, although I don't think so. Could that be the reason why
io_start_buffering() wasn't turned on earlier? Looks like buffering can be
disabled with io_end_buffering() if you need to.

> Also I'm sure that the 1hr spent building the exclude list can be
> greatly reduced by just slurping in the file list in memory.

Yes, I think that 1hr can be completely bypassed by reading the
--files-from file directly inside the send_exclude_list() function and
bypassing all the work done by make_exclude_list() to generate the
in-memory representation of the exclude patterns. I sure am glad you ran
this test, because otherwise I probably wouldn't have thought of doing
that.

Hmm, wait, the remote side would still be building the in-memory exclude
pattern representations. I guess that needs a short-circuit too.

> I guess the good news is how quickly the results came back from the server,
> which is where your optimization kicks in.

Yes, that only took 5 minutes!

> I've started one last test that won't trigger the optimization out of
> curiosity, although these numbers clearly show that most of the gain can
> be had by bypassing the include/exclude dance on the client side.

I expect the 5 minutes part will go up significantly and the rest will
stay the same. I'd like to know by how much.

- Dave Dykstra
Eric Whiting
2001-Nov-29 16:14 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
Dave Dykstra wrote:
> Note that his case is rather pathological because he's got over a million
> files in only 400 directories, so he must have an average of over 2500
> files per directory, which are very large directories. He's got about 65%
> of the files explicitly listed in his --include-from file.

I have over a million files I rsync to about a dozen locations every day.
I'm pretty sure I have more dirs, but not tons more. Most of the locations
are remote offices, but when I rsync over a local 100Mbit segment it still
takes about 2 hours just to verify the files/dirs on both sides in a
no-data-change situation. In other words, I'm interested in these different
optimizations as well. My clients and servers are a Solaris/Linux mix. I
don't have the 2GB of RAM on all boxes to support a single rsync, so it
gets broken down into a for loop across some top-level dirs.

eric
Alberto Accomazzi
2001-Nov-30 03:02 UTC
Rsync: Re: patch to enable faster mirroring of large filesystems
Here are some more results from my tests towards implementing --files-from.

I have modified (actually "hacked" is more appropriate here -- read on) the
source of rsync-2.3.2 to implement the command line option --files-from, in
an effort to test how well this feature would work in a real-case scenario
with lots of files to be transferred. I used version 2.3.2 because it
includes Dave's optimization on the server side, which sends the files
right back without attempting regular expression matching as is done with
include/exclude patterns.

This is what my modifications do:

- add a --files-from=FILE option, to read a list of files to be transferred
- modify send_exclude_list() in exclude.c to send the list of files to be
  transferred in addition to the regular exclude list, and fake an
  --exclude '*' just to turn on the optimization on the server side
- turn on buffering in send_exclude_list() before the list is sent

These are the new numbers from my latest test run with this patched version
of rsync-2.3.2; you should compare them to the ones I reported previously
below. The list I'm sending contains 722,941 files (13MB):

> date; rsync-2.3.2 -avvzn --files-from /tmp/bib.list \
    rsync://adsfore.harvard.edu/test/. .; date
Wed Nov 28 21:50:20 EST 2001
sending files-from list at Wed Nov 28 21:50:20 2001
done sending files-from list at Wed Nov 28 23:30:16 2001
receiving file list at Wed Nov 28 23:30:16 2001
(using include-only optimization) done receiving file list at Wed Nov 28 23:35:03 2001
[ ...list of over 3,000 files to be updated... ]
total: matches=0  tag_hits=0  false_alarms=0  data=0
wrote 16640660 bytes  read 10001913 bytes  4204.62 bytes/sec
total size is 755673302  speedup is 28.36
Wed Nov 28 23:35:55 EST 2001

These numbers show that reading the filenames this way, rather than using
the code in place to deal with the include/exclude list, cuts the startup
time down to 0 (from 1hr). The actual sending of the filenames is down from
2h 15m to 1h 40m.
The reason this isn't better is that turning buffering on only helps the
client, while the server still has to do unbuffered reads because of the
way the list is sent across. As far as I can tell there is no way to get
around the buffering problem without a protocol change or a different
approach to sending this list.

Given the data above, I think implementing --files-from this way would be
the wrong way to go, for a number of reasons:

- it's a hack to treat the list of files as an include list, and it
  prevents the correct use and implementation of other includes/excludes;
  I still think the two options should be orthogonal, so that saying:

      rsync --files-from=foo.list rsync://server/module .

  would be equivalent to:

      cat foo.list | xargs -n 1 -i rsync rsync://server/module/{} .

  except that in the first case we can do the transfer with one rsync call

- my patch currently implements this in a very inefficient way, given that
  the file list is sent across uncompressed and unbuffered; as the numbers
  show, this is a killer for applications that, as in my case, need to send
  large lists across

- the option only works on the client side, while it may be desirable to
  have the same option on the server side (just like we have for
  include/excludes), so that people could say --files-from=foo in the
  invocation of the remote rsync command, or even put a directive
  "files from = foo" in rsyncd.conf

I don't really understand the guts of rsync well enough to come up with the
right patch, but I hope that my ramblings will help move the discussion
forward. From what I see there is enough interest to have the option in
rsync, and I hope that we can get there, but right now I feel that a
half-baked job is not going to cut it, at least for the people like me who
have large file lists to move around.
If anybody is interested in taking a look at the patch, you can get it from
http://ads.harvard.edu/~alberto/rsync/rsync-2.3.2-files-from.patch

-- Alberto

In message <20011128153633.B2821@lucent.com>, Dave Dykstra writes:
> [...]

****************************************************************************
Alberto Accomazzi                        mailto:aaccomazzi@cfa.harvard.edu
NASA Astrophysics Data System                   http://adsabs.harvard.edu
Harvard-Smithsonian Center for Astrophysics     http://cfawww.harvard.edu
60 Garden Street, MS 83, Cambridge, MA 02138 USA
****************************************************************************