I am rsyncing 1 TB of data each day. I am finding in my testing that
actually removing the target files each day and then rsyncing is faster
than doing a compare of the source and target files and rsyncing over the
delta blocks. This is because we have a fast link between the two boxes,
and our disk is fairly slow. I am finding that the creation of the temp
file (the 'dot file') is actually the slowest part of the operation. This
has to be done for each file, because the timestamp and at least a couple
of blocks are guaranteed to have changed (Oracle files).

My question is this:

Is it possible to tell rsync to update the blocks of the target file
'in-place' without creating the temp file (the 'dot file')? I can
guarantee that no other operations are being performed on the file at the
same time. The docs don't seem to indicate such an option.

Thanks in advance..
-kg
On 2003-02-04T14:29:48, Kenny Gorman wrote:

> Is it possible to tell rsync to update the blocks of the target file
> 'in-place' without creating the temp file (the 'dot file')? I can
> guarantee that no other operations are being performed on the file at
> the same time. The docs don't seem to indicate such an option.

No, it's not possible, and making it possible would require a deep and
fundamental redesign and re-implementation of rsync; the result wouldn't
resemble the current program much.

Here's a sketch of the heart of the rsync algorithm (for finer details,
see the tech report available from [1]). Let's call the two endpoints
the sender (who has the newer version of the file) and the receiver (who
wants to update its older local copy to match the sender's). The
receiver computes checksums on each block of the destination file and
streams them to the sender. The sender finds all instances of any of
those blocks in the source file. Then the sender transmits instructions
to the receiver, describing how to build a spiffy new copy of the newer
source file, using a mixture of actual chunks of new content and blocks
taken from the older version of the file. The receiver follows these
instructions, copying blocks as needed from the old version and
combining them with the new bits to construct the new file. It's then
moved into place.

This algorithm by nature expects that the old version of the destination
file is used as a source of blocks in building the new version.
Adjusting this algorithm to work in-place is non-trivial.

-Bennett

[1] <URL:http://samba.anu.edu.au/rsync/tech_report/>
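[Editor's note: the block-matching scheme Bennett sketches can be made
concrete with a toy model. This is an illustration only, not rsync's
actual implementation — real rsync uses a 32-bit rolling checksum plus a
truncated MD4 digest and much larger blocks; here plain MD5 over 4-byte
blocks stands in for both.]

```python
import hashlib

BLOCK = 4  # toy block size; rsync uses hundreds of bytes or more


def block_sums(data):
    """Receiver: checksum each block of its old local copy."""
    return {
        hashlib.md5(data[i:i + BLOCK]).digest(): i
        for i in range(0, len(data), BLOCK)
    }


def delta(new, old_sums):
    """Sender: emit ('copy', offset) for blocks the receiver already
    has, and ('data', bytes) for literal new content.  This toy scans
    byte-by-byte, like rsync's rolling-window search."""
    out, lit, i = [], b"", 0
    while i < len(new):
        key = hashlib.md5(new[i:i + BLOCK]).digest()
        if len(new) - i >= BLOCK and key in old_sums:
            if lit:
                out.append(("data", lit))
                lit = b""
            out.append(("copy", old_sums[key]))
            i += BLOCK
        else:
            lit += new[i:i + 1]
            i += 1
    if lit:
        out.append(("data", lit))
    return out


def rebuild(old, instructions):
    """Receiver: construct the new file into a separate buffer.  This
    is why the temp 'dot file' exists -- the old file is still being
    read as a block source while the replacement is written."""
    buf = b""
    for op, arg in instructions:
        buf += old[arg:arg + BLOCK] if op == "copy" else arg
    return buf


old = b"the quick brown fox"
new = b"the quick green fox"
assert rebuild(old, delta(new, block_sums(old))) == new
```

Updating in place would mean `rebuild` overwrites `old` while `delta`'s
copy instructions still point into it, which is exactly the conflict
Bennett describes.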
I think the -W option might do what you describe here.

eric

Kenny Gorman wrote:

> I am rsyncing 1tb of data each day. I am finding in my testing that
> actually removing the target files each day then rsyncing is faster
> than doing a compare of the source->target files then rsyncing over
> the delta blocks. This is because we have a fast link between the two
> boxes, and that our disk is fairly slow. I am finding that the
> creation of the temp file (the 'dot file') is actually the slowest
> part of the operation. This has to be done for each file because the
> timestamp and at least a couple blocks are guaranteed to have changed
> (oracle files).
>
> My question is this:
>
> Is it possible to tell rsync to update the blocks of the target file
> 'in-place' without creating the temp file (the 'dot file')? I can
> guarantee that no other operations are being performed on the file at
> the same time. The docs don't seem to indicate such an option.
>
> Thx in advance..
> -kg
On Tue, 4 Feb 2003, Kenny Gorman kgorman-at-paypal.com |Rsync List| wrote:

> My question is this:
>
> Is it possible to tell rsync to update the blocks of the target file
> 'in-place' without creating the temp file (the 'dot file')?

It does not look like this is possible. In receiver.c around line 452
you can see:

    /* recv file data */
    recv_ok = receive_data(f_in, buf, fd2, fname, file->length);

fd2 looks to always be a file descriptor generated via do_mkstemp.

> I can guarantee that no other operations are being performed on the
> file at the same time. The docs don't seem to indicate such an
> option.

Rsync works best over low-bandwidth links for which the disk I/O is
minimal. You might try one of these ideas for your high-bandwidth
environment:

+ Hack receiver.c so that receive_data uses fd1 (the original file).
  - Also comment out finish_transfer, which does the rename and sets
    the permissions. If perms are important, then set them manually.
  - Check the dest file closely to make sure it's not mangled.
  - This is completely untested. If you're not comfortable with hacking
    and the usual followup debugging, then skip it.
+ Compile and run rsync with profiling libraries to make sure that it's
  slow where you think it's slow. I've been surprised before. ;-)
+ Use a "dumber" file transfer method (FTP, netcat) that will be faster
  on beefier hardware. Netcat is especially fast if you have a private
  network where you don't have to worry about pesky things like
  authentication.

For Oracle datafiles I've had excellent luck with a homebrew file
transfer that "compresses" blocks of zeros by sending a message that
means "Hey! I just read XXX blocks of nothing but zeroes." The receiver
then creates a sparse file on the destination for this area of zeroes.
It's very handy for making copies of the database for read-only
purposes, and it saves lots of disk space. It doesn't help at all vs.
something like netcat if your datafiles are mostly full, tho.
-- Steve

PS: You can get info on netcat from:
http://www.sans.org/rr/audit/netcat.php
On Tue, 4 Feb 2003, Steve Bonds wrote:

> You might try one of these ideas for your high-bandwidth environment:
> + hack receiver.c so that receive_data uses fd1 (the original file)
>   - also comment out finish_transfer, which does the rename and
>     sets the permissions. If perms are important, then set them
>     manually.
>   - check the dest file closely to make sure it's not mangled
>   - This is completely untested. If you're not comfortable with
>     hacking and the usual followup debugging, then skip it.

I just thought of several ways this could result in some nasty infinite
loops within rsync. You're probably better off skipping this "hack it"
section and focusing on something else. ;-)

-- Steve
On Tue, Feb 04, 2003 at 11:29:48AM -0800, Kenny Gorman wrote:

> I am rsyncing 1tb of data each day. I am finding in my testing that
> actually removing the target files each day then rsyncing is faster
> than doing a compare of the source->target files then rsyncing over
> the delta blocks. This is because we have a fast link between the two
> boxes, and that our disk is fairly slow. I am finding that the
> creation of the temp file (the 'dot file') is actually the slowest
> part of the operation. This has to be done for each file because the
> timestamp and at least a couple blocks are guaranteed to have changed
> (oracle files).

As others have mentioned, -W (--whole-file) will help here.

The reason the temp file is so slow is that it is reading blocks from
the disk and writing them to other blocks on the same disk. This means
every unchanged block must be transferred twice over the interface,
where changed blocks are transferred only once. If the files are very
large, this is guaranteed to cause a seek storm. Further, all of this
happens after the entire file has been read once to generate the block
checksums. Unless your tree is smallish, reads from the checksum pass
will have been flushed from cache by the time you do the final transfer.

--whole-file eliminates most of the disk activity. You no longer do the
block checksum pass, and you replace the local copying (read+write) with
a simple write from the network. Most likely your network is faster than
the disks. For files that change, but change very little, your disk
subsystem would have to be more than triple the speed of your network
for the rsync algorithm (as opposed to the utility) to be of benefit. If
the files change a lot, then you merely need double the speed.

--
J.W. Schultz            Pegasystems Technologies
email address:          jw@pegasys.ws

Remember Cernan and Schmitt
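[Editor's note: a quick back-of-the-envelope check of the "triple the
speed" figure, counting receiver-side disk traffic as described above:
one full read for the checksum pass, one read of each unchanged block as
it is copied into the temp file, and one full write of the new file.]

```python
def disk_bytes(file_size, changed_fraction):
    """Receiver-side disk traffic for a delta transfer, per the
    accounting in the message above."""
    checksum_read = file_size                       # checksum pass
    copy_read = file_size * (1 - changed_fraction)  # unchanged blocks
    write = file_size                               # build temp file
    return checksum_read + copy_read + write


GB = 1 << 30
whole_file = 1 * GB  # --whole-file: just one full write from the net

# Barely-changed file: ~3x the disk traffic of a plain copy, hence
# "more than triple the speed"; heavily-changed approaches 2x.
print(disk_bytes(1 * GB, 0.01) / whole_file)   # ~2.99
print(disk_bytes(1 * GB, 0.99) / whole_file)   # ~2.01
```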
> I am rsyncing 1tb of data each day. I am finding in my testing that
> actually removing the target files each day then rsyncing is faster
> than doing a compare of the source->target files then rsyncing over
> the delta blocks. This is because we have a fast link between the two
> boxes, and that our disk is fairly slow. I am finding that the
> creation of the temp file (the 'dot file') is actually the slowest
> part of the operation. This has to be done for each file because the
> timestamp and at least a couple blocks are guaranteed to have changed
> (oracle files).

How big are the individual files? If they are bigger than 1-2 GB, then
it is possible rsync is failing on the first pass and repeating the
file. You should be able to see this from the output of -vv (you will
see a message like "redoing fileName (nnn)"). The reason for this is
that the first-pass block checksum (32 bits of Adler + 16 bits of MD4)
is too small for large files. There was a long thread about this a few
months ago; the first message was from Terry Reed around mid-Oct 2002
("Problem with checksum failing on large files").

In any case, as you already note, if the network is fast and the disk is
slow, then copying the files will be faster. Rsync on the receiving side
reads each file 1-2 times and writes each file once, while copying just
requires a write on the receiving side.

Another comment: rsync doesn't buffer its writes, so each write is a
block (as little as 700 bytes, or up to 16K for big files). Buffering
the writes might help. There is an optional buffering patch
(patches/craigb-perf.diff) included with rsync 2.5.6 that improves the
write buffering, plus other I/O buffering. That might improve the write
performance, although so far significant improvements have only been
seen on cygwin.

Craig
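[Editor's note: a rough birthday-style estimate of why the 48-bit
first-pass checksum starts failing around 1-2 GB. The model is crude --
it treats the checksum as a uniform 48-bit hash and counts the sender's
rolling search as one trial per byte offset per candidate block -- but
it lands in the right ballpark.]

```python
def expected_false_matches(file_size, block_size, sum_bits=48):
    """Expected number of spurious block matches: the sender tries a
    checksum at roughly every byte offset against every one of the
    receiver's block checksums."""
    n_blocks = file_size // block_size
    trials = file_size * n_blocks   # offsets x candidate blocks
    return trials / 2.0 ** sum_bits


GB = 1 << 30
# At ~2 GB with 16 KB blocks the expected count reaches ~1, matching
# the observed failures on files over 1-2 GB.
print(expected_false_matches(2 * GB, 16 * 1024))  # -> 1.0
```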