thr3ads.net - rsync - --fuzzy question [May 2009]

Julian Pace Ross

2009-May-14 09:15 UTC

--fuzzy question

Hi,
I have a file that changes slightly in size every day and has a timestamp
appended to its name. For example, on the 14th of May:

MybackedUpFileBlabla_200905140219.bak

This is transferred by rsync to another server.
The next day that file is deleted and replaced by a new file on the
sender. The new file would be named, for example (15th May):

MybackedUpFileBlabla_200905150221.bak

The new file will generally be slightly larger, but the containing
directory is exactly the same.
I was hoping to use --fuzzy and --delete-after, but it doesn't seem to be
speeding up the transfer. I am assuming that this is because I have both a
change in name AND a change in size/modtime?

I was looking into the find_fuzzy function, but I'm not sure if there's
anything I can tweak in there to make this work.

Thanks for any help
Julian
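
For readers unfamiliar with the option: --fuzzy lets rsync use a similarly named file in the destination directory as the basis for a file that is missing there. The sketch below only illustrates the idea of picking a basis by name similarity. It uses Python's difflib rather than rsync's own find_fuzzy scoring, and the function name and file list are hypothetical.

```python
import difflib

def pick_basis(new_name, existing_names):
    """Return the existing name most similar to new_name, or None if the list is empty.

    Illustration only: difflib's ratio stands in for rsync's own name scoring.
    """
    best, best_ratio = None, 0.0
    for name in existing_names:
        ratio = difflib.SequenceMatcher(None, new_name, name).ratio()
        if ratio > best_ratio:
            best, best_ratio = name, ratio
    return best

# Hypothetical destination directory contents.
existing = [
    "MybackedUpFileBlabla_200905140219.bak",
    "notes.txt",
    "OtherDb_200905140230.bak",
]
basis = pick_basis("MybackedUpFileBlabla_200905150221.bak", existing)
print(basis)
```

A changed timestamp in the name alters only a few characters, so yesterday's file is still by far the closest candidate.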

Julian Pace Ross

2009-May-15 15:14 UTC

Not sure if this got through to the list, as I haven't received a copy
back as I usually do...


Ryan Malayter

2009-May-20 04:12 UTC

On Thu, May 14, 2009 at 4:10 AM, Julian Pace Ross <linux@prisma.com.mt> wrote:
> Hi,
> I have a file that changes slightly in size every day and has the timestamp
> appended to it.. for example on the 14th may:
> MybackedUpFileBlabla_200905140219.bak
> This is transferred by rsync to another server.
> The next day that file is deleted and substituted by a new file on the
> sender.. the new file would be named for example (15th May):
> MybackedUpFileBlabla_200905150221.bak
> The new file will be generally slightly larger in size, but the containing
> directory is exactly the same.
> I was hoping to use --fuzzy and --delete-after, but it doesn't seem to be
> speeding up the transfer. I am assuming that this is because I have both a
> change in name AND a change is size/modtime?
> I was looking into the find_fuzzy function, but i'm not sure if there's
> anything I can tweak in there to make this work.

I am using rsync for the exact same purpose, with very similar file
names and it seems to work just fine on 3.0.5 running on both Linux
and Windows (cwrsync).

Some possible causes I've encountered:
o The source files are compressed or encrypted, which will prevent
rsync from matching any blocks. gzip includes a special
"rsync-friendly" compression mode (the --rsyncable option, in builds
that support it), but all other popular forms of compression prevent
rsync from finding matches.
o The source files are very large, and the default rsync block size
for large files prevents matches from being found. You can try forcing
a smaller block size (trading CPU time for bandwidth).
o The source files are some sort of indexed database files (SQL
Server uses a .bak extension). If you rebuild or refresh database
indexes between your backups, this actually changes every page of the
database, preventing rsync from finding matches. Also, if you use
non-sequential clustered index keys, even a small amount of changed
data can result in updates to nearly every database page.
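
The first cause can be illustrated with a small sketch. It assumes made-up data and an illustrative 2 KB block size, and it uses plain substring search as a stand-in for rsync's rolling checksum, which can match a block of the old file at any byte offset in the new one. All names here are hypothetical.

```python
import random
import zlib

BLOCK = 2048  # illustrative block size in bytes, not rsync's default

def matched_fraction(old: bytes, new: bytes, block: int = BLOCK) -> float:
    """Fraction of old's fixed-size blocks that occur at any offset in new."""
    blocks = [old[i:i + block] for i in range(0, len(old), block)]
    return sum(1 for b in blocks if b in new) / len(blocks)

# Build a compressible "backup" from a small vocabulary (made-up data).
rng = random.Random(0)
words = [b"order", b"customer", b"invoice", b"2009", b"item", b"total", b"NULL"]
old = b" ".join(rng.choice(words) for _ in range(50_000))

# The next day's file: one small edit near the start, plus some appended rows.
new = (old[:100] + b"a small edit near the start of the file"
       + old[150:] + b" appended rows" * 200)

raw_fraction = matched_fraction(old, new)
compressed_fraction = matched_fraction(zlib.compress(old), zlib.compress(new))

print(f"uncompressed blocks reusable: {raw_fraction:.1%}")
print(f"compressed blocks reusable:   {compressed_fraction:.1%}")
```

In the uncompressed files only the block containing the edit is lost. Once both files are compressed, the edit changes the compressed stream from that point on, so far fewer blocks of the old file can be reused.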

-- 
RPM

Ryan Malayter

2009-May-20 19:29 UTC

On Wed, May 20, 2009 at 2:26 AM, Julian Pace Ross <linux@prisma.com.mt> wrote:
> Thanks Ryan!
> In fact I found it's a combination of factors you mentioned... i.e. a
> compressed SQL .bak file, so contrary to what I thought, the fuzzy file was
> indeed being found but no matches were being found in the file... thanks
> again for the info.

If you have the disk space at both ends, I would suggest doing what I
do for SQL backup synchronization.

1) Write *uncompressed* .bak files for your databases (with timestamps
in the file name, such as those produced by the database maintenance
plan engine). This enables the use of --fuzzy, as you have discovered.
2) Use rsync to transfer the uncompressed files, but with the -z
option enabled. This compresses the data over the wire, but
decompresses it at the receiving end.
3) Adjust the rsync block size to something smaller if necessary to
find more matches. I basically went down to 32KB rsync blocks for one
15 GB database file (rsync would by default use something like 129KB
on a file this big). This eats up a lot more CPU, but if rsync can
still output data faster than your network connection can handle, it
is the most time-efficient way to go. Use multiples of 8KB, as that is
the internal page size inherent in MS SQL Server databases. Trial and
error is your friend here. Run rsync with low priority (START /LOW
rsync.exe) so the CPU usage doesn't impact SQL Server.
4) Minimize any jobs you have to automatically rebuild indexes. Use
UPDATE STATISTICS instead on a daily basis, and rebuild only when
index fragmentation gets heavy. There are lots of scripts out there on
the net which will automate that for you.
5) Minimize the rebuilds of denormalized "reporting" tables or other
non-essential data. Move these off into other databases that you don't
replicate if possible.
6) Watch out for non-sequential clustered indexes. We use GUIDs for
primary keys on many tables, and this causes updates and inserts to be
spread randomly throughout the table as it is physically stored. Even
changing just 5% of the data can result in a change to every database
page in such a scenario. Hot tables which use emails or other VARCHAR
fields as clustered index keys show similar behavior.
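
The effect described in point 6 can be sketched with a deliberately simplified model: a table of 1,000 pages holding 100 rows each, receiving 5% new rows. Random keys (such as GUIDs) drop each insert on a random existing page, while sequential keys append at the end. The model ignores page splits and everything else a real storage engine does, and all names and numbers are assumptions for illustration.

```python
import random

def touched_fraction(n_pages: int, n_inserts: int, sequential: bool, seed: int = 0) -> float:
    """Fraction of existing pages modified by n_inserts new rows (toy model)."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(n_inserts):
        if sequential:
            touched.add(n_pages - 1)             # sequential keys land on the last page
        else:
            touched.add(rng.randrange(n_pages))  # random keys land on any page
    return len(touched) / n_pages

N_PAGES = 1_000
N_INSERTS = 5_000  # 5% of 100,000 rows

random_keys = touched_fraction(N_PAGES, N_INSERTS, sequential=False)
sequential_keys = touched_fraction(N_PAGES, N_INSERTS, sequential=True)

print(f"pages touched with random keys:     {random_keys:.1%}")
print(f"pages touched with sequential keys: {sequential_keys:.1%}")
```

With random placement almost every page is touched at least once, and every touched page is a block rsync has to resend.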

Most of these suggestions would apply for rsyncing any sort of
database backup file... Exchange, PostgreSQL, Oracle, or even
(horror!) MySQL.
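
As a rough sketch of steps 1 to 3 above, the following helper assembles an rsync command line with --fuzzy, --delete-after, -z and a block size that is a multiple of 8 KB. The paths, host name and helper name are hypothetical; only the rsync options themselves come from the thread and the rsync manual. The block size is given in bytes.

```python
import shlex

def build_rsync_command(src_dir: str, dest: str, block_size_kb: int = 32) -> list:
    """Build an rsync argument list for syncing uncompressed, timestamped backups."""
    if block_size_kb <= 0 or block_size_kb % 8 != 0:
        # 8 KB is the SQL Server page size mentioned in step 3.
        raise ValueError("block size should be a positive multiple of 8 KB")
    return [
        "rsync",
        "-a",              # archive mode: recurse and preserve times/permissions
        "-z",              # compress on the wire only (step 2)
        "--fuzzy",         # use a similarly named file as the basis (step 1)
        "--delete-after",  # delete old files only after the transfer
        f"--block-size={block_size_kb * 1024}",  # smaller blocks (step 3)
        src_dir,
        dest,
    ]

# Hypothetical source directory and destination host.
cmd = build_rsync_command("/backups/sql/", "backup@remote.example:/backups/sql/")
print(shlex.join(cmd))
```

--delete-after matters here because --fuzzy can only reuse yesterday's file if it still exists on the receiver while today's file is being transferred. The resulting list can be passed to subprocess.run once the paths are adapted.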

-- 
RPM
