I've been trying to figure out why some large files are taking a long time to rsync (an 80GB file). With this file, the match process is taking days. I've set logging to verbose level 4. The output from match.c is at the point where it is writing out the "potential match at" message. In a 9 hour period the match output has moved from:

potential match at 14993337175 i=2976 sum=7c07ae74
potential match at 14993834514 i=3517 sum=0956772e
potential match at 14994673480 i=3232 sum=9be33b55
potential match at 14994912897 i=4739 sum=7b87587a
potential match at 14996877980 i=1453 sum=b7715246
potential match at 14999624225 i=906 sum=d9d831c6
potential match at 14999951039 i=2235 sum=6ca97091
potential match at 15001174331 i=3866 sum=12f966ee
potential match at 15001209073 i=2080 sum=783c7750
potential match at 15001399336 i=4522 sum=87f122e0
potential match at 15001543265 i=1360 sum=85dee02c
potential match at 15001770789 i=1637 sum=c55912e6
potential match at 15002913113 i=2783 sum=3fdbf408
potential match at 15004011466 i=3552 sum=ea7d0f44
potential match at 15005784863 i=2758 sum=cf9e00d6

to:

potential match at 19827231165 i=3880 sum=f0b58ab2
potential match at 19827785238 i=4099 sum=f3338531
potential match at 19827870435 i=1232 sum=6abf175c
potential match at 19829135485 i=4472 sum=1ed3674e
potential match at 19829758278 i=2705 sum=dc796cb7
potential match at 19830224336 i=2959 sum=f0bd8161
potential match at 19830896106 i=3185 sum=6f83947a
potential match at 19832087866 i=1306 sum=14b38acb
potential match at 19832536037 i=1411 sum=3de116db
potential match at 19833817328 i=102 sum=45a8d003
potential match at 19835208508 i=2706 sum=e326d8e4
potential match at 19836927143 i=1591 sum=e357d821
potential match at 19838869812 i=4324 sum=1b113e13
potential match at 19839194857 i=3894 sum=03e116c1
potential match at 19839789868 i=3285 sum=39139716

I believe this means that 4.8GB of the file has been processed in this 9 hour period? The block size is currently set manually to 1149728, 4 times the default value. Any idea why it would be taking so long to get through this portion of the sync process? Rsync version is 3.0.3 on both ends.

Rob
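
P.S. The 4.8GB figure is just arithmetic on the byte offsets in the log lines above, taking the last line of each batch; a rough sketch of the calculation (assuming those offsets are byte positions within the file being matched):

    # Back-of-the-envelope rate estimate from the last "potential match at"
    # offset in each batch, roughly 9 hours apart.
    start = 15005784863              # last offset in the first batch
    end = 19839789868                # last offset in the second batch
    seconds = 9 * 3600

    processed = end - start          # ~4.83 billion bytes
    print("%.2f GB processed" % (processed / 1e9))
    print("%.0f KB/s match rate" % (processed / seconds / 1024))

At that rate the remaining ~60GB of the file would need another four to five days, which is in line with what I'm seeing.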
The files are very similar, a maximum of about 5GB of data differences over 80GB. The CPU usage on both sides is low (3-5 percent) and the memory usage is low (11MB on the client, not sure on the server). The full rsync options are:

-ruvvvvityz --partial --partial-dir=.rsync-partial --links --ignore-case --preallocate --ignore-errors --stats --del --block-size=1149728 -I

I'm using the -I option to force a full sync, since date/time changes on database files are not a reliable indicator of changes. I'll try a block size of 1638400, although I have not seen a big change in moving it from about 287000 (the square-root default) to 1149728.

Rob
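
P.S. For reference, the block sizes I'm comparing and the number of block checksums each implies for this file (just a sketch; the file size is approximate and rsync's own rounding of the default will land slightly differently):

    import math

    filesize = 80 * 1024**3                  # ~80 GiB, approximate
    default_bs = int(math.sqrt(filesize))    # the default is roughly the square
                                             # root, about 287000 in my case
    for bs in (default_bs, 1149728, 1638400):
        blocks = filesize // bs + 1
        print("block size %8d -> %6d block checksums" % (bs, blocks))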
Rob Bosch wrote:
> I've been trying to figure out why some large files are taking a long
> time to rsync (an 80GB file). With this file, the match process is
> taking days. I've set logging to verbose level 4. The output from
> match.c is at the point where it is writing out the "potential match
> at" message. In a 9 hour period the match output has moved from:

Can you tell where the bottleneck is? Is it on the sender's CPU? The receiver's? The network? Local IO on either side?

> I believe this means that 4.8GB of the file has been processed in this
> 9 hour period? The block size is currently set manually to 1149728, 4
> times the default value.

Rsync does have some CPU-inefficient behavior for especially large files. However, it should not show up at the block size you are using (assuming the files are fairly identical). Try increasing it a little further, to 1638400 (80% utilization on the hash table), and see if things are any better. Are the files fairly identical?

Shachar
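
P.S. The 80% figure is just the block count relative to the size of the hash table the sender builds over the block checksums; roughly this, assuming a 16-bit (65536-slot) table:

    filesize = 80 * 1024**3      # ~80 GiB, approximate
    table_slots = 1 << 16        # assumed 65536-slot hash table on the sender
    for bs in (1149728, 1638400):
        blocks = filesize // bs + 1
        print("block size %7d -> %5d blocks, %3.0f%% hash load"
              % (bs, blocks, 100.0 * blocks / table_slots))

1149728 gives slightly more blocks than slots; 1638400 brings the load down to about 80%.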
I believe I've figured out why the process was taking so long... or at least I have a theory. In the end it appears that much of the data was being sent even though the "true" amount of data change was less than 7% of the file size.

Exchange uses a database page size of 4K. Many times a page is deleted and then new data is written to that page (delete a message, a new message arrives). Exchange will try to keep the data file size constant by reusing freed-up space, and it will do online "defragmentation" nightly by default. Defragmentation might be the wrong term, because online defragmentation really "makes additional database space available by detecting and removing database objects that are no longer being used."

Although only 7% of the file is changing, the overall number of changed data pages would approach 1.5 million, and in all likelihood these pages are spread throughout the file. So if the usual approach of making the block size larger is used, rsync actually performs worse, because a change in a single 4K data page (a likely occurrence) causes the entire block containing it to be sent. This is what I was seeing in the earlier tests: increasing the block size decreased performance by sending more data.

When I changed the block size to be close to the default of sqrt(filesize), but rounded down to a multiple of 4K, rsync performance is much better. The 4K-rounded block size performs better than the default (in this case, 262144). I'm continuing to test to find the "best" block size for these types of files. I'm just sending this info for future reference for those using rsync for large Exchange files or other database files.

Rob
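
P.S. For anyone who wants the arithmetic, this is roughly how I'm choosing the block size now (a sketch; the file size is approximate, 4096 is the Exchange page size, and the rounding is simply floor to a 4K multiple):

    import math

    filesize = 80 * 1024**3            # ~80 GiB Exchange database, approximate
    page = 4096                        # Exchange database page size

    # Start near the square-root default, then round down so each rsync
    # block covers a whole number of 4K database pages.
    sqrt_bs = int(math.sqrt(filesize))
    block_size = (sqrt_bs // page) * page

    print(sqrt_bs, "->", block_size,
          "(%d pages per block)" % (block_size // page))

My theory on why the alignment helps: when the block size is a multiple of 4K and the pages change in place, a rewritten page falls inside a single block, whereas with an unaligned block size one changed page can straddle two blocks and force both to be resent.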