Olivier Lachambre
2002-Jun-30 09:24 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
Hello,

Another French student on the rsync mailing list. I have been working on rsync this year for a documentation project at school, and I would like to make a comment about rsync block size optimization first, and then to suggest a way for rsync to choose the optimal block size by itself when updating a large number of files.

First, the comment: during my work I wanted to verify that the theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis is actually the right one. When running tests on randomly generated and randomly modified files, I found that sqrt(78*n/Q) is the actual optimal block size. I tried to understand this by reading the whole thesis and then quite a lot of documentation about rsync, but I just can't figure out why the theoretical and experimental optimal block sizes differ so much. I _really_ don't think it's coming from my tests; there must be something else. Maybe the rsync developers have changed some part of the algorithm. Also, even without using data compression during the sync, rsync is always more efficient than it should be theoretically, actually between 1.5 and 2 times more efficient. Nobody will complain about that, but I'd be happy if someone would be kind enough to explain this to me.

Now the auto-optimization algorithm for updating many files at a time. Consider a set of files to be updated. We will consider only the files which have changed since the last update (we can find the unchanged ones, for example, by sending an MD5 sum for each file and trying to match it). We sync the first file, but the client keeps the old local version, so it can count the differences between the two files and then estimate the optimal block size. We assume that the percentage of difference is roughly the same across the files in the set. So for the second file we use the optimal size found for the first file; for the third file we use the (arithmetic or geometric?) average of the first two files, and so on. Once we have synced a certain number of files (10? 100?) we keep using the same size, which is assumed to be the best one.

Sorry this is so long; I hope you'll understand everything.

Olivier

P.S. I am not a programmer of any kind, so don't expect me to write any line of C (I know I'm a bad boy).

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
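A minimal Python sketch of the running-average heuristic described above. The function names, the geometric mean, the 100-file history cap and the 700-byte fallback are all illustrative assumptions, not part of the proposal or of rsync itself; the per-file estimate simply applies the sqrt(s*n/Q) form discussed in this thread.

    import math

    def optimal_block_size(file_size, num_changes, sig_bytes=24):
        """Block size minimising (roughly) literal data plus signature data,
        i.e. sqrt(sig_bytes * n / Q) as discussed in this thread."""
        if num_changes == 0:
            return file_size              # nothing changed; any block size works
        return max(1, int(math.sqrt(sig_bytes * file_size / num_changes)))

    def next_block_size(history, fallback=700, max_history=100):
        """Geometric mean of the per-file estimates seen so far; falls back to a
        fixed default (700 bytes here is only a placeholder) until at least one
        file has been measured."""
        if not history:
            return fallback
        sample = history[-max_history:]
        return int(round(math.exp(sum(math.log(b) for b in sample) / len(sample))))

    # Intended use: after syncing each file, estimate Q by comparing the kept
    # local copy with the new version, append optimal_block_size(n, Q) to
    # `history`, and pass next_block_size(history) as rsync's -B for the next file.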
Donovan Baarda
2002-Jun-30 16:20 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
[...]
> First, the comment: during my work I wanted to verify that the theoretical
> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
> is actually the right one. When running tests on randomly generated and
> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
> block size. I tried to understand this by reading the whole thesis and then
> quite a lot of documentation about rsync, but I just can't figure out why
> the theoretical and experimental optimal block sizes differ so much. I
> _really_ don't think it's coming from my tests; there must be something
> else. Maybe the rsync developers have changed some part of the algorithm.
> Also, even without using data compression during the sync, rsync is always
> more efficient than it should be theoretically, actually between 1.5 and 2
> times more efficient. Nobody will complain about that, but I'd be happy if
> someone would be kind enough to explain this to me.

I believe that the compression option turns on compression of transferred data, including file lists, instruction streams etc. Even with compression turned off, the miss data in the delta is still compressed.

Another thing to be aware of is that when rsync compresses miss data, it also compresses all the hit data to prime the compressor with context, and throws away the compressed output for the hit data. Perhaps this "context compression" is affecting the optimal block size?

> Now the auto-optimization algorithm for updating many files at a time.
> Consider a set of files to be updated. We will consider only the files
> which have changed since the last update (we can find the unchanged ones,
> for example, by sending an MD5 sum for each file and trying to match it).
> We sync the first file, but the client keeps the old local version, so it
> can count the differences between the two files and then estimate the
> optimal block size. We assume that the percentage of difference is roughly
> the same across the files in the set. So for the second file we use the
> optimal size found for the first file; for the third file we use the
> (arithmetic or geometric?) average of the first two files, and so on. Once
> we have synced a certain number of files (10? 100?) we keep using the same
> size, which is assumed to be the best one.
> Sorry this is so long; I hope you'll understand everything.
> Olivier

This relies on the optimal block size being correlated across a set of files. I'm not sure that this heuristic actually applies, and I don't know how much benefit this would buy for all the complexity it would add.

> P.S. I am not a programmer of any kind, so don't expect me to write any
> line of C (I know I'm a bad boy).

It might be an interesting experiment to write a wrapper for rsync that accumulated stats on all the transfers it was used for and "learned" the optimal block size to use. This would not require any changes to rsync, and could be written in something like Python.

--
----------------------------------------------------------------------
ABO: finger abo@minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------
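A rough Python sketch of the wrapper experiment suggested above. It uses rsync's real -B/--block-size and --stats options, but the parsing of the "Total bytes sent/received" lines is an assumption about the --stats output format (true of current rsync, not necessarily of every old version), and the stats-file path, candidate sizes, three-run threshold and "learning" rule are purely illustrative.

    import json, re, subprocess
    from pathlib import Path
    from statistics import mean

    STATS = Path.home() / ".rsync_blocksize_stats.json"   # hypothetical stats file
    CANDIDATES = [300, 500, 700, 1000, 1500]               # illustrative sizes

    def wire_bytes(args, block_size):
        """One rsync run at the given block size; returns bytes sent + received
        as reported by --stats (output format assumed, see note above)."""
        out = subprocess.run(["rsync", "--stats", "-B", str(block_size)] + args,
                             capture_output=True, text=True, check=True).stdout
        return sum(int(m.replace(",", "")) for m in
                   re.findall(r"Total bytes (?:sent|received):\s+([\d,]+)", out))

    def learned_rsync(args):
        """Cycle through candidate block sizes on successive invocations,
        accumulate the observed traffic, then stick with the cheapest."""
        stats = json.loads(STATS.read_text()) if STATS.exists() else {}
        key = " ".join(args)
        runs = stats.setdefault(key, {str(b): [] for b in CANDIDATES})
        untried = [b for b in CANDIDATES if len(runs.get(str(b), [])) < 3]
        if untried:
            block = untried[0]
        else:
            block = min(CANDIDATES, key=lambda b: mean(runs[str(b)]))
        runs.setdefault(str(block), []).append(wire_bytes(args, block))
        STATS.write_text(json.dumps(stats))
        return block

Usage would be something like learned_rsync(["-a", "src/", "host:dest/"]); after each candidate has been tried a few times, the wrapper keeps reusing whichever size produced the least traffic on average for that source/destination pair.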
tridge@samba.org
2002-Jun-30 17:11 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
Olivier,

> First, the comment: during my work I wanted to verify that the theoretical
> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
> is actually the right one. When running tests on randomly generated and
> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
> block size. I tried to understand this by reading the whole thesis and then
> quite a lot of documentation about rsync, but I just can't figure out why
> the theoretical and experimental optimal block sizes differ so much. I
> _really_ don't think it's coming from my tests; there must be something
> else.

First off, you need to make sure you are taking into account the conditions I mentioned for that optimal size to be correct. In particular I assumed:

  If, for example, we assume that the two files are the same except for
  Q sequences of bytes, with each sequence smaller than the block size
  and separated by more than the block size from the next sequence

In practice there is no 'correct' model for real files, so I chose a simple model that I thought would give a reasonable approximation while being easy to analyse.

Also, you didn't take into account that the function I gave was for the simpler version of rsync that I introduced in chapter 3. Later in the thesis I discuss how s_s can be reduced without compromising the algorithm (see 'Smaller Signatures' in chapter 4). That changes the calculation of optimal block size quite a bit.

Thanks for looking at this though. I haven't thought closely about this algorithm in a long time!

Cheers, Tridge
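One way a formula of this shape can arise, written out as a rough reconstruction under the simple model quoted above (Q isolated changes, each spoiling roughly one block of L bytes, plus a per-block signature of s bytes); this is not a quotation from the thesis, and the constants are whatever the protocol details make them.

    % Assumed cost model: each of the Q isolated changes forces roughly one
    % block of literal data (L bytes each), and each of the n/L blocks
    % contributes a signature of s bytes.
    T(L) \approx Q\,L + s\,\frac{n}{L}
    % Minimising over L:
    \frac{dT}{dL} = Q - s\,\frac{n}{L^{2}} = 0
    \quad\Longrightarrow\quad
    L^{*} = \sqrt{\frac{s\,n}{Q}}
    % With s = 24 bytes per block this gives sqrt(24 n / Q).  If each literal
    % byte costs only c < 1 bytes on the wire (e.g. because miss data is
    % compressed), the first term becomes c Q L and the optimum grows to
    % sqrt(s n / (c Q)).

Either a smaller effective signature or cheaper literal data rescales the constant under the square root, which is one possible reason a measured optimum could differ from sqrt(24*n/Q).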
jw schultz
2002-Jun-30 17:46 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
I dislike flaming and I don't intend my comments as flame, but you have made some statements that at face value are problematic. Perhaps you could correct my misunderstanding.

On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
> Hello,
> Another French student on the rsync mailing list. I have been working on
> rsync this year for a documentation project at school, and I would like to
> make a comment about rsync block size optimization first, and then to
> suggest a way for rsync to choose the optimal block size by itself when
> updating a large number of files.
>
> First, the comment: during my work I wanted to verify that the theoretical
> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
> is actually the right one. When running tests on randomly generated and
> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
> block size. I tried to understand this by reading the whole thesis and then
> quite a lot of documentation about rsync, but I just can't figure out why
> the theoretical and experimental optimal block sizes differ so much. I
> _really_ don't think it's coming from my tests; there must be something
> else. Maybe the rsync developers have changed some part of the algorithm.
> Also, even without using data compression during the sync, rsync is always
> more efficient than it should be theoretically, actually between 1.5 and 2
> times more efficient. Nobody will complain about that, but I'd be happy if
> someone would be kind enough to explain this to me.

Firstly, I'll give Andrew the benefit of the doubt. His track record has been very good.

In the real world, file generation and modification are not random. The nearest to random any data gets is when it has been encrypted. Compression does increase the randomness of new data somewhat, but modifications are hardly random.

Most real files contain mostly text, and as any cryptographer can tell you, text is decidedly non-random. Executables are similarly non-random, as they are encoded communication of nouns and verbs in the language of the CPU. Databases are highly structured and again contain mostly text. The nearest thing to random data you will find are files stored in a compressed format such as audio, image or newer office (not MS-Office) files such as Open Office.

To test your ideas you would do much better to observe live systems.

> P.S. I am not a programmer of any kind, so don't expect me to write any
> line of C (I know I'm a bad boy).

How does someone who is "not a programmer of any kind" randomly generate and modify files and run all these tests?

On a more pleasant level: you talk of between 1.5 and 2 times more efficient. That is rather vague. How much does this difference in efficiency affect overall network bandwidth, disk I/O, and memory and CPU utilisation? E.g. "changing X decreased bandwidth utilisation by 0.2% under ABC conditions." Doubling the efficiency of something that affects only 2% of the load (reducing it to 1.1%) on a plentiful resource (CPU) might be worthwhile but hardly exciting. Trimming network load by 12% would be very interesting to those rsyncing over slow internet links.

--
________________________________________________________________
       J.W. Schultz            Pegasystems Technologies
       email address:          jw@pegasys.ws

               Remember Cernan and Schmitt
Olivier Lachambre
2002-Jul-01 09:30 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 17:09 30/06/2002 -0700, you wrote:
> Olivier,
>
>> First, the comment: during my work I wanted to verify that the theoretical
>> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
>> is actually the right one. When running tests on randomly generated and
>> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
>> block size. I tried to understand this by reading the whole thesis and then
>> quite a lot of documentation about rsync, but I just can't figure out why
>> the theoretical and experimental optimal block sizes differ so much. I
>> _really_ don't think it's coming from my tests; there must be something
>> else.
>
> First off, you need to make sure you are taking into account the
> conditions I mentioned for that optimal size to be correct. In
> particular I assumed:
>
>   If, for example, we assume that the two files are the same except for
>   Q sequences of bytes, with each sequence smaller than the block size
>   and separated by more than the block size from the next sequence
>
> In practice there is no 'correct' model for real files, so I chose a
> simple model that I thought would give a reasonable approximation
> while being easy to analyse.

I did not explain what my tests were: I did not use real files, but a randomly generated file into which I inserted 1-byte differences, each separated from the next difference by much more than the block size.

> Also, you didn't take into account that the function I gave was for
> the simpler version of rsync that I introduced in chapter 3. Later in
> the thesis I discuss how s_s can be reduced without compromising the
> algorithm (see 'Smaller Signatures' in chapter 4). That changes the
> calculation of optimal block size quite a bit.

I think this is the main reason for such results in my tests. Thanks for answering my question.

Best regards,
Olivier

P.S. Maybe one day all this will be available somewhere on the Internet (I don't have time to do it now because of exams).

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
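A small Python sketch of generating a test pair of this kind (random content, isolated single-byte changes). The file names, sizes, change count and minimum gap are placeholders, not the values used in the original tests.

    import os, random

    def make_test_pair(n=1_000_000, q=100, min_gap=4096, prefix="test"):
        """Write prefix.old (n random bytes) and prefix.new, a copy containing
        q single-byte changes separated from each other by at least min_gap."""
        data = bytearray(os.urandom(n))
        with open(prefix + ".old", "wb") as f:
            f.write(data)
        step = n // q
        assert step > min_gap, "differences would be closer than min_gap"
        for i in range(q):
            pos = i * step + random.randrange(step - min_gap)
            data[pos] = (data[pos] + 1) % 256     # change exactly one byte
        with open(prefix + ".new", "wb") as f:
            f.write(data)

    # e.g. make_test_pair(q=250) gives a pair suitable for sweeping rsync's
    # -B option and measuring the traffic reported by --stats.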
Olivier Lachambre
2002-Jul-02 02:07 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 17:45 30/06/2002 -0700, you wrote:
> I dislike flaming and I don't intend my comments as flame, but you have
> made some statements that at face value are problematic. Perhaps you
> could correct my misunderstanding.
[...]
>> First, the comment: during my work I wanted to verify that the theoretical
>> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
>> is actually the right one. When running tests on randomly generated and
>> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
>> block size. I tried to understand this by reading the whole thesis and then
>> quite a lot of documentation about rsync, but I just can't figure out why
>> the theoretical and experimental optimal block sizes differ so much. I
>> _really_ don't think it's coming from my tests; there must be something
>> else. Maybe the rsync developers have changed some part of the algorithm.
>> Also, even without using data compression during the sync, rsync is always
>> more efficient than it should be theoretically, actually between 1.5 and 2
>> times more efficient. Nobody will complain about that, but I'd be happy if
>> someone would be kind enough to explain this to me.
>
> Firstly, I'll give Andrew the benefit of the doubt. His track record has
> been very good.

I think that the theoretical optimal block size sqrt(24*n/Q) given by Andrew in his PhD thesis was indeed the right one for the first version of rsync. Rsync has been modified considerably since, as he explained yesterday, which is why the former optimal block size is no longer accurate. That is not a problem, but it is still interesting to know.

> In the real world, file generation and modification are not random. The
> nearest to random any data gets is when it has been encrypted. Compression
> does increase the randomness of new data somewhat, but modifications are
> hardly random.
>
> Most real files contain mostly text, and as any cryptographer can tell
> you, text is decidedly non-random. Executables are similarly non-random,
> as they are encoded communication of nouns and verbs in the language of
> the CPU. Databases are highly structured and again contain mostly text.
> The nearest thing to random data you will find are files stored in a
> compressed format such as audio, image or newer office (not MS-Office)
> files such as Open Office.
>
> To test your ideas you would do much better to observe live systems.

I know that, fortunately, the real world is not random, but I used random files because this is the only way to test Andrew's theoretical work, which may later give rise to applications in the "real world". For example, in the "algorithm" I gave to auto-optimize rsync, you must choose a model to find the best size for a set of files, or you won't be able to calculate the optimal size for the set from the optimal size of each file. And to validate the model, dealing with random files first might not be the worst idea.

>> P.S. I am not a programmer of any kind, so don't expect me to write any
>> line of C (I know I'm a bad boy).
>
> How does someone who is "not a programmer of any kind" randomly generate
> and modify files and run all these tests?

Let me reassure you: I said I was not a programmer because I write almost only CaML code (you may know OCaML, which is, I hear, a wonderful and powerful language, but I use its ancestor CaML Light, http://caml.inria.fr/, which is rather experimental, so I don't do big programming, only algorithmics), which in my case is more maths/theory than real computing. This is the difference between theoretical and practical computing. But I can still write enough code to generate random files, which is more maths than real programming. To run all the tests, I just write bash scripts running under Linux. I synced a 1 MB random file with 100 sets of differences (from 10 to 1000 differences) and 200 block sizes (from 10 to 2000 bytes) for each set. To count how many bytes went through the network, I used the --stats flag. Finally, CaML programs and a computer algebra system (C.A.S.) process the whole set of results for me.

> On a more pleasant level: you talk of between 1.5 and 2 times more
> efficient. That is rather vague. How much does this difference in
> efficiency affect overall network bandwidth, disk I/O, and memory and CPU
> utilisation? E.g. "changing X decreased bandwidth utilisation by 0.2%
> under ABC conditions." Doubling the efficiency of something that affects
> only 2% of the load (reducing it to 1.1%) on a plentiful resource (CPU)
> might be worthwhile but hardly exciting. Trimming network load by 12%
> would be very interesting to those rsyncing over slow internet links.

I have decided that efficiency is only the speedup rate, which means I only care about network traffic, so between 1.5 and 2 times more efficient is simply "extremely exciting". The main goal of rsync is to save bandwidth, so I think my definition of efficiency is a reasonable one (I know what a very slow link is). I didn't look at CPU, disk I/O (although I broke one of my disks the other day) or memory this time. As I wrote in my last mail, all the details will be online one day (say, by the end of August), and you will see it is quite serious. You can ask me for the complete document right now, but it is not finished and still in French... (PDF, ~50 pages, ~700 kB)

Best regards,
Olivier

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
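A Python sketch of the kind of block-size sweep described in the message above. It assumes a local run, so --no-whole-file is passed to force rsync's delta algorithm; parsing the "Total bytes sent/received" lines is an assumption about the --stats output format, the sweep range is a placeholder rather than the 200 sizes used in the original tests, and the test.old/test.new names refer to the hypothetical generator sketched earlier in the thread.

    import re, shutil, subprocess

    def wire_bytes(src, dst, block_size):
        """Bytes sent plus received for one rsync run, per --stats."""
        out = subprocess.run(
            ["rsync", "--no-whole-file", "--stats", "-B", str(block_size), src, dst],
            capture_output=True, text=True, check=True).stdout
        return sum(int(m.replace(",", "")) for m in
                   re.findall(r"Total bytes (?:sent|received):\s+([\d,]+)", out))

    def sweep(old="test.old", new="test.new", sizes=range(100, 2001, 100)):
        """Try each block size against a fresh copy of the old file and
        report the one that produced the least traffic."""
        results = []
        for b in sizes:
            shutil.copyfile(old, old + ".work")   # start each run from the old version
            results.append((wire_bytes(new, old + ".work", b), b))
        best_bytes, best_b = min(results)
        print("best block size:", best_b, "with", best_bytes, "bytes on the wire")
        return results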
Olivier Lachambre
2002-Jul-02 02:07 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 09:19 01/07/2002 +1000, you wrote:
> [...]
> This relies on the optimal block size being correlated across a set of
> files. I'm not sure that this heuristic actually applies, and I don't know
> how much benefit this would buy for all the complexity it would add.

I think that many clients would not mind multiplying complexity by 2 or 5, even if the speedup rate only improves by a factor of 1.1.

Olivier

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
Olivier Lachambre
2002-Jul-21 10:55 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 12:36 03/07/2002 +1000, you wrote:
> On Tue, Jul 02, 2002 at 11:06:30AM +0200, Olivier Lachambre wrote:
>> At 09:19 01/07/2002 +1000, you wrote:
>>
>>> [...]
>>> This relies on the optimal block size being correlated across a set of
>>> files. I'm not sure that this heuristic actually applies, and I don't
>>> know how much benefit this would buy for all the complexity it would add.
>>
>> I think that many clients would not mind multiplying complexity by 2 or
>> 5, even if the speedup rate only improves by a factor of 1.1.
>
> The only problem is, multiplying complexity by 2 or 5 multiplies the bugs
> by 2 or 5. Are those clients still excited by the 1.1x speedup now?
>
> The other problem with increasing complexity is the "setting" effect that
> complexity causes. The more complex code becomes, the harder it gets to
> modify, so the 1.1x speedup that introduced 5x complexity now makes it
> impossible to implement a much better 1.5x speedup.
>
> In my experience, complex solutions usually have worse performance than
> simple ones anyway.

When talking about complexity I mean complexity as defined in theoretical computer science (at least that is what we call it in French). In that sense, complexity relates to the time the algorithm needs to execute, not to the size of the source code or of the binary. For example:

  for i = 0 to n do
    for j = 0 to n do
      (nothing)
    done
  done

has a complexity proportional to n^2, which is sometimes a lot, but the program is not complex in any way.

At home I would be happy to save 10% of the time spent on the Internet (as local calls are not free), and I don't mind if my computer works two or five times harder, since I have a 56K modem. I suspect that the problem for many users is low bandwidth rather than low CPU resources (I know there are exceptions). I think that the code for the algorithm I proposed would not be complicated; it would simply take more time to run.

Olivier

__________
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD (F)
<lachambre@club-internet.fr>