Olivier Lachambre
2002-Jun-30 09:24 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
Hello,

Another French student on the rsync mailing list. I have been working on rsync this year for a documentation project at school, and I would like to make a comment about rsync block size optimization first, and then to suggest a way for rsync to choose the optimal block size by itself when updating a large number of files.

First, the comment: during my work I wanted to verify that the theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis is actually the right one. When running tests on randomly generated and randomly modified files, I found that sqrt(78*n/Q) is the actual optimal block size. I tried to understand this by reading the whole thesis and then quite a lot of documentation about rsync, but I just can't figure out why the theoretical and experimental optimal block sizes differ so much. I _really_ don't think it's coming from my tests; there must be something else. Maybe the rsync developers have changed some part of the algorithm. Also, even without using data compression during the sync, rsync is always more efficient than it should be theoretically, actually between 1.5 and 2 times more efficient. Nobody will complain about that, but I'd be happy if someone would be kind enough to explain this to me.

Now the auto-optimization algorithm for updating many files at a time. Consider a set of files to be updated. We will consider only the files which have changed since the last update (we can find the unchanged ones, for example, by sending an MD5 sum for each file and trying to match it). We sync the first file, but the client keeps the old local version, so it can count the differences between the two files and then estimate the optimal block size. We assume that the percentage of difference is roughly the same across the files in the set. So for the second file we use the optimal size found for the first file; for the third file we use the (arithmetic or geometric?) average of the first two files, and so on. Once we have synced a certain number of files (10? 100?) we keep using the same size, which is assumed to be the best one.

Sorry this is so long; I hope you'll understand everything.

Olivier

P.S. I am not a programmer of any kind, so don't expect me to write any line of C (I know I'm a bad boy).

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
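A minimal Python sketch of the running-average heuristic described above. The function names, the geometric mean, the 100-file history cap and the 700-byte fallback are all illustrative assumptions, not part of the proposal or of rsync itself; the per-file estimate simply applies the sqrt(s*n/Q) form discussed in this thread.

    import math

    def optimal_block_size(file_size, num_changes, sig_bytes=24):
        """Block size minimising (roughly) literal data plus signature data,
        i.e. sqrt(sig_bytes * n / Q) as discussed in this thread."""
        if num_changes == 0:
            return file_size              # nothing changed; any block size works
        return max(1, int(math.sqrt(sig_bytes * file_size / num_changes)))

    def next_block_size(history, fallback=700, max_history=100):
        """Geometric mean of the per-file estimates seen so far; falls back to a
        fixed default (700 bytes here is only a placeholder) until at least one
        file has been measured."""
        if not history:
            return fallback
        sample = history[-max_history:]
        return int(round(math.exp(sum(math.log(b) for b in sample) / len(sample))))

    # Intended use: after syncing each file, estimate Q by comparing the kept
    # local copy with the new version, append optimal_block_size(n, Q) to
    # `history`, and pass next_block_size(history) as rsync's -B for the next file.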
Donovan Baarda
2002-Jun-30 16:20 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
[...]
> First, the comment: during my work I wanted to verify that the theoretical
> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
> is actually the right one. When running tests on randomly generated and
> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
> block size. I tried to understand this by reading the whole thesis and then
> quite a lot of documentation about rsync, but I just can't figure out why
> the theoretical and experimental optimal block sizes differ so much. I
> _really_ don't think it's coming from my tests; there must be something
> else. Maybe the rsync developers have changed some part of the algorithm.
> Also, even without using data compression during the sync, rsync is always
> more efficient than it should be theoretically, actually between 1.5 and 2
> times more efficient. Nobody will complain about that, but I'd be happy if
> someone would be kind enough to explain this to me.

I believe that the compression option turns on compression of transferred data, including file lists, instruction streams etc. Even with compression turned off, the miss data in the delta is still compressed.

Another thing to be aware of is that when rsync compresses miss data, it also compresses all the hit data to prime the compressor with context, and throws away the compressed output for the hit data. Perhaps this "context compression" is affecting the optimal block size?

> Now the auto-optimization algorithm for updating many files at a time.
> Consider a set of files to be updated. We will consider only the files
> which have changed since the last update (we can find the unchanged ones,
> for example, by sending an MD5 sum for each file and trying to match it).
> We sync the first file, but the client keeps the old local version, so it
> can count the differences between the two files and then estimate the
> optimal block size. We assume that the percentage of difference is roughly
> the same across the files in the set. So for the second file we use the
> optimal size found for the first file; for the third file we use the
> (arithmetic or geometric?) average of the first two files, and so on. Once
> we have synced a certain number of files (10? 100?) we keep using the same
> size, which is assumed to be the best one.
> Sorry this is so long; I hope you'll understand everything.
> Olivier

This relies on the optimal block size being correlated across a set of files. I'm not sure that this heuristic actually applies, and I don't know how much benefit this would buy for all the complexity it would add.

> P.S. I am not a programmer of any kind, so don't expect me to write any
> line of C (I know I'm a bad boy).

It might be an interesting experiment to write a wrapper for rsync that accumulated stats on all the transfers it was used for and "learned" the optimal block size to use. This would not require any changes to rsync, and could be written in something like Python.

--
----------------------------------------------------------------------
ABO: finger abo@minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------
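A rough Python sketch of the wrapper experiment suggested above. It uses rsync's real -B/--block-size and --stats options, but the parsing of the "Total bytes sent/received" lines is an assumption about the --stats output format (true of current rsync, not necessarily of every old version), and the stats-file path, candidate sizes, three-run threshold and "learning" rule are purely illustrative.

    import json, re, subprocess
    from pathlib import Path
    from statistics import mean

    STATS = Path.home() / ".rsync_blocksize_stats.json"   # hypothetical stats file
    CANDIDATES = [300, 500, 700, 1000, 1500]               # illustrative sizes

    def wire_bytes(args, block_size):
        """One rsync run at the given block size; returns bytes sent + received
        as reported by --stats (output format assumed, see note above)."""
        out = subprocess.run(["rsync", "--stats", "-B", str(block_size)] + args,
                             capture_output=True, text=True, check=True).stdout
        return sum(int(m.replace(",", "")) for m in
                   re.findall(r"Total bytes (?:sent|received):\s+([\d,]+)", out))

    def learned_rsync(args):
        """Cycle through candidate block sizes on successive invocations,
        accumulate the observed traffic, then stick with the cheapest."""
        stats = json.loads(STATS.read_text()) if STATS.exists() else {}
        key = " ".join(args)
        runs = stats.setdefault(key, {str(b): [] for b in CANDIDATES})
        untried = [b for b in CANDIDATES if len(runs.get(str(b), [])) < 3]
        if untried:
            block = untried[0]
        else:
            block = min(CANDIDATES, key=lambda b: mean(runs[str(b)]))
        runs.setdefault(str(block), []).append(wire_bytes(args, block))
        STATS.write_text(json.dumps(stats))
        return block

Usage would be something like learned_rsync(["-a", "src/", "host:dest/"]); after each candidate has been tried a few times, the wrapper keeps reusing whichever size produced the least traffic on average for that source/destination pair.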
tridge@samba.org
2002-Jun-30 17:11 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
Olivier,

> First, the comment: during my work I wanted to verify that the theoretical
> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
> is actually the right one. When running tests on randomly generated and
> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
> block size. I tried to understand this by reading the whole thesis and then
> quite a lot of documentation about rsync, but I just can't figure out why
> the theoretical and experimental optimal block sizes differ so much. I
> _really_ don't think it's coming from my tests; there must be something
> else.

First off, you need to make sure you are taking into account the conditions I mentioned for that optimal size to be correct. In particular I assumed:

  If, for example, we assume that the two files are the same except for
  Q sequences of bytes, with each sequence smaller than the block size
  and separated by more than the block size from the next sequence

In practice there is no 'correct' model for real files, so I chose a simple model that I thought would give a reasonable approximation while being easy to analyse.

Also, you didn't take into account that the function I gave was for the simpler version of rsync that I introduced in chapter 3. Later in the thesis I discuss how s_s can be reduced without compromising the algorithm (see 'Smaller Signatures' in chapter 4). That changes the calculation of optimal block size quite a bit.

Thanks for looking at this though. I haven't thought closely about this algorithm in a long time!

Cheers, Tridge
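One way a formula of this shape can arise, written out as a rough reconstruction under the simple model quoted above (Q isolated changes, each spoiling roughly one block of L bytes, plus a per-block signature of s bytes); this is not a quotation from the thesis, and the constants are whatever the protocol details make them.

    % Assumed cost model: each of the Q isolated changes forces roughly one
    % block of literal data (L bytes each), and each of the n/L blocks
    % contributes a signature of s bytes.
    T(L) \approx Q\,L + s\,\frac{n}{L}
    % Minimising over L:
    \frac{dT}{dL} = Q - s\,\frac{n}{L^{2}} = 0
    \quad\Longrightarrow\quad
    L^{*} = \sqrt{\frac{s\,n}{Q}}
    % With s = 24 bytes per block this gives sqrt(24 n / Q).  If each literal
    % byte costs only c < 1 bytes on the wire (e.g. because miss data is
    % compressed), the first term becomes c Q L and the optimum grows to
    % sqrt(s n / (c Q)).

Either a smaller effective signature or cheaper literal data rescales the constant under the square root, which is one possible reason a measured optimum could differ from sqrt(24*n/Q).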
jw schultz
2002-Jun-30 17:46 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
I dislike flaming and I don't intend my comments as flame, but you have made some statements that at face value are problematic. Perhaps you could correct my misunderstanding.

On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
> Hello,
> Another French student on the rsync mailing list. I have been working on
> rsync this year for a documentation project at school, and I would like to
> make a comment about rsync block size optimization first, and then to
> suggest a way for rsync to choose the optimal block size by itself when
> updating a large number of files.
>
> First, the comment: during my work I wanted to verify that the theoretical
> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
> is actually the right one. When running tests on randomly generated and
> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
> block size. I tried to understand this by reading the whole thesis and then
> quite a lot of documentation about rsync, but I just can't figure out why
> the theoretical and experimental optimal block sizes differ so much. I
> _really_ don't think it's coming from my tests; there must be something
> else. Maybe the rsync developers have changed some part of the algorithm.
> Also, even without using data compression during the sync, rsync is always
> more efficient than it should be theoretically, actually between 1.5 and 2
> times more efficient. Nobody will complain about that, but I'd be happy if
> someone would be kind enough to explain this to me.

Firstly, I'll give Andrew the benefit of the doubt. His track record has been very good.

In the real world, file generation and modification are not random. The nearest to random any data gets is when it has been encrypted. Compression does increase the randomness of new data somewhat, but modifications are hardly random.

Most real files contain mostly text, and as any cryptographer can tell you, text is decidedly non-random. Executables are similarly non-random, as they are encoded communication of nouns and verbs in the language of the CPU. Databases are highly structured and again contain mostly text. The nearest thing to random data you will find are files stored in a compressed format such as audio, image or newer office (not MS-Office) files such as Open Office.

To test your ideas you would do much better to observe live systems.

> P.S. I am not a programmer of any kind, so don't expect me to write any
> line of C (I know I'm a bad boy).

How does someone who is "not a programmer of any kind" randomly generate and modify files and run all these tests?

On a more pleasant level: you talk of between 1.5 and 2 times more efficient. That is rather vague. How much does this difference in efficiency affect overall network bandwidth, disk I/O, and memory and CPU utilisation? E.g. "changing X decreased bandwidth utilisation by 0.2% under ABC conditions." Doubling the efficiency of something that affects only 2% of the load (reducing it to 1.1%) on a plentiful resource (CPU) might be worthwhile but hardly exciting. Trimming network load by 12% would be very interesting to those rsyncing over slow internet links.

--
________________________________________________________________
       J.W. Schultz            Pegasystems Technologies
       email address:          jw@pegasys.ws

               Remember Cernan and Schmitt
Olivier Lachambre
2002-Jul-01 09:30 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 17:09 30/06/2002 -0700, you wrote:
> Olivier,
>
>> First, the comment: during my work I wanted to verify that the theoretical
>> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
>> is actually the right one. When running tests on randomly generated and
>> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
>> block size. I tried to understand this by reading the whole thesis and then
>> quite a lot of documentation about rsync, but I just can't figure out why
>> the theoretical and experimental optimal block sizes differ so much. I
>> _really_ don't think it's coming from my tests; there must be something
>> else.
>
> First off, you need to make sure you are taking into account the
> conditions I mentioned for that optimal size to be correct. In
> particular I assumed:
>
>   If, for example, we assume that the two files are the same except for
>   Q sequences of bytes, with each sequence smaller than the block size
>   and separated by more than the block size from the next sequence
>
> In practice there is no 'correct' model for real files, so I chose a
> simple model that I thought would give a reasonable approximation
> while being easy to analyse.

I did not explain what my tests were: I did not use real files, but a randomly generated file into which I inserted 1-byte differences, each separated from the next difference by much more than the block size.

> Also, you didn't take into account that the function I gave was for
> the simpler version of rsync that I introduced in chapter 3. Later in
> the thesis I discuss how s_s can be reduced without compromising the
> algorithm (see 'Smaller Signatures' in chapter 4). That changes the
> calculation of optimal block size quite a bit.

I think this is the main reason for such results in my tests. Thanks for answering my question.

Best regards,
Olivier

P.S. Maybe one day all this will be available somewhere on the Internet (I don't have time to do it now because of exams).

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
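A small Python sketch of generating a test pair of this kind (random content, isolated single-byte changes). The file names, sizes, change count and minimum gap are placeholders, not the values used in the original tests.

    import os, random

    def make_test_pair(n=1_000_000, q=100, min_gap=4096, prefix="test"):
        """Write prefix.old (n random bytes) and prefix.new, a copy containing
        q single-byte changes separated from each other by at least min_gap."""
        data = bytearray(os.urandom(n))
        with open(prefix + ".old", "wb") as f:
            f.write(data)
        step = n // q
        assert step > min_gap, "differences would be closer than min_gap"
        for i in range(q):
            pos = i * step + random.randrange(step - min_gap)
            data[pos] = (data[pos] + 1) % 256     # change exactly one byte
        with open(prefix + ".new", "wb") as f:
            f.write(data)

    # e.g. make_test_pair(q=250) gives a pair suitable for sweeping rsync's
    # -B option and measuring the traffic reported by --stats.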
Olivier Lachambre
2002-Jul-02 02:07 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 17:45 30/06/2002 -0700, you wrote:
> I dislike flaming and I don't intend my comments as flame, but you have
> made some statements that at face value are problematic. Perhaps you
> could correct my misunderstanding.
[...]
>> First, the comment: during my work I wanted to verify that the theoretical
>> optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis
>> is actually the right one. When running tests on randomly generated and
>> randomly modified files, I found that sqrt(78*n/Q) is the actual optimal
>> block size. I tried to understand this by reading the whole thesis and then
>> quite a lot of documentation about rsync, but I just can't figure out why
>> the theoretical and experimental optimal block sizes differ so much. I
>> _really_ don't think it's coming from my tests; there must be something
>> else. Maybe the rsync developers have changed some part of the algorithm.
>> Also, even without using data compression during the sync, rsync is always
>> more efficient than it should be theoretically, actually between 1.5 and 2
>> times more efficient. Nobody will complain about that, but I'd be happy if
>> someone would be kind enough to explain this to me.
>
> Firstly, I'll give Andrew the benefit of the doubt. His track record has
> been very good.

I think that the theoretical optimal block size sqrt(24*n/Q) given by Andrew in his PhD thesis was indeed the right one for the first version of rsync. Rsync has been modified considerably since, as he explained yesterday, which is why the former optimal block size is no longer accurate. That is not a problem, but it is still interesting to know.

> In the real world, file generation and modification are not random. The
> nearest to random any data gets is when it has been encrypted. Compression
> does increase the randomness of new data somewhat, but modifications are
> hardly random.
>
> Most real files contain mostly text, and as any cryptographer can tell
> you, text is decidedly non-random. Executables are similarly non-random,
> as they are encoded communication of nouns and verbs in the language of
> the CPU. Databases are highly structured and again contain mostly text.
> The nearest thing to random data you will find are files stored in a
> compressed format such as audio, image or newer office (not MS-Office)
> files such as Open Office.
>
> To test your ideas you would do much better to observe live systems.

I know that, fortunately, the real world is not random, but I used random files because this is the only way to test Andrew's theoretical work, which may later give rise to applications in the "real world". For example, in the "algorithm" I gave to auto-optimize rsync, you must choose a model to find the best size for a set of files, or you won't be able to calculate the optimal size for the set from the optimal size of each file. And to validate the model, dealing with random files first might not be the worst idea.

>> P.S. I am not a programmer of any kind, so don't expect me to write any
>> line of C (I know I'm a bad boy).
>
> How does someone who is "not a programmer of any kind" randomly generate
> and modify files and run all these tests?

Let me reassure you: I said I was not a programmer because I write almost only CaML code (you may know OCaML, which is, I hear, a wonderful and powerful language, but I use its ancestor CaML Light, http://caml.inria.fr/, which is rather experimental, so I don't do big programming, only algorithmics), which in my case is more maths/theory than real computing. This is the difference between theoretical and practical computing. But I can still write enough code to generate random files, which is more maths than real programming. To run all the tests, I just write bash scripts running under Linux. I synced a 1 MB random file with 100 sets of differences (from 10 to 1000 differences) and 200 block sizes (from 10 to 2000 bytes) for each set. To count how many bytes went through the network, I used the --stats flag. Finally, CaML programs and a computer algebra system (C.A.S.) process the whole set of results for me.

> On a more pleasant level: you talk of between 1.5 and 2 times more
> efficient. That is rather vague. How much does this difference in
> efficiency affect overall network bandwidth, disk I/O, and memory and CPU
> utilisation? E.g. "changing X decreased bandwidth utilisation by 0.2%
> under ABC conditions." Doubling the efficiency of something that affects
> only 2% of the load (reducing it to 1.1%) on a plentiful resource (CPU)
> might be worthwhile but hardly exciting. Trimming network load by 12%
> would be very interesting to those rsyncing over slow internet links.

I have decided that efficiency is only the speedup rate, which means I only care about network traffic, so between 1.5 and 2 times more efficient is simply "extremely exciting". The main goal of rsync is to save bandwidth, so I think my definition of efficiency is a reasonable one (I know what a very slow link is). I didn't look at CPU, disk I/O (although I broke one of my disks the other day) or memory this time. As I wrote in my last mail, all the details will be online one day (say, by the end of August), and you will see it is quite serious. You can ask me for the complete document right now, but it is not finished and still in French... (PDF, ~50 pages, ~700 kB)

Best regards,
Olivier

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
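A Python sketch of the kind of block-size sweep described in the message above. It assumes a local run, so --no-whole-file is passed to force rsync's delta algorithm; parsing the "Total bytes sent/received" lines is an assumption about the --stats output format, the sweep range is a placeholder rather than the 200 sizes used in the original tests, and the test.old/test.new names refer to the hypothetical generator sketched earlier in the thread.

    import re, shutil, subprocess

    def wire_bytes(src, dst, block_size):
        """Bytes sent plus received for one rsync run, per --stats."""
        out = subprocess.run(
            ["rsync", "--no-whole-file", "--stats", "-B", str(block_size), src, dst],
            capture_output=True, text=True, check=True).stdout
        return sum(int(m.replace(",", "")) for m in
                   re.findall(r"Total bytes (?:sent|received):\s+([\d,]+)", out))

    def sweep(old="test.old", new="test.new", sizes=range(100, 2001, 100)):
        """Try each block size against a fresh copy of the old file and
        report the one that produced the least traffic."""
        results = []
        for b in sizes:
            shutil.copyfile(old, old + ".work")   # start each run from the old version
            results.append((wire_bytes(new, old + ".work", b), b))
        best_bytes, best_b = min(results)
        print("best block size:", best_b, "with", best_bytes, "bytes on the wire")
        return results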
Olivier Lachambre
2002-Jul-02 02:07 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 09:19 01/07/2002 +1000, you wrote:
> [...]
> This relies on the optimal block size being correlated across a set of
> files. I'm not sure that this heuristic actually applies, and I don't know
> how much benefit this would buy for all the complexity it would add.

I think that many clients would not mind multiplying complexity by 2 or 5, even if the speedup rate only improves by a factor of 1.1.

Olivier

_______
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE
e-mail : lachambre@club-internet.fr
Olivier Lachambre
2002-Jul-21 10:55 UTC
Block size optimization - let rsync find the optimal blocksize by itself.
At 12:36 03/07/2002 +1000, you wrote:
> On Tue, Jul 02, 2002 at 11:06:30AM +0200, Olivier Lachambre wrote:
>> At 09:19 01/07/2002 +1000, you wrote:
>>
>>> [...]
>>> This relies on the optimal block size being correlated across a set of
>>> files. I'm not sure that this heuristic actually applies, and I don't
>>> know how much benefit this would buy for all the complexity it would add.
>>
>> I think that many clients would not mind multiplying complexity by 2 or
>> 5, even if the speedup rate only improves by a factor of 1.1.
>
> The only problem is, multiplying complexity by 2 or 5 multiplies the bugs
> by 2 or 5. Are those clients still excited by the 1.1x speedup now?
>
> The other problem with increasing complexity is the "setting" effect that
> complexity causes. The more complex code becomes, the harder it gets to
> modify, so the 1.1x speedup that introduced 5x complexity now makes it
> impossible to implement a much better 1.5x speedup.
>
> In my experience, complex solutions usually have worse performance than
> simple ones anyway.

When talking about complexity I mean complexity as defined in theoretical computer science (at least that is what we call it in French). In that sense, complexity relates to the time the algorithm needs to execute, not to the size of the source code or of the binary. For example:

  for i = 0 to n do
    for j = 0 to n do
      (nothing)
    done
  done

has a complexity proportional to n^2, which is sometimes a lot, but the program is not complex in any way.

At home I would be happy to save 10% of the time spent on the Internet (as local calls are not free), and I don't mind if my computer works two or five times harder, since I have a 56K modem. I suspect that the problem for many users is low bandwidth rather than low CPU resources (I know there are exceptions). I think that the code for the algorithm I proposed would not be complicated; it would simply take more time to run.

Olivier

__________
Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD (F)
<lachambre@club-internet.fr>