Hello,

I'm using rsync to sync large virtual machine files from one ESX server to another. rsync is running inside the so-called "ESX console", which is basically a specially crafted Linux VM with some restrictions. The speed is reasonable, but I guess it's not the optimum - at least I don't know where the bottleneck is.

I'm not using ssh as transport but run rsync in daemon mode on the target, which speeds things up when large amounts of data go over the wire.

I have read that rsync is not very efficient with ultra-large files (I'm syncing files of up to 80 GB).

Regarding the bottleneck: neither CPU, network nor disk is at its limit, on either the source or the destination system. I don't see 100% CPU, 100% network or 100% disk I/O usage.

Furthermore, I wonder: isn't rsync just too intelligent for this kind of transfer? The position of the data inside these files (they contain harddisk images) won't really change, i.e. we don't need to check for relocation of data; we only need to know whether something changed inside a block of size "x", and if it did, we could transfer that whole block again. So I wonder whether we need a "rolling checksum" at all. Wouldn't checksums over fixed block sizes be sufficient for this task?

regards
roland
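For illustration, a minimal sketch of the daemon-mode setup described above; the module name, paths and host are made-up placeholders, not taken from the post:

# on the target: a bare-bones rsyncd.conf plus daemon start
cat > /etc/rsyncd.conf <<'EOF'
[vms]
    path = /vmfs/volumes/datastore1/backup
    read only = false
    uid = root
    gid = root
EOF
rsync --daemon

# on the source: push a VM directory to the daemon module
# (the double colon selects the rsync protocol instead of ssh)
rsync -av /vmfs/volumes/datastore1/myvm/ target-host::vms/myvm/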
One way I've been trying to speed up rsync may not apply in every situation. In my situation, when files change they usually change completely. This is especially true for large files, so the rsync delta algorithm does me no good, and I've been using the "-W" flag (e.g. rsync -avzW) to turn it off.

I don't know objectively how much difference this makes, but it seems reasonable. Comments?

--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
jlforrest at berkeley.edu
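For reference, what the whole-file variant looks like on the wire; the paths and host below are placeholders. -W / --whole-file simply disables the delta-transfer algorithm:

# copy changed files in full instead of computing block deltas
rsync -avzW /vmfs/volumes/datastore1/myvm/ target-host::vms/myvm/

# -z compresses the stream; on large disk images the compression can cost
# more CPU than the bandwidth it saves, so it may be worth testing both ways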
So, instead of 500 MB I would transfer 100 GB over the network. That's no option. Besides that, for transferring complete files I know faster methods than rsync.

One more question: how safe is transferring a 100 GB file? Since rsync uses checksums internally to compare the contents of two files, how can I calculate the risk of the two files NOT being perfectly in sync after the rsync run? I assume there IS a risk, just as there is a risk that two files may have the same MD5 checksum by chance...

regards
roland

> Jon Forrest <jlforrest () berkeley ! edu> wrote:
>
> One way I've been trying to speed up rsync may not apply in every
> situation. In my situation, when files change they usually change
> completely. This is especially true for large files, so the rsync
> algorithm does me no good, and I've been using the "-W" flag
> (e.g. rsync -avzW) to turn it off.
>
> I don't know objectively how much difference this makes, but it
> seems reasonable. Comments?
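As a rough back-of-envelope for the collision question (not rsync's actual failure analysis, which as far as I know layers a 32-bit rolling checksum, per-block strong checksums and a whole-file verification pass), the birthday bound gives the order of magnitude for accidental collisions of a 128-bit checksum such as MD5. The file size and block size below are assumptions:

# probability that any two of n random 128-bit checksums collide is
# roughly n^2 / 2^129 (birthday approximation)
awk 'BEGIN {
    filesize  = 100 * 2^30          # 100 GB file (assumption)
    blocksize = 4 * 2^20            # 4 MiB blocks (assumption)
    n = filesize / blocksize        # 25600 blocks
    p = n^2 / 2^129
    printf "blocks = %d, P(accidental collision) ~ %.3g\n", n, p
}'
# prints a probability on the order of 1e-30, i.e. vanishingly small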
> devzero at web.de wrote:
> > So, instead of 500 MB I would transfer 100 GB over the network.
> > That's no option.
>
> I don't see how you came up with such numbers.
> If files change completely then I don't see why
> you would transfer more (or less) over the network.
> The difference that I'm thinking of is that
> by not using the rsync algorithm you're
> substantially reducing the number of disk I/Os.

Let me explain: all the files are HUGE data files of constant size. They are harddisk images, and only their contents change, i.e. specific blocks inside the files are accessed and rewritten. So the question is: is rsync's rolling-checksum algorithm the perfect (i.e. fastest) algorithm to match changed blocks at fixed locations between source and destination files? I'm not sure, because I have no in-depth knowledge of the mathematical background of the rsync algorithm. I assume: no - but it's only a guess...

> The reason I say this, and I could be wrong since
> I'm no rsync algorithm expert, is because when the
> local version of a file and the remote version of
> a file are completely different, and the rsync
> algorithm is being used, the amount of I/O
> that must be done consists of the I/Os that
> compare the two files, plus the actual transfer
> of the bits from the source file to the destination
> file. Please correct this thinking if it's wrong.

Yes, that's correct. But what I'm unsure about is whether rsync isn't doing too much work detecting the differences. It doesn't need to "look back and forth" (as I read somewhere it would); it just needs to check whether block 1 in file A differs from block 1 in file B. A rather simple comparison, with no need for complex math or any real "intelligence" to detect relocation of data. See this post:
http://www.mail-archive.com/backuppc-users at lists.sourceforge.net/msg08998.html

> > Besides that, for transferring complete files I know faster methods than rsync.
>
> Maybe so (I'd like to hear what you're referring to) but one reason
> I like to use rsync is that using the '-avzW' flags
> results in a perfect mirror on the destination, which is
> my goal. Do your faster methods have a way of doing that?

No, I have no faster replacement that is as good at perfect mirroring as rsync, but there are faster methods for transferring files. Here is one example:
http://communities.vmware.com/thread/29721

> > One more question: how safe is transferring a 100 GB file? Since rsync
> > is using checksums internally to compare the contents of two files,
> > how can I calculate the risk of the two files NOT being perfectly in
> > sync after the rsync run?
>
> Assuming the rsync algorithm works correctly, I don't
> see any difference between the end result of copying
> a 100gb file with the rsync algorithm or without it.
> The only difference is the amount of disk and network
> I/O that must occur.

The rsync algorithm uses checksumming to find the differences. Checksums are a sort of "data reduction" that creates a small hash from a larger amount of data. I just want to understand what makes sure there are no hash collisions that break the algorithm. Mind that rsync has existed for quite some time, and in that time the file sizes transferred with rsync may have grown by a factor of 100 or even 1000.

regards
roland
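To make the fixed-block idea above concrete, a minimal sketch, assuming the image is reachable as a plain file; the script name, paths and the 4 MiB block size are illustrative only, not an existing tool:

#!/bin/sh
# blockmd5.sh (hypothetical) - print one MD5 per fixed-size block of a file
IMG=$1                        # e.g. /vmfs/volumes/datastore1/myvm/myvm-flat.vmdk
BS=$((4 * 1024 * 1024))       # fixed block size "x" (assumption: 4 MiB)
SIZE=$(stat -c %s "$IMG")
BLOCKS=$(( (SIZE + BS - 1) / BS ))
i=0
while [ $i -lt $BLOCKS ]; do
    printf '%08d ' $i
    # read exactly one block at offset i*BS and hash it
    dd if="$IMG" bs=$BS skip=$i count=1 2>/dev/null | md5sum | awk '{print $1}'
    i=$((i + 1))
done

Run it against the same image on both hosts, redirect the output to a file on each side, and diff the two files; the differing lines name exactly the blocks that would have to be re-copied.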
> I really don't think it's a good idea to sync large data files that are
> in use and modified frequently, e.g. SQL database files or VMware image
> files. rsync does not have an algorithm to keep such frequently modified
> data files in sync with the source file, and this can leave the copy
> corrupted. If I'm wrong, please correct me. Thanks.

They are not in use, as I take a snapshot before running rsync, so the files won't change during the transfer. In other words, I'm making a sort of "crash consistent" copy.

roland
On Thu, Aug 06, 2009 at 08:15:39PM +0200, devzero at web.de wrote:
> I have read that rsync is not very efficient with ultra-large files
> (I'm syncing files of up to 80 GB).

Things to try:

- Be sure you're using rsync 3.x, as it has a better hash algorithm for
  the large numbers of checksum blocks that need to be scanned on the
  sending side.

- The --inplace option might help, since it can reduce the amount of
  write I/O when the file is being modified (though it does reduce the
  amount of backward matching). In a really large file where most of the
  data stays the same, this could be a big win.

- Try setting the --block-size option. This will only help if the block
  size is so large that it is failing to find matching data. In a huge
  file that is mostly unchanged, this may not be an issue. Note that
  decreasing the block size increases the amount of checksum data and
  the number of blocks in the matching algorithm.

- The best thing you could do would be to mount the virtual drives
  (source read-only, dest read/write) and copy within the file systems.
  That would allow rsync to use its size+mtime fast-check to skip most
  of the files. It would not, however, result in truly identical disk
  images, so it may not be a solution for you.

Keep in mind that the checksumming as it currently works requires the
receiving side to read the whole file (sending its checksums), and then
(after that is done) the sending side reads the whole file (generating
differences), which allows the receiving side to reconstruct the file
while the sender is sending in the changes. Sadly, this means that the
transfer serializes this file-reading time (since the sender wants to be
able to find moved blocks from anywhere in the file).

An interesting new option might be one that tells the sender to
immediately start comparing the received checksums to the source file,
and only check whether the data matches (with no movement) or whether it
needs to send the changed data (i.e. this would skip scanning for moved
data). For mostly unchanged, large files, that would allow concurrent
reading of the receiving and sending files. Combined with --inplace,
this might be a pretty large speedup for mostly-unchanged files.

..wayne..
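Putting the first three suggestions above into one concrete command line; the paths, module name and the 128 KiB block size are placeholders for illustration, not values from the post:

# assumes rsync >= 3.0 on both ends and a daemon module like the one
# sketched earlier in the thread; --inplace rewrites changed blocks inside
# the existing destination file, --block-size sets the checksum block size
# (131072 bytes = 128 KiB is only a guess, to be tuned by measurement)
rsync -av --inplace --block-size=131072 \
    /vmfs/volumes/datastore1/myvm/ target-host::vms/myvm/

# check the installed version on both sides first
rsync --version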