Good day,

I've got a question regarding the usage of rsync that I just cannot figure out. I've done a fair hunt for the answer, but I'm stumped.

Here is the situation. I have two PCs running Linux and use rsync to perform a backup from server1 to server2. For example:

    rsync -avzr -e 'ssh -i/root/.ssh/id_rsa' --delete /home/samba/admin/software www.some-server.com:/home/RemoteSystems/company/home/samba/admin

Let's say I have a directory within rsync's scope called Directory1. Rsync is run and Directory1 is synced from server1 to server2. A file named File1 is also synced because it is in the directory being synced.

    server1         server2
    Directory1      Directory1
      File1           File1

Now, let's say a user comes and changes the name of Directory1 on server1 to DirectoryNew. rsync performs the following actions:

1. rsync recognizes that Directory1 is not on server1, but it is on server2, so it flags it and its contents for deletion on server2.
2. rsync recognizes that DirectoryNew is on server1, but not on server2, so it flags it and its contents for copying to server2.
3. rsync performs these actions to make the two directories the same.

This is the simplest method of performing an rsync, but it would be nice for rsync to be intelligent enough to recognize a name change without an inode change on the source. The actions performed would then be:

1. rsync recognizes that Directory1 is not on server1, but its inode still is. Rsync reads the new directory name and flags the name change from Directory1 to DirectoryNew on server1.
2. Rsync reads server2, sees that Directory1 exists, and flags a pending name change on server2 from Directory1 to DirectoryNew.
3. The name is changed on server2. No files or directories are deleted and re-transferred from source to destination, as the structure under the directory has not changed.

Why go through all this work?
I've had personnel rename a directory holding several gigabytes of data without notifying me, and at night rsync tries to perform the directory-and-file dance and fails simply because the volume is so great. It would be nice to either recognize a large discrepancy between the source and destination before anything occurs, by reporting the number of bytes that would potentially be transferred (this doesn't work with the dry-run option), or do the fancy dance of recognizing a name change rather than a deletion of a directory.

Thanks.

Frank Thomas
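Editorial sketch of the "warn before a huge transfer" wish. Frank reports that dry-run did not give him this figure. On rsync builds where --dry-run combined with --stats prints a "Total transferred file size" line, a small wrapper could read that number and refuse to start the real run above some limit. The function names and the idea of a byte limit are assumptions made for illustration, not part of rsync, and a remote destination would still need the -e 'ssh ...' option from the command above.

```python
import re
import subprocess

# Line printed by `rsync --stats`; newer releases add digit separators.
_STATS_RE = re.compile(
    r"^Total transferred file size: ([\d,.]+) bytes", re.MULTILINE
)


def parse_transfer_bytes(stats_output: str) -> int:
    """Extract the would-be transfer size from rsync --stats output."""
    match = _STATS_RE.search(stats_output)
    if match is None:
        raise ValueError("no 'Total transferred file size' line found")
    # Drop digit separators such as "," before converting.
    return int(re.sub(r"\D", "", match.group(1)))


def estimate_transfer_bytes(src: str, dest: str) -> int:
    """Run rsync in dry-run mode (nothing is changed) and read its estimate."""
    result = subprocess.run(
        ["rsync", "-a", "--delete", "--dry-run", "--stats", src, dest],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_transfer_bytes(result.stdout)


def safe_to_sync(src: str, dest: str, limit_bytes: int) -> bool:
    """Hypothetical guard: True if the estimate is within limit_bytes."""
    return estimate_transfer_bytes(src, dest) <= limit_bytes
```

A nightly job could call safe_to_sync() first and send a notification instead of starting the transfer when it returns False.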
On Thu 04 Oct 2007, Frank Thomas wrote:
>
> 1. rsync recognizes that Directory1 is not on server1,
> but it's inode still is. Rsync reads the new directory name and flags
> the name change from Directory1 to DirectoryNew on server1.

The problem here is that rsync is stateless; i.e. it can't recognize that the inode is still there, because it has no idea the inode was ever there. To accommodate that, a major redesign would probably be needed.

Paul Slootman
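Editorial sketch of the state rsync does not keep. An external helper could record an inode-to-path snapshot of the source after each run and compare it with the next snapshot to spot renamed directories. This is an illustration only, not how rsync works; the function names are hypothetical, and it assumes a single filesystem where inode numbers are not reused between the two snapshots.

```python
import json
import os


def snapshot_dirs(root: str) -> dict:
    """Map inode number (as a string, so it survives JSON) -> relative path."""
    snap = {}
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # a symlink has its own inode; ignore it here
            snap[str(os.lstat(full).st_ino)] = os.path.relpath(full, root)
    return snap


def save_snapshot(snap: dict, path: str) -> None:
    """Persist the snapshot: this file is the 'state' between runs."""
    with open(path, "w") as fh:
        json.dump(snap, fh)


def load_snapshot(path: str) -> dict:
    with open(path) as fh:
        return json.load(fh)


def detect_renames(old: dict, new: dict) -> list:
    """Return (old_path, new_path) for inodes whose path changed."""
    changed = [
        (old_path, new[inode])
        for inode, old_path in old.items()
        if inode in new and new[inode] != old_path
    ]
    # Report only the top-most rename; children follow from their parent.
    changed.sort(key=lambda pair: pair[0].count(os.sep))
    top = []
    for old_path, new_path in changed:
        explained = any(
            old_path.startswith(o + os.sep) and new_path == n + old_path[len(o):]
            for o, n in top
        )
        if not explained:
            top.append((old_path, new_path))
    return top
```

Each reported pair could then be replayed as a plain rename on server2 before the normal rsync run, so that the run finds the data already in place.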
Frank Thomas, on 10/4/2007 3:57 PM, said the following:
> it would be nice to have rsync to be intelligent enough to recognize
> a name change but not an inode change on the source.

Seems to me the best way to accomplish this is to be sure that the parent directory is not a directory that someone can rename... i.e., when I rsync our home directories, there is no danger of anyone ever renaming the 'home' directory...

So, just put the top level directories that *can* be renamed by the users into a parent directory that they can *not* rename, and use that for the root for your rsync...

Or maybe I'm missing something?

--
Best regards,
Charles
On Thu, Oct 04, 2007 at 01:57:22PM -0600, Frank Thomas wrote:
> This action is the simplest method of performing an rsync, but it
> would be nice to have rsync to be intelligent enough to recognize a
> name change but not an inode change on the source.

For the next "feature release" of rsync after 3.0.0, I'm imagining adding support for a database API that would allow extra information about files to be maintained and used (completely optional, of course). In the scenario that you described, this new rsync would be run using a DB cache and the --checksum option, which would look up files on the sending side by their inode + size + mtime + ctime (making the checksum lookup efficient). The receiving side would look up incoming files based on their checksum + size + mtime + ctime in order to find a local file that it could use for hard-linking, copying, and/or renaming.

In the meantime, the only rsync solution is the detect-renamed patch that Matt mentioned.

..wayne..
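Editorial sketch of the sender-side lookup Wayne describes. A checksum cache keyed by inode + size + mtime + ctime lets a file be rehashed only when one of those values changes. This is not rsync code: the class name is hypothetical, a dict stands in for the database, and SHA-256 is an arbitrary choice of checksum.

```python
import hashlib
import os


class ChecksumCache:
    """In-memory stand-in for the proposed DB cache (hypothetical)."""

    def __init__(self):
        self._by_stat = {}
        self.misses = 0  # number of times a file had to be read and hashed

    def checksum(self, path: str) -> str:
        st = os.stat(path)
        # The key never mentions the path, so a file under a renamed
        # directory can still be found by its unchanged stat values.
        key = (st.st_ino, st.st_size, st.st_mtime_ns, st.st_ctime_ns)
        cached = self._by_stat.get(key)
        if cached is not None:
            return cached
        self.misses += 1
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        self._by_stat[key] = digest.hexdigest()
        return self._by_stat[key]
```

The receiving side would need the complementary index, from checksum to an existing local path, to find a file it can hard-link, copy, or rename.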
N.J. van der Horn (Nico)
2007-Oct-05 22:06 UTC
Renaming a directory results in an expensive retransmission
We have been using rsync for several years, but for the past couple of months we have used it to back up remote servers, some with more than 200GB capacity. Windows users in particular sometimes have the (bad) habit of renaming a directory with huge amounts of data below it. We see the same nasty results as you describe:

* rsync "thinks" that the old directory name has disappeared, and deletes the directory on the target machine, throwing away the expensive transmission
* the new directory name initiates a fresh / full (re)transmission, sometimes taking days... while the "real work" would be done in minutes...
* the servers we back up have between 20GB and 200GB capacity.
* all rsyncs are run in parallel; average sync time is 1.5 hours for 900GB.
* when a "user" behaves as described, it takes days to a week to resync.

It is a tricky problem to deal with, I think. It is tempting to keep a checksummed file/directory list on both sides with information like:

* a fingerprint/signature/checksum to identify each file or directory
* inode number
* timestamp
* file size

In case a file appears to be deleted because its name/path has changed, it could possibly be identified by its fingerprint and used to sync cleverly ;-) This could be thought of as expanding --fuzzy, giving it more functionality (hint).

For some time I have been experimenting with a solution to this problem: some sort of "preprocessor" that tries to identify files in the described way and creates hardlinks (ln) to let rsync think the files are already in the new location. I am traversing the directory trees on both sides (remote and local), producing a file with the information described above, but it is still work in progress...

The cost of keeping a database in this scenario would be truly justified for me. That rsync then deletes the files in the old location is no longer a problem for me.

But... I am just a user with needs...
looking for a solution to this problem too, and hoping it can be solved by the clever developers ;-) Maybe there is already a solution available, and we are chasing shadows?

Thanks,
Nico

--
Handled by: N.J. van der Horn (Nico)
ICT Support Vanderhorn IT-works, www.vanderhorn.nl,
Voorstraat 55, 3135 HW Vlaardingen, The Netherlands,
Tel +31 10 2486060, Fax +31 10 2486061
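Editorial sketch of the "preprocessor" idea from this message. It fingerprints the files on both sides and, on the destination, hard-links an existing file to any path the source now uses but the destination lacks. This is not Nico's actual tool: the function names are hypothetical, SHA-256 is an arbitrary fingerprint, and it assumes the source index has already been copied to the destination host.

```python
import hashlib
import os


def fingerprint(path: str) -> str:
    """Content fingerprint of one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root: str) -> dict:
    """Map relative path -> fingerprint for every regular file under root."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            index[os.path.relpath(full, root)] = fingerprint(full)
    return index


def prelink_renamed(source_index: dict, dest_root: str) -> list:
    """Hard-link known content to the new paths; return the paths linked."""
    by_print = {}
    for rel, fp in sorted(index_tree(dest_root).items()):
        by_print.setdefault(fp, rel)
    linked = []
    for rel, fp in sorted(source_index.items()):
        target = os.path.join(dest_root, rel)
        if os.path.lexists(target) or fp not in by_print:
            continue  # already present, or content really is new
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(os.path.join(dest_root, by_print[fp]), target)
        linked.append(rel)
    return linked
```

Because a hard link shares the size and mtime of the old destination file, a later rsync run with its default size-and-mtime check would be expected to skip those files, provided the earlier runs preserved mtimes (for example with -a), while --delete removes the old names.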