Hi Everybody,

I'm new to this list, but I have been using rsync for quite some time.
First, congratulations to the rsync team for a very fine piece of software!

I'm wondering whether rsync could help me to perform the following task:

I have 5 million files on one side of the ocean, 100000 of which must be
copied to the other side. Both numbers grow with time, and occasionally, some
files must be removed from the "to be copied" list (i.e., they must be
deleted on the receiving side, but kept on the sending side). I currently
do this manually, but having rsync doing it would mean that the two archives
could be sync'ed much more regularly.

I tried to use a combination of --include-from=<list of files> --exclude='*',
and it seems to work. However, I have the impression that the algorithm
is far from optimal in this case: There is no usable pattern in the
file names, and I have to list all of them in the "--include-from" file.
rsync therefore makes 5000000 x 100000 comparisons approximately. The building
of the file list is therefore extremely slow (found 8000 files after 2 hours,
i.e. ~24 hours just to build the file list).
[correct me if my understanding of how rsync works is wrong].

I have the impression that the above situation might not be
so uncommon. So, is there another way that I missed in the doc to do that?
What I would be looking for is a parameter:
--file-list=<list of files> (which would override any "--in/exclude").
rsync would only consider these files, and ignore all the other ones,
and also a "--delete-not-in-list" flag which would make all the
files on the receiving side be deleted if they are not in the list.

Of course, if there is another way using current rsync, it would be great!
And sorry if I missed an obvious solution...

Cheers,
Stephan
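To make the cost estimate above concrete, here is a minimal sketch of the two ways a scanned path can be checked against a list of literal names. It is a simplified model and an assumption, not rsync's actual matching code: `select_linear` and `select_with_set` are hypothetical names, and patterns are compared by plain string equality. In this model, every scanned file that is not in the list costs one comparison per list entry, so 5000000 scanned files against 100000 entries gives an upper bound of 5 x 10^11 comparisons. The observed rate of 8000 files in 2 hours extrapolates to 100000 / 8000 x 2 = 25 hours, in line with the "~24 hours" figure.

```python
def select_linear(scanned_paths, include_list):
    """Check each scanned path against the include list one entry at a time.

    Returns the selected paths and the number of comparisons performed.
    """
    comparisons = 0
    selected = []
    for path in scanned_paths:
        for pattern in include_list:
            comparisons += 1
            if path == pattern:  # literal names only, no wildcards
                selected.append(path)
                break
    return selected, comparisons


def select_with_set(scanned_paths, include_list):
    """Same selection, but with one hash lookup per scanned path."""
    wanted = set(include_list)
    return [path for path in scanned_paths if path in wanted]


if __name__ == "__main__":
    scanned = [f"f{i}" for i in range(50)]
    includes = ["f3", "f10"]
    selected, comparisons = select_linear(scanned, includes)
    print(selected, comparisons)
    print(select_with_set(scanned, includes))
```

The set-based variant still has to walk all scanned paths, so it removes only the per-file list scan, not the directory traversal itself.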
On Fri, Jun 07, 2002 at 06:23:32PM -0400, Stephane Paltani wrote:
> Hi Everybody,
>
> I'm new to this list, but I have been using rsync for quite some time.
> First, congratulations to the rsync team for a very fine piece of software!
>
> I'm wondering whether rsync could help me to perform the following task:
>
> I have 5 million files on one side of the ocean, 100000 of which must be
> copied to the other side. Both numbers grow with time, and occasionally, some
> files must be removed from the "to be copied" list (i.e., they must be
> deleted on the receiving side, but kept on the sending side). I currently
> do this manually, but having rsync doing it would mean that the two archives
> could be sync'ed much more regularly.
>
> I tried to use a combination of --include-from=<list of files> --exclude='*',
> and it seems to work. However, I have the impression that the algorithm
> is far from optimal in this case: There is no usable pattern in the
> file names, and I have to list all of them in the "--include-from" file.
> rsync therefore makes 5000000 x 100000 comparisons approximately. The building
> of the file list is therefore extremely slow (found 8000 files after 2 hours,
> i.e. ~24 hours just to build the file list).
> [correct me if my understanding of how rsync works is wrong].
>
> I have the impression that the above situation might not be
> so uncommon. So, is there another way that I missed in the doc to do that?
> What I would be looking for is a parameter:
> --file-list=<list of files> (which would override any "--in/exclude").
> rsync would only consider these files, and ignore all the other ones,
> and also a "--delete-not-in-list" flag which would make all the
> files on the receiving side be deleted if they are not in the list.
>
> Of course, if there is another way using current rsync, it would be great!
> And sorry if I missed an obvious solution...

Sigh, another request for the --files-from I promised to write over 6
months ago, but I've been so overloaded at work lately that I don't know if
I'm ever going to get to it. Perhaps someone else will have to do it.
Someone implemented a version that was part of the way there at
    http://lists.samba.org/pipermail/rsync/2001-November/005272.html
but among other problems it only worked when sending files and not when
receiving files.

It turns out that back in rsync 2.3.2 and earlier there was an optimization
(which I wrote, and which actually was the primary reason that I volunteered
to be maintainer of rsync for a while) that kicked in when there was a list
of includes with no wildcards followed by an --exclude '*', and there was no
--delete. Instead of recursing through the files and doing comparisons, it
would just directly open the files in the include list. It only had to be
on the sending side, so you might want to try 2.3.2 on your sending side to
see if you get a significant performance boost. Andrew Tridgell took it
out in 2.4.0 because he didn't like how it changed the usual semantics of
requiring all parent directories to be explicitly listed in the include
list.

Your --delete-not-in-list suggestion has not been considered before, but
something like that makes sense to me.

- Dave Dykstra
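The shortcut described above can be sketched roughly as follows. This is an illustration only and not the code from rsync 2.3.2: the function name `build_file_list_direct`, its parameters, and the choice to return `None` when the shortcut does not apply are all assumptions. The idea is that when every include entry is a literal name (no `*`, `?` or `[`) and no deletion is requested, each listed name can be looked up directly instead of walking the whole tree and comparing every file against the list.

```python
import os

# Characters that make an include entry a wildcard pattern rather than a name.
WILDCARD_CHARS = set("*?[")


def is_literal(pattern):
    """Return True if the include entry contains no wildcard characters."""
    return not any(ch in WILDCARD_CHARS for ch in pattern)


def build_file_list_direct(root, includes, delete=False):
    """Build the file list by direct lookup of each listed name.

    Returns None when the shortcut does not apply (deletion requested, or
    at least one wildcard entry), so the caller can fall back to a full
    recursive scan. Hypothetical helper, not part of rsync.
    """
    if delete or not all(is_literal(p) for p in includes):
        return None
    found = []
    for name in includes:
        # A leading "/" anchors the entry at the transfer root.
        path = os.path.join(root, name.lstrip("/"))
        if os.path.lexists(path):
            found.append(name)
    return found
```

With this approach the work is proportional to the length of the include list rather than to the number of files in the tree, which is why it matters so much when only 100000 of 5 million files are wanted.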
On Fri, 7 Jun 2002, Stephane Paltani wrote:
> I have 5 million files on one side of the ocean, 100000 of which must
> be copied to the other side.

This is the sort of problem that would benefit from the rsync_xfer.c
program I'm working on (I mentioned an early version on the list a week
or so ago). It allows total control of what gets sent by an external
program, so there's no directory scan and no include/exclude processing.
I could imagine writing a simple perl script that would take a list of
files and turn it into a series of "cput" commands followed by any
needed "del" commands to remove the names that vanished from the list
after the last run.

Unfortunately, the code is still at a very early stage, so it's not yet
ready for use in a production environment.

I've been working on a new version of the program that is able to
transfer trees of files and will also have an improved socket protocol.
It works through the tree incrementally, and thus it shouldn't use as
much memory as the current rsync implementation. After I get the code in
a little better shape, I'm planning to compare its performance with the
current implementation and try to figure out if rsync might best benefit
from adding support for a new (internal) protocol, or if it just needs
some tweaks to the current one.

..wayne..
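The list-to-commands script imagined above could look something like this minimal sketch. It is written in Python rather than perl, and it makes several assumptions: only the command names "cput" and "del" come from the message, while the one-name-per-command syntax, the function name `plan_commands`, and the way the previous run's list is passed in are hypothetical.

```python
def plan_commands(previous_list, current_list):
    """Turn two file lists into transfer commands.

    Emits one "cput" per name in the current list, followed by one "del"
    for every name that was in the previous run's list but has vanished
    from the current one. The command syntax is assumed, not documented.
    """
    current = set(current_list)
    commands = [f"cput {name}" for name in current_list]
    vanished = sorted(name for name in set(previous_list) if name not in current)
    commands.extend(f"del {name}" for name in vanished)
    return commands


if __name__ == "__main__":
    last_run = ["a.dat", "b.dat", "c.dat"]
    this_run = ["b.dat", "c.dat", "d.dat"]
    for command in plan_commands(last_run, this_run):
        print(command)
```

Keeping the previous list around between runs is what makes the deletions cheap here: no scan of the receiving side is needed to find out which names were dropped.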
Just so that we don't forget the lessons from the past, let me point out
that we had discussion and testing done on this subject back in November,
with mixed results (i.e. YMMV):
    http://lists.samba.org/pipermail/rsync/2001-November/005398.html

I think the consensus from that experiment was that implementing the option
using the include/exclude mechanism was not the way to go (correct me if
I'm wrong Dave). Andrew Schorr's patch does this differently but from what
I can tell it would only work when uploading files to a server (which is
the opposite of what my experiments with --files-from were):
    http://lists.samba.org/pipermail/rsync/2001-November/005272.html

Since it seems that different people want this option for different
purposes, we need to make sure that it gets implemented in a sensible way,
with some testing being done to ensure that we still have decent
performance and that it works in all cases (sending/receiving files).

So I think our strategy should be to bug Dave Dykstra until he gives up
and writes the patch :-)

-- Alberto

Stephane Paltani wrote:
> Dave Dykstra wrote:
> >
> > Sigh, another request for the --files-from I promised to write over 6
> > months ago, but I've been so overloaded at work lately that I don't know if
> > I'm ever going to get to it. Perhaps someone else will have to do it.
>
> He he, happy to see a general consensus for this feature!
>
> > It turns out that back in rsync 2.3.2 and earlier there was an optimization
> > (which I wrote and actually was the primary reason that I volunteered to be
> > maintainer of rsync for a while) that kicked in when there was list of
> > includes with no wildcards followed by an --exclude '*', and there was no
> > --delete. Instead of recursing through the files and doing comparisons, it
> > would just directly open the files in the include list. It only had to be
> > on the sending side, you might want to try 2.3.2 on your sending side to
> > see if you get a significant performance boost. Andrew Tridgell took it
> > out in 2.4.0 because he didn't like how it changed the usual semantics of
> > requiring all parent directories to be explicitly listed in the include
> > list.
>
> Whoops! That did the trick for me! It took me 6 minutes to transfer 250 GB!
> Too bad it has been turned down. I have the impression it would satisfy
> most, if not all, "--files-from" lobbyists.

****************************************************************************
Alberto Accomazzi                          mailto:aaccomazzi@cfa.harvard.edu
NASA Astrophysics Data System                      http://adsabs.harvard.edu
Harvard-Smithsonian Center for Astrophysics        http://cfawww.harvard.edu
60 Garden Street, MS 83, Cambridge, MA 02138 USA
****************************************************************************