Alberto Accomazzi
2004-May-17 14:15 UTC
batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]
Chris,

to put things in the right perspective, you should read (if you haven't done so already) the original paper describing the design behind batch mode. The design and implementation of this functionality goes back to a project called the Internet2 Distributed Storage Infrastructure (I2-DSI). As part of that project, the authors created a modified version of rsync (called rsync+) which had the capability of creating these batch sets for mirroring. Here are a couple of URLs describing the ideas and motivation behind it:

http://www.ils.unc.edu/i2dsi/unc_rsync+.html
http://www.ils.unc.edu/ils/research/reports/TR-1999-01.pdf

Chris Shoemaker wrote:
> Yes, I think you're right about the original design. And I guess we'd
> want to preserve that capability. Or would we?
> I'm having a little trouble seeing why this was the intended
> use. I figure, there are three cases:
>
> A) If you have access to both source and dest, it doesn't really matter too
> much who writes the batch -- this is like the local copy case.
> B) If you have access to the dest but not the source, then you need the
> client to write the batch -- and it's not far-fetched that you might have
> other copies of dest to update.
> C) However, having access to source but not dest is the only case that
> _requires_ the sender to write the batch -- now what's the chance that you'll
> have another identical dest to apply the batch to? And if you did, why
> wouldn't you generate the batch on that dest as in case A, above?
>
> So, it seems to me that it's much more useful to have the receiver/client
> write the batch than sender/client, or receiver/server, or sender/server.
> But, maybe I'm just not appreciating what the potential uses of batch-mode
> are.
>
> Survey: so who uses batch-mode and what for?

I haven't used the feature but back when I read the docs on rsync+ I thought it was a clever way to do multicasting on the cheap.
I think the only scenario where batch mode makes sense is when you need to distribute updates from a particular archive to a (large) number of mirror sites and you have tight control on the state of both client and server (so that you know exactly what needs to be updated on the mirror sites). This ensures that you can create a set of batch files that contain *all* the changes necessary for updating each mirror site.

So basically I would use batch mode if I had a situation in which:

1) all mirror sites have the same set of files
2) rsync is invoked from each mirror site in exactly the same way (i.e. same command-line options) to pull data from a master server

then instead of having N sites invoke rsync against the same archive, I would invoke it once, make it write out a set of batch files, then transfer the batch files to each client and run rsync locally using the batch set. The advantage of this is that the server only performs its computations once. An example of this usage would be using rsync to upgrade a linux distribution, say going from FC 1 to FC 2. All files from each distribution are frozen, so you should be able to create a single batch which incorporates all the changes and then apply that on each site carrying the distro.

The question of whether the batch files should be on the client or server side is not easy to answer and in the end depends on exactly what you're trying to do. In general, I would say that since the contents of the batch mode depend on the status of both client and server, there is not a "natural" location for it.

-- Alberto

********************************************************************
Alberto Accomazzi aaccomazzi(at)cfa harvard edu
NASA Astrophysics Data System ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics www.cfa.harvard.edu
60 Garden St, MS 31, Cambridge, MA 02138, USA
********************************************************************
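The precondition described above (every mirror starts in the same state, and the batch is computed once and replayed everywhere) can be illustrated with a toy model. This is a minimal Python sketch, not rsync's batch format: trees are plain dicts of path to content, and `make_batch`, `apply_batch`, and the `expects` snapshot are hypothetical names invented for the illustration. The explicit snapshot check is an assumption of the sketch, not a description of how rsync detects a mismatched destination.

```python
# Toy model only: a "tree" is a dict mapping path -> file content.
# This is NOT rsync's batch format; it only makes the precondition visible.


def make_batch(source, dest):
    """Record the changes that turn `dest` into `source`."""
    return {
        # Hypothetical: remember the state the batch was computed against.
        "expects": dict(dest),
        # Files that are new or whose content differs.
        "put": {p: c for p, c in source.items() if dest.get(p) != c},
        # Files present on the destination but gone from the source.
        "delete": sorted(p for p in dest if p not in source),
    }


def apply_batch(batch, mirror):
    """Replay the batch on a mirror; refuse if the mirror has drifted."""
    if mirror != batch["expects"]:
        raise ValueError("mirror is not in the state the batch expects")
    updated = dict(mirror)
    updated.update(batch["put"])
    for path in batch["delete"]:
        del updated[path]
    return updated


master = {"a.rpm": "v2", "b.rpm": "v1", "c.rpm": "new"}
mirror_1 = {"a.rpm": "v1", "b.rpm": "v1", "old.rpm": "v1"}
mirror_2 = dict(mirror_1)    # identical mirror: the batch applies cleanly
mirror_3 = {"a.rpm": "v1"}   # drifted mirror: the batch must not be used

# The expensive comparison happens once, against a single mirror.
batch = make_batch(master, mirror_1)
```

Replaying `batch` on `mirror_2` yields a copy of `master`, while `mirror_3` is rejected. The batch is a function of both the source and one particular destination state, which is why it has no obviously "natural" home on either side.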
Chris Shoemaker
2004-May-18 02:13 UTC
batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]
On Mon, May 17, 2004 at 10:15:23AM -0400, Alberto Accomazzi wrote:
>
> Chris,
>
> to put things in the right perspective, you should read (if you haven't
> done so already) the original paper describing the design behind batch
> mode. The design and implementation of this functionality goes back to
> a project called the Internet2 Distributed Storage Infrastructure
> (I2-DSI). As part of that project, the authors created a modified
> version of rsync (called rsync+) which had the capability of creating
> these batch sets for mirroring. Here are a couple of URLs describing
> the ideas and motivation behind it:
> http://www.ils.unc.edu/i2dsi/unc_rsync+.html
> http://www.ils.unc.edu/ils/research/reports/TR-1999-01.pdf

Ah, thank you. I had seen the first, but not the second. It was an interesting read, and it explains a lot. I see now why the write-batch hooks are in the _sender_ paths. This seems a reasonable design decision when the intention is to replicate changes to many remote copies. I can see some justification for wanting write-batch functionality with both sender and receiver. However, several things in the report seem to confirm my growing opinion that, if it has to be in only one, receiver is sufficient, while sender is not.

> > use. I figure, there are three cases:
> >
> > A) If you have access to both source and dest, it doesn't really matter
> > too much who writes the batch -- this is like the local copy case.
> > B) If you have access to the dest but not the source, then you need the
> > client to write the batch -- and it's not far-fetched that you might have
> > other copies of dest to update.
> > C) However, having access to source but not dest is the only case that
> > _requires_ the sender to write the batch -- now what's the chance that
> > you'll have another identical dest to apply the batch to? And if you did,
> > why wouldn't you generate the batch on that dest as in case A, above?
> >
> > So, it seems to me that it's much more useful to have the
> > receiver/client write the batch than sender/client, or receiver/server, or
> > sender/server. But, maybe I'm just not appreciating what the potential
> > uses of batch-mode are.
> >
> > Survey: so who uses batch-mode and what for?
>
> I haven't used the feature but back when I read the docs on rsync+ I
> thought it was a clever way to do multicasting on the cheap. I think
> the only scenario where batch mode makes sense is when you need to
> distribute updates from a particular archive to a (large) number of
> mirror sites and you have tight control on the state of both client and
> server (so that you know exactly what needs to be updated on the mirror
> sites). This ensures that you can create a set of batch files that
> contain *all* the changes necessary for updating each mirror site.
>
> So basically I would use batch mode if I had a situation in which:
>
> 1) all mirror sites have the same set of files
> 2) rsync is invoked from each mirror site in exactly the same way (i.e.
> same command-line options) to pull data from a master server
>
> then instead of having N sites invoke rsync against the same archive, I
> would invoke it once, make it write out a set of batch files, then
> transfer the batch files to each client and run rsync locally using the
> batch set. The advantage of this is that the server only performs its
> computations once. An example of this usage would be using rsync to
> upgrade a linux distribution, say going from FC 1 to FC 2. All files
> from each distribution are frozen, so you should be able to create a
> single batch which incorporates all the changes and then apply that on
> each site carrying the distro.

Indeed, what you describe seems to have been the design motivation. I can share what my desired application is: I want to create a mirror of a public server onto my local machine, which is physically disconnected from the Internet, and keep it current.
So, I intend to first rsync update my own copy which _is_ networked while creating the batch set. Then I can sneakernet the batch set to the unnetworked machine and use rsync --read-batch to update it. This keeps the batch sets smallish even though the mirror is largish.

> The question of whether the batch files should be on the client or
> server side is not easy to answer and in the end depends on exactly what
> you're trying to do. In general, I would say that since the contents of
> the batch mode depend on the status of both client and server, there is
> not a "natural" location for it.

While I agree there is some symmetry in the _origin_ of the batch set that would suggest that there is no natural location for it, I think the _intended use_ of the batch set strongly suggests that it will usually belong with the _receiver_ (irrespective of client/server). Specifically, the batch set is only useful for other receivers that are identical to the original receiver. The "knowledge" or "memory" of that exact state is more likely to reside with the receiver (who just left that state) than with the sender (who may never have been in that state). Therefore it is more likely to be useful to the receiver than to the sender.

Consider that even in the report's example of pushing the batch sets out to multiple mirrors, the authors recommend creating the batch set while updating a "near" or local copy to reduce network load. So, even when the initiator has full control over the replication _source_, the act of creating a batch set presumes such a degree of knowledge of, interest in, and control over, the _destination_, that creating batch sets at the destination (receiver) is not inappropriate.

Of course, the clincher is a case such as mine, where I have _no_ control over or access to the sender/server. I am only a client/receiver of a public anonymous rsyncd server, and the batch set I create is obviously only useful to me, so I'd like my receiver to create it.
It's probably a good thing that --write-batch crashes the server's sender child. The mirror maintainers would probably be annoyed if I were filling their server hard drives with my batch sets. :-)

All that said, I have no intention of removing the write-batch hooks from the sender paths. I figure, let the tool do what the tool does, and someone smarter than I will figure out what to use it for. However, IMHO, other than for pure local updates, batch-mode is pretty close to useless unless the receiver can write the batch set, and if I can pull it off, I will provide a patch that does that.

BTW, there is a work-around. If you don't mind duplicating the mirror twice, one solution is to do a regular (no --write-batch) rsync update of one copy of the mirror, and then do the --write-batch during a local-to-local rsync update of another copy of the mirror. Actually, this has some real advantages if your network connection is unreliable.

Thanks for your input.

-Chris

>
> -- Alberto
>
> ********************************************************************
> Alberto Accomazzi aaccomazzi(at)cfa harvard edu
> NASA Astrophysics Data System ads.harvard.edu
> Harvard-Smithsonian Center for Astrophysics www.cfa.harvard.edu
> 60 Garden St, MS 31, Cambridge, MA 02138, USA
> ********************************************************************
>
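The work-around above, combined with the sneakernet step, can be sketched as three rsync invocations. The following Python snippet only builds the command lines and does not run them. The server URL, directory names, and batch file name are hypothetical placeholders, and the option spelling (`--write-batch=FILE`, `--read-batch=FILE`) follows the current rsync man page, so releases from the time of this thread may differ in detail. It also assumes the second networked copy and the offline copy are identical before each round, which is what makes the batch reusable.

```python
# Builds (does not execute) rsync command lines for the two-copy work-around.
# All hosts, paths, and file names below are hypothetical placeholders.


def networked_update(remote, copy_a):
    """Step 1: plain update of the first local copy from the public server."""
    return ["rsync", "-a", remote, copy_a]


def local_write_batch(copy_a, copy_b, batch):
    """Step 2: local-to-local update of the second copy, recording a batch."""
    return ["rsync", "-a", f"--write-batch={batch}", copy_a, copy_b]


def offline_read_batch(batch, offline_copy):
    """Step 3: on the disconnected machine, replay the recorded batch."""
    return ["rsync", "-a", f"--read-batch={batch}", offline_copy]


steps = [
    networked_update("rsync://mirror.example.org/pub/", "/srv/copy_a/"),
    local_write_batch("/srv/copy_a/", "/srv/copy_b/", "update.batch"),
    offline_read_batch("update.batch", "/srv/offline_mirror/"),
]

for cmd in steps:
    print(" ".join(cmd))
```

Only step 1 touches the network, so an interrupted transfer can be retried without affecting the batch. Step 2 runs entirely on local disks, where the receiver's location no longer matters.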