On Sun, Nov 09, 2003 at 05:09:42PM -0500, Cedric Puddy wrote:
> Hi there,
>
> I'm using rsync with some large trees of files (on one
> disk, we have 30M files, for example, and we might
> be copying, say, 500k files in one tree. The file trees
> are reasonably balanced -- no single directory has thousands
> of files in it, for example. Our file system, at the moment,
> is ext3. We are very comfortable with it, and are hesitant
> to switch away from it, though JFS or Reiserfs could be
> persuasive if people's experience strongly suggests that
> they would help. My guess is that because the tree is
> reasonably balanced, changing filesystems isn't going
> to have a major effect on how big a bottleneck the filesystem
> may be.)
>
> ANYWAY, the point is, as you've guessed, that I hate having
> to wait 20 or 30 minutes in order to have a transfer start
> (even when I'm copying to a location that doesn't even
> have anything there yet, thus, no possibilities of deltas
> to figure out).
>
> I've never really asked about this because my assumption has
> always been that it takes that long, because it simply takes
> that long to scan the disks, populate rsync's data structures,
> and get the show on the road, and that if I want it quicker,
> then I can darn well get faster disks, etc.
>
> (a) is that assumption correct? Or am I missing anything?
Mostly. Cache is an issue. Pre-populating the inode and
dentry caches may help but would itself take time. 30
million files is too much to expect any kind of atomicity.
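
For what it's worth, pre-populating those caches amounts to
stat'ing everything once before rsync runs. A minimal sketch in
Python, assuming a hypothetical helper named warm_metadata_cache
(it is not part of rsync), and assuming the kernel has enough
memory to keep the entries cached afterwards:

```python
import os

def warm_metadata_cache(root):
    """Walk `root` and stat every entry so the kernel loads its inode and dentry.

    Hypothetical helper for illustration only. Returns the number of
    entries visited. Unreadable directories and entries are skipped.
    """
    visited = 0
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    visited += 1
                    try:
                        # Does not follow symlinks; on Linux this is an lstat().
                        entry.stat(follow_symlinks=False)
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                    except OSError:
                        continue
        except OSError:
            continue
    return visited
```

This walk costs roughly the same disk seeks as rsync's own scan,
so it only pays off if it can run ahead of time.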
> (b) for those of you who understand rsync internals better
> than I (eg: anyone at all who's done anything with the
> code :P) Is there any possibility of rsync-in-daemon
> mode being able to leverage the File Alteration Monitor
> (FAM) efforts in order to cheaply maintain a more-or-less
> up to the moment map of the trees it is exporting?
> (I have reservations about this, because I seem to recall
> understanding that FAM was *not* designed to watch
> *vast* huge portions of huge filesystems -- more that
> it was designed for monitoring specific resources.)
No chance in mainline.
> For that matter, is this not the sort of thing that
> ReiserFS, with its evolution towards a pluggable
> architecture, might be perfect for?
>
> (c) I assume that it would be folly (eg: something that complicates
> the problem space substantially) to try and write something
> that simply started copying, and built the map as it
> went along, or in the background (though I could see
> this as being very interesting for situations where one's
> network was *much* slower than one's disks).
>
> One of the reasons I ask is that I've often come across rsync
> being used as a sort of lazy filesystem mirroring tool, the
> point being to make a sync with a remote filesystem every,
> say, 10 minutes. Which is fine, until the file tree grows
> too large to parse in 10 minutes, in which case you have to
> (a) reduce the transfer frequency, and (b) resign yourself
> to have your i/o subsystem running flat out *all the time*.
Perhaps your Reiser4 plugin could log every file that is
changed and that could be fed to a --files-from argument or
a finely tuned utility that would rsync the files and
propagate deletes. Or maybe the better approach would be to
invest in a real cluster filesystem.
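
A rough sketch of the --files-from idea in Python, assuming
something else (the hypothetical plugin) already produces a list
of changed paths. The function names, the list file, and the
source and destination arguments are all made up for
illustration. Only the --files-from option itself is rsync's.

```python
import os
import subprocess

def split_change_log(changed_paths, src_root):
    """Split logged paths into (present, deleted), relative to src_root.

    Duplicates are collapsed and paths outside src_root are ignored.
    """
    present, deleted = set(), set()
    for path in changed_paths:
        rel = os.path.relpath(path, src_root)
        if rel == os.pardir or rel.startswith(os.pardir + os.sep):
            continue  # logged path is outside the exported tree
        if os.path.lexists(path):
            present.add(rel)
        else:
            deleted.add(rel)
    return sorted(present), sorted(deleted)

def build_rsync_command(list_file, src_root, dest):
    """Build an rsync command that transfers only the paths named in list_file."""
    return ["rsync", "-a", "--files-from=" + list_file, src_root, dest]

def sync_changed(list_file, src_root, dest):
    """Run the transfer; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(list_file, src_root, dest), check=True)
```

The paths in the list file are taken relative to the source
directory. The "deleted" list is not handled by this command and
would have to be removed on the far side by the separate utility.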
> Also, with the "monolithic" scan, the filesystem can easily
> change between the scan being done, and the actual directory/file
> in question being copied. Might it not be better all round
> to walk the tree progressively, making a sync plan for each
> "leaf node" of the tree as one reaches it?
Yes it would be better. We all agree but it cannot be done
without wholesale change to the protocol.
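
To make the idea concrete (this is not how rsync works today, and
it says nothing about what the protocol change would look like),
a progressive walk hands back one directory's worth of work at a
time instead of one list for the whole tree. A minimal sketch in
Python, with a hypothetical function name:

```python
import os

def plan_per_directory(root):
    """Yield (relative_dir, sorted_file_names) one directory at a time.

    os.walk is lazy, so the caller can act on the first directory
    before the rest of the tree has been scanned.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so traversal order is deterministic
        yield os.path.relpath(dirpath, root), sorted(filenames)
```

Each yielded pair is scanned immediately before it is used, which
narrows the window in which the filesystem can change under you.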
> Anyway, I'd be interested what people think -- this is an
> awesome tool, and if there's any chance that addressing
> some of these things is technically possible, I'd like to
> know. (Never know, I might be able to help get the work
> done, or at least fund someone)
>
> All the best,
>
> -Cedric
>
>
> --
> -
> | CCj/ClearLine - Unix/NT Administration and TCP/IP Network Services
> | 118 Louisa Street, Kitchener, Ontario, N2H 5M3, 519-741-2157
> \____________________________________________________________________
> Cedric Puddy, IS Director cedric@thinkers.org
> PGP Key Available at: http://www.thinkers.org/cedric
>
> --
> To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
> Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
>
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@pegasys.ws
Remember Cernan and Schmitt