Lester Hightower
2005-Apr-13 03:58 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
This is only my second email to the rsync mailing list. My first was sent under the title "Re: TODO hardlink performance optimizations" on Jan 3, 2004. The response of the rsync developers to that email was remarkable (in my opinion). I felt that the rsync performance enhancements that resulted from the ensuing discussion and code improvements were so consequential that I dared not post again until I had another topic worthy of the time and consideration of this superb group.

This idea might be a little far-fetched, but if so I guess that means I am just ignorant enough in this area to be willing to pitch a dumb idea. Sometimes ignorance is fruitful. If the list consensus is that it's a bad idea, or outside the goals of the rsync project, please feel free to just let me know that. I have thick skin and am fine with picking up my toys and leaving the playground until I have some other idea of possible value to contribute. :)

So, here goes:

Continuity of service, high availability, disaster recovery, etc. are all hot topics in today's world. Rsync is an excellent tool in the service continuity arsenal. However, routine rsyncs of large volumes with high file counts can cause service/performance issues on production servers tasked with their normal load plus servicing rsyncs for mirror slaves.

Recently, I have learned of, and my group has done some evaluating of, real-time peer-to-peer filesystem replication technologies. Two commercial products in particular, PeerFS and Constant Replicator, have promising designs, but neither is open source and both are from smaller companies without proven track records.

   http://www.radiantdata.com/
   http://www.constantdata.com/products/cr.php

These are _not_ cluster filesystems, where "cluster" would imply a single filesystem that N hosts can simultaneously access, but rather methods to keep N copies of a filesystem synchronized (in real time) across N Linux hosts. (**important distinction**)

To help articulate my idea for rsyncfs, I feel it is important to describe the approaches taken by PeerFS and Constant Replicator and to use those as a springboard to describe rsyncfs. Both embed themselves into the Linux kernel (as modules), and I assume into/under the VFS, but I am not sure. PeerFS is a peer-to-peer filesystem. It uses its own on-disk filesystem, so one uses PeerFS _instead_ of another filesystem like ext3 or reiser. Constant Replicator sits between the kernel and a normal filesystem, so one uses Constant Replicator on top of a "normal" Linux filesystem like ext3.

Here are some simple block diagrams to illustrate how I think each is architected:

            PeerFS                             Constant Replicator

     Host A         Host B                Host A         Host B
   +----------+   +----------+          +----------+   +----------+
   |  Block   |   |  Block   |          |  Block   |   |  Block   |
   |  Device  |   |  Device  |          |  Device  |   |  Device  |
   +----------+   +----------+          +----------+   +----------+
        ^^             ^^                    ^^             ^^
   +----------+   +----------+          +----------+   +----------+
   |  PeerFS  |<->|  PeerFS  |          | fs/ext3  |   | fs/ext3  |
   +----------+   +----------+          +----------+   +----------+
        ^^             ^^                    ^^             ^^
   +----------+   +----------+          +----------+   +----------+
   |  kernel  |   |  kernel  |          | Constant |-->| Constant |
   |   VFS    |   |   VFS    |          |Replicator|   |Replicator|
   +----------+   +----------+          +----------+   +----------+
                                             ^^             ^^
                                        +----------+   +----------+
                                        |  kernel  |   |  kernel  |
                                        |   VFS    |   |   VFS    |
                                        +----------+   +----------+

PeerFS is a many-to-many replication system where all "peers" in the cluster are read/write. Constant Replicator is a one-to-many system where only one master is read/write, and every mirror is read-only.
Replication communication between hosts in both systems is via TCP/IP.

I see benefits to both designs. I personally don't need the many-to-many features of PeerFS, and though we tested it and it seemed to work well, the design scares me. It just seems that too many issues would arise that would be impossible to troubleshoot -- NFS file locking haunts me. Even with PeerFS, though, you can force a one-to-many replication setup, which brings me to my next angst: PeerFS is a closed-source filesystem on Linux with no track record. As best I can tell it is _not_ journalled, and the system ships with mkfs.peerfs and fsck.peerfs tools.

The Constant Replicator design appeals to me because it more closely mirrors my needs and has fewer "scary black boxes" in the design. However, the more I have thought about what is really going on here, the more I am convinced that my rsyncfs idea is doable.

Let me diagram an rsyncfs scenario that I have in my head. Host A is the master, Host B the slave in this example:

     Host A            Host B
   +----------+      +----------+
   |  Block   |      |  Block   |
   |  Device  |      |  Device  |
   +----------+      +----------+
        ^^                ^^
   +----------+      +----------+
   | fs/ext3  |      | fs/ext3  |
   +----------+      +----------+
        ^^                ^^
   +----------+      +----------+
   |VFS Change|      |  kernel  |
   |  Logger  |      |   VFS    |
   +----------+      +----------+
        ^^
   +----------+
   |  kernel  |
   |   VFS    |
   +----------+

I envision the "VFS Change Logger" as a (hopefully very thin) middleware that sits between the kernel's VFS interfaces and a real filesystem, like ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the underlying filesystem driver, but it will make note of certain types of calls. I have these in mind so far, but there are likely others:

   open( ... , "w"), write(fd), unlink( ... ), mkdir( ... ), rmdir( ... )

The "VFS Change Logger" is then responsible for periodically reporting the names of the paths (files and directories) that have changed to a named pipe (a FIFO) that it places in the root of the mounted filesystem it is managing. So, when rsyncfs is managing a filesystem with an empty root directory, one should expect to see a named pipe called "chglog.fifo", and if one cats that named pipe she will see a constant stream of pathnames that the "VFS Change Logger" has noted changes to.

The actual replication happens in user-land with rsync as the transport. I think rsync will have to be tweaked a little to make this work, but given all the features already in rsync I don't think this will be a big deal. I envision an rsync running on Host A like:

   # rsync --constant --from0 --files-from-fifo=/vol/0/chglog.fifo ...

that will be communicating with an "rsync --constant ..." on the other end. The --constant flag is my way of stating that both rsyncs should become daemons and plan to "constantly" exchange syncing information until killed -- that is, this is a "constant" rsync, not just one run.

I believe that this could be a highly efficient method of keeping slave filesystems in sync, in very-near-real-time, while leveraging all of the good work already done and ongoing in rsync.
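[For illustration only: the user-land half of this proposal could be prototyped without the proposed --constant or --files-from-fifo flags. Below is a rough, untested sketch of a feeder that reads NUL-terminated pathnames from the (hypothetical) chglog.fifo, batches them, and hands each batch to a stock rsync through its existing --files-from=- and --from0 options. The FIFO path, mirror host name, and batch size are all made up.]

/*
 * feeder.c -- a rough, untested sketch of the user-land half of "rsyncfs".
 *
 * Assumptions (nothing here exists today): a kernel-side change logger
 * writes NUL-terminated pathnames, relative to the volume root, into
 * /vol/0/chglog.fifo.  Instead of the proposed --constant and
 * --files-from-fifo flags, this just batches pathnames and hands each
 * batch to a stock rsync through its existing --files-from=- and
 * --from0 options.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define FIFO_PATH "/vol/0/chglog.fifo"
#define RSYNC_CMD "rsync -a --from0 --files-from=- /vol/0/ mirrorhost:/vol/0/"
#define BATCH_MAX (64 * 1024)   /* start an rsync once roughly this much is queued */

/* Hand one batch of NUL-terminated pathnames to rsync on its stdin. */
static void flush_batch(const char *batch, size_t len)
{
    if (len == 0)
        return;
    FILE *rs = popen(RSYNC_CMD, "w");
    if (rs == NULL) {
        perror("popen rsync");
        return;
    }
    fwrite(batch, 1, len, rs);
    pclose(rs);                 /* waits for this rsync run to finish */
}

/* Flush only complete (NUL-terminated) names; return bytes kept back. */
static size_t flush_complete(char *batch, size_t used)
{
    size_t complete = used;
    while (complete > 0 && batch[complete - 1] != '\0')
        complete--;
    flush_batch(batch, complete);
    memmove(batch, batch + complete, used - complete);
    return used - complete;
}

int main(void)
{
    static char batch[BATCH_MAX];
    size_t used = 0;

    for (;;) {
        /* open() blocks until the change logger opens its end of the FIFO. */
        int fd = open(FIFO_PATH, O_RDONLY);
        if (fd < 0) {
            perror("open fifo");
            sleep(5);
            continue;
        }

        ssize_t n;
        while ((n = read(fd, batch + used, sizeof(batch) - used)) > 0) {
            used += (size_t)n;
            if (used >= sizeof(batch) / 2)
                used = flush_complete(batch, used);
        }
        close(fd);              /* writer went away; flush and reopen */
        used = flush_complete(batch, used);
    }
}

[A long-lived daemon pair, as described above, would obviously be more efficient than one rsync invocation per batch, and repeated names for a hot file would want de-duplicating, but nothing on the user-land side appears to require kernel changes.]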
I also believe that two-way peering (more similar to PeerFS) might be possible, though I'm not as confident in this one yet. Consider:

     Host A            Host B          Legend:  <M> - master
   +----------+      +----------+               <S> - slave
   |  Block   |      |  Block   |
   |  Device  |      |  Device  |      * Note, both boxes have
   +----------+      +----------+        a Master and a Slave
        ^^                ^^             rsync daemon running,
   +----------+      +----------+        each supporting one half
   | fs/ext3  |      | fs/ext3  |        of the full-duplex, or
   +----------+      +----------+        bidirectional, syncing.
        ^^                ^^
   +----------+      +----------+
   |VFS Change|      |VFS Change|
   |  Logger  |      |  Logger  |
   +----------+      +----------+
        ^^                ^^
   +----------+      +----------+
   |  kernel  |      |  kernel  |
   |   VFS    |      |   VFS    |
   +----------+      +----------+

Now consider four rsync daemons, two on each host, establishing two-way syncing (just add another pair in the other direction). Without any "help" the communication would be: M-rsync-A tells S-rsync-B to update /tmp/foo and it does, which modifies B's filesystem, so M-rsync-B tells S-rsync-A to update /tmp/foo, but it is determined that /tmp/foo matches, so nothing is done and things stop right there. In *theory* that sounds doable, but I smell race conditions in there somewhere.

I don't feel capable of coding this myself or I would probably be sending code snippets along with this email. I am hoping that someone with much more kernel expertise than I might read this email and comment on its practicality, workability, difficulty, or maybe even be inspired to give it a go.

I would appreciate feedback from anyone who read this far, positive or negative. I also apologize for such a lengthy email, but I wanted to share this idea with the rsync community; it just took lots of space to convey it...

Sincerely,

--
Lester Hightower
10East Corp.
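[One possible form of the "help" mentioned above for two-way peering is a simple echo filter: whatever feeds chglog.fifo entries to the outgoing (master-side) rsync first drops paths that the incoming (slave-side) rsync has just written. The sketch below is purely illustrative and untested; it assumes some way of learning which paths the incoming transfer wrote (for example by parsing its verbose output), and it does not remove the race worried about above -- a genuine local change made during the quiet window would be wrongly suppressed.]

/*
 * echo_filter.c -- sketch of one possible loop-prevention aid for two-way
 * peering.  Everything here is hypothetical; table size, quiet window,
 * and how applied paths are discovered are all made up for illustration.
 */
#include <string.h>
#include <time.h>

#define TABLE_SIZE 1024
#define QUIET_SECS 5            /* treat re-notifications within 5s as echoes */

struct applied {
    char   path[1024];
    time_t when;
};

static struct applied table[TABLE_SIZE];
static int next_slot;

/* Call this when the incoming (slave-side) rsync reports it wrote `path`. */
void note_applied(const char *path)
{
    strncpy(table[next_slot].path, path, sizeof(table[next_slot].path) - 1);
    table[next_slot].path[sizeof(table[next_slot].path) - 1] = '\0';
    table[next_slot].when = time(NULL);
    next_slot = (next_slot + 1) % TABLE_SIZE;
}

/* Call this for each path read from chglog.fifo before handing it to the
 * outgoing (master-side) rsync.  Returns 1 if the change is probably just
 * an echo of something the peer sent us and should be skipped. */
int is_echo(const char *path)
{
    time_t now = time(NULL);
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].when != 0 &&
            now - table[i].when <= QUIET_SECS &&
            strcmp(table[i].path, path) == 0)
            return 1;
    }
    return 0;
}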
Clint Byrum
2005-Apr-13 05:10 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
Lester Hightower wrote:

> I envision the "VFS Change Logger" as a (hopefully very thin) middleware
> that sits between the kernel's VFS interfaces and a real filesystem, like
> ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the
> underlying filesystem driver, but it will make note of certain types of
> calls. I have these in mind so far, but there are likely others:
>
>    open( ... , "w"), write(fd), unlink( ... ), mkdir( ... ), rmdir( ... )

I like your ideas. There's obviously a need for this type of stuff.

I do think that your ideas come somewhat close to the InterMezzo filesystem for Linux, which was at one time in the mainline kernel but was dropped some time ago because it wasn't being maintained. Their approach, I think, is too focused on standards and openness rather than on making something that works. Maybe your idea will have more focus. :)

http://www.inter-mezzo.org/
Craig Barratt
2005-Apr-13 06:55 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
Interesting ideas.

> I envision the "VFS Change Logger" as a (hopefully very thin) middleware
> that sits between the kernel's VFS interfaces and a real filesystem, like
> ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the
> underlying filesystem driver, but it will make note of certain types of
> calls...

If I understand your description correctly, inotify does something close to this (although I'm not sure where it sits relative to the VFS); see:

   http://www.kernel.org/pub/linux/kernel/people/rml/inotify/

It provides the information via /dev/inotify based on ioctl requests. I vaguely recall that inotify doesn't report hardlinks created via link() (at least based on looking at the example utility).

Inotify will drop events if the application doesn't read them fast enough. So a major part of the design is how to deal with that case, and of course the related problem of how to handle a cold start (maybe just run rsync, although on a live file system it is hard to know how much has changed since rsync checked/updated each file/directory). Perhaps you would have a background program that slowly reads (or re-writes the first byte of) each file, so that over time all the files get checked (although file deletions won't be mirrored).

Craig
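[For anyone who wants to experiment, a minimal, untested watcher sketch is below. Note that it uses the inotify_init()/inotify_add_watch() syscall interface that later went into the mainline kernel, not the /dev/inotify ioctl interface described above, and it watches only a single directory. The IN_Q_OVERFLOW case is exactly the dropped-events problem discussed in this message.]

/*
 * inotify_watch.c -- illustrative sketch only: print one line per event
 * for a single watched directory using the inotify syscall API.
 * Does nothing about the overflow case beyond announcing it.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }

    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    int wd = inotify_add_watch(fd, argv[1],
                               IN_CREATE | IN_MODIFY | IN_DELETE | IN_MOVE);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    /* Buffer must be aligned for struct inotify_event. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));

    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0) { perror("read"); break; }

        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & IN_Q_OVERFLOW)
                printf("EVENT QUEUE OVERFLOWED -- a full rescan is needed\n");
            else
                printf("%s/%s mask=0x%08x\n",
                       argv[1], ev->len ? ev->name : "",
                       (unsigned)ev->mask);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}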
Jan-Benedict Glaw
2005-Apr-13 07:55 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
On Tue, 2005-04-12 23:57:37 -0400, Lester Hightower <hightowe-rsync-list@10east.com> wrote in message <Pine.LNX.4.58.0504122319420.6482@les5.10east.com>:

> I envision the "VFS Change Logger" as a (hopefully very thin) middleware
> that sits between the kernel's VFS interfaces and a real filesystem, like
> ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the
> underlying filesystem driver, but it will make note of certain types of
> calls. I have these in mind so far, but there are likely others:
>
>    open( ... , "w"), write(fd), unlink( ... ), mkdir( ... ), rmdir( ... )

I implemented such a beast in terms of an LD_PRELOAD library that intercepts these calls (as well as a whole lot of others that may update time information). Unfortunately, this was proprietary and I don't have access to the sources. However, I've done that, the idea is there, and it can be done again.

This only works, though, if there's a limited number of applications accessing the volume. So it won't work well for replicating /home, but it would work quite well for replicating a volume that's only accessed by an FTP server or some other kind of file server.

The tricky part, however, is fighting against glibc's internal data representation in this case. Glibc basically uses the "latest and greatest" data representation needed at the time of writing, and this interface changes from time to time. So you either have to closely follow glibc development, or you need to nail down one version that actually "works"...

A different approach would be to use ptrace for sniffing at the syscall layer, but this is troublesome right after fork(), IIRC.

MfG, JBG

--
Jan-Benedict Glaw    jbglaw@lug-owl.de    +49-172-7608481
"A free opinion in a free head, for a free state full of free citizens"
 -- against censorship on the Internet, against the war in Iraq
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));
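[Since the original library isn't available, here is a from-scratch, untested sketch of what the LD_PRELOAD approach can look like. It wraps only open() and unlink() and appends one line per event to a log file (a stand-in for the chglog.fifo idea). A usable version would also have to wrap creat(), rename(), truncate(), the 64-bit variants, fopen(), and friends, which is where the glibc issues mentioned above start to bite. The log-file path is made up.]

/*
 * chglog_preload.c -- sketch of an LD_PRELOAD change logger.
 *
 * Build:  gcc -shared -fPIC -o chglog_preload.so chglog_preload.c -ldl
 * Use:    LD_PRELOAD=./chglog_preload.so ftpd ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

#define LOGFILE "/var/tmp/chglog.txt"   /* hypothetical; stands in for the FIFO */

/* Append "op path" to the log using the real open(), avoiding recursion
 * into our own wrapper and avoiding FILE* buffering. */
static void log_path(const char *op, const char *path)
{
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    int fd = real_open(LOGFILE, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return;
    char line[1200];
    int n = snprintf(line, sizeof(line), "%s %s\n", op, path);
    if (n > 0)
        write(fd, line, (size_t)n);
    close(fd);
}

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    /* Only note opens that can modify the file. */
    if (flags & (O_WRONLY | O_RDWR | O_CREAT))
        log_path("open-for-write", path);

    return (flags & O_CREAT) ? real_open(path, flags, mode)
                             : real_open(path, flags);
}

int unlink(const char *path)
{
    static int (*real_unlink)(const char *);
    if (!real_unlink)
        real_unlink = (int (*)(const char *))dlsym(RTLD_NEXT, "unlink");
    log_path("unlink", path);
    return real_unlink(path);
}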
Justin Banks
2005-Apr-13 16:25 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
Lester Hightower wrote:

> PeerFS is a many-to-many replication system where all "peers" in the
> cluster are read/write. Constant Replicator is a one-to-many system where
> only one master is read/write, and every mirror is read-only. Replication
> communication between hosts in both systems is via TCP/IP.

(Disclosure: I wrote a lot/most of CR, and I work at Constant Data.)

This isn't strictly true, as CR allows bidirectional replication between two hosts, and we have many customers modifying data on destination systems. Bidirectional replication (assuming the replicated datasets overlap) is restricted to two systems, but replication can occur many-to-many as long as the replicated data is unique. In other words, the following is possible:

   Host A                Host B                Host C
   ------                ------                ------
   /data       <->       /data
   /stuff       ->       /junk
   /foo         <-       /bar
                         /baz         ->       /whatever
                         /bitbucket   <-       /garbage

All data on all hosts is read/write, but changes made to a data store that is a destination but not a source will not be replicated to other systems.

> The Constant Replicator design appeals to me because it more closely
> mirrors my needs and has fewer "scary black boxes" in the design.
> However, the more I have thought about what is really going on here, the
> more I am convinced that my rsyncfs idea is doable.

One of the benefits of CR is that it's filesystem agnostic as well as platform agnostic. You can replicate to and from Linux, Solaris, Windows, OS X, and AIX, although the realtime product is not available for all those platforms. CR also works with data modified by NFS clients.

For a Linux-only solution that only requires unidirectional one-to-one replication, you're on the right track ;)

-justinb

--
Justin Banks
Constant Data, Inc.
http://www.constantdata.com
Steve Bonds
2005-Apr-13 17:53 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
On 4/12/05, Lester Hightower wrote:

> The actual replication happens in user-land with rsync as the transport.
> I think rsync will have to be tweaked a little to make this work, but
> given all the features already in rsync I don't think this will be a big
> deal. I envision an rsync running on Host A like:
>
>    # rsync --constant --from0 --files-from-fifo=/vol/0/chglog.fifo ...
>
> that will be communicating with an "rsync --constant ..." on the other
> end. The --constant flag is my way of stating that both rsyncs should
> become daemons and plan to "constantly" exchange syncing information
> until killed -- that is, this is a "constant" rsync, not just one run.

Lester:

Something like this is very high on my list of products I wish I had. I frequently use rsync to replicate data on a near real-time basis. My biggest pain point here is replicating filesystems with many (millions of) small files. The time rsync spends traversing these directories is immense.

There have been discussions in the past of making an rsync that would replicate the contents of a raw device directly, saving the time spent checking each small file:

   http://lists.samba.org/archive/rsync/2002-August/003545.html
   http://lists.samba.org/archive/rsync/2003-October/007466.html

It seems that the consensus from the list at those times was that rsync is not the best utility for this, since it's designed to transfer many files rather than just one really big "file" (the contents of the device). Despite the fact that those discussions took place almost 18 months ago, I have seen no sign of the rsync-a-device utility. If it exists, this might be the solution to what you propose -- and it would work on more than Linux.

To achieve your goal with this proposed utility you would simply do something like this:

   + for each device
     ++ make a snapshot if your LVM supports it
     ++ transfer the diffs to the remote device
   + go back and do it all again

If the appropriate permissions were in place, this could be done entirely in user mode, which is a great advantage for portability.

As you touched on in your original message, knowing what has changed since the last run would be very helpful in reducing the amount of data that needs to be read on the source side. In my experience, sequential reads like this, even on large devices, don't take a huge amount of time compared with accessing large numbers of files. If there were only a few files on a mostly-empty volume, the performance difference would be more substantial. ;-)

Another thought to eliminate the kernel dependency is to combine the inode walk done by the "dump" utility with the rsync algorithm to reduce the file data transferred. The inode walk would be filesystem-specific, but could be done in user space using existing interfaces.

-- Steve
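[To make the device-level idea a little more concrete, here is a deliberately over-simplified, untested sketch. It is NOT the rsync algorithm -- no rolling checksums, no network protocol, only a weak per-block checksum -- and it leaves out both the LVM snapshot step and the actual transfer of the changed blocks. It only shows the sequential scan that detects which fixed-size blocks changed since the previous run. The device path, checksum-file path, and block size are all made up.]

/*
 * devdiff.c -- illustrative sketch: scan a block device sequentially,
 * compare per-block checksums against the ones saved on the previous
 * run, and report which blocks changed.
 *
 *   usage: devdiff /dev/vg0/snap0 /var/tmp/snap0.sums
 */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE (64 * 1024)

/* Toy Adler-32-style checksum; a real tool would use something stronger. */
static uint32_t weak_sum(const unsigned char *p, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + p[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <sum-file>\n", argv[0]);
        return 1;
    }

    FILE *dev = fopen(argv[1], "rb");
    FILE *old = fopen(argv[2], "rb");          /* may not exist on first run */
    char  tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.new", argv[2]);
    FILE *cur = fopen(tmp, "wb");
    if (!dev || !cur) { perror("fopen"); return 1; }

    unsigned char buf[BLOCK_SIZE];
    size_t n;
    unsigned long blockno = 0, changed = 0;

    while ((n = fread(buf, 1, sizeof(buf), dev)) > 0) {
        uint32_t sum = weak_sum(buf, n), prev = 0;
        int have_prev = old && fread(&prev, sizeof(prev), 1, old) == 1;

        if (!have_prev || prev != sum) {
            changed++;
            /* A real tool would queue this block for transfer here. */
            printf("block %lu changed\n", blockno);
        }
        fwrite(&sum, sizeof(sum), 1, cur);     /* record baseline for next run */
        blockno++;
    }

    if (old) fclose(old);
    fclose(dev);
    fclose(cur);
    rename(tmp, argv[2]);                      /* new sums become the baseline */

    fprintf(stderr, "%lu of %lu blocks changed\n", changed, blockno);
    return 0;
}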