Lester Hightower
2005-Apr-13 03:58 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
This is only my second email to the rsync mailing list. My first was sent under the title "Re: TODO hardlink performance optimizations" on Jan 3, 2004. The response of the rsync developers to that email was remarkable (in my opinion). I felt that the rsync performance enhancements that resulted from the ensuing discussion and code improvements were so consequential that I dared not post again until I had another topic worthy of the time and consideration of this superb group.

This idea might be a little far-fetched, but if so I guess that means I am just ignorant enough in this area to be willing to pitch a dumb idea. Sometimes ignorance is fruitful. If the list consensus is that it's a bad idea, or outside the goals of the rsync project, please feel free to just let me know that. I have thick skin and am fine with picking up my toys and leaving the playground until I have some other idea of possible value to contribute. :)

So, here goes:

Continuity of service, high availability, disaster recovery, etc. are all hot topics in today's world. Rsync is an excellent tool in the service continuity arsenal. However, routine rsyncs of large volumes with high file counts can cause service/performance issues on production servers tasked with their normal load plus servicing rsyncs for mirror slaves.

Recently, I have learned of, and my group has done some evaluating of, real-time peer-to-peer filesystem replication technologies. Two commercial products in particular, PeerFS and Constant Replicator, have promising designs, but neither is open source and both are from smaller companies without proven track records.

   http://www.radiantdata.com/
   http://www.constantdata.com/products/cr.php

These are _not_ cluster filesystems, where "cluster" would imply a single filesystem that N hosts can simultaneously access, but rather methods to keep N copies of a filesystem synchronized (in real time) across N Linux hosts. (**important distinction**)

To help articulate my idea for rsyncfs, I feel it is important to describe the approaches taken by PeerFS and Constant Replicator and to use those as a springboard to describe rsyncfs. Both embed themselves into the Linux kernel (as modules), and I assume into/under the VFS, but I am not sure. PeerFS is a peer-to-peer filesystem. It uses its own on-disk filesystem, so one uses PeerFS _instead_ of another filesystem like ext3 or reiser. Constant Replicator sits between the kernel and a normal filesystem, so one uses Constant Replicator on top of a "normal" Linux filesystem like ext3.

Here are some simple block diagrams to illustrate how I think each is architected:

            PeerFS                             Constant Replicator

     Host A         Host B                Host A         Host B
   +----------+   +----------+          +----------+   +----------+
   |  Block   |   |  Block   |          |  Block   |   |  Block   |
   |  Device  |   |  Device  |          |  Device  |   |  Device  |
   +----------+   +----------+          +----------+   +----------+
        ^^             ^^                    ^^             ^^
   +----------+   +----------+          +----------+   +----------+
   |  PeerFS  |<->|  PeerFS  |          | fs/ext3  |   | fs/ext3  |
   +----------+   +----------+          +----------+   +----------+
        ^^             ^^                    ^^             ^^
   +----------+   +----------+          +----------+   +----------+
   |  kernel  |   |  kernel  |          | Constant |-->| Constant |
   |   VFS    |   |   VFS    |          |Replicator|   |Replicator|
   +----------+   +----------+          +----------+   +----------+
                                             ^^             ^^
                                        +----------+   +----------+
                                        |  kernel  |   |  kernel  |
                                        |   VFS    |   |   VFS    |
                                        +----------+   +----------+

PeerFS is a many-to-many replication system where all "peers" in the cluster are read/write. Constant Replicator is a one-to-many system where only one master is read/write, and every mirror is read-only.
Replication communication between hosts in both systems is via TCP/IP.

I see benefits to both designs. I personally don't need the many-to-many features of PeerFS, and though we tested it and it seemed to work well, the design scares me. It just seems that too many issues would arise that would be impossible to troubleshoot -- NFS file locking haunts me. Even with PeerFS, though, you can force a one-to-many replication setup, which brings me to my next angst: PeerFS is a closed-source filesystem on Linux with no track record. As best I can tell it is _not_ journalled, and the system ships with mkfs.peerfs and fsck.peerfs tools.

The Constant Replicator design appeals to me because it more closely mirrors my needs and has fewer "scary black boxes" in the design. However, the more I have thought about what is really going on here, the more I am convinced that my rsyncfs idea is doable.

Let me diagram an rsyncfs scenario that I have in my head. Host A is the master, Host B the slave in this example:

     Host A            Host B
   +----------+      +----------+
   |  Block   |      |  Block   |
   |  Device  |      |  Device  |
   +----------+      +----------+
        ^^                ^^
   +----------+      +----------+
   | fs/ext3  |      | fs/ext3  |
   +----------+      +----------+
        ^^                ^^
   +----------+      +----------+
   |VFS Change|      |  kernel  |
   |  Logger  |      |   VFS    |
   +----------+      +----------+
        ^^
   +----------+
   |  kernel  |
   |   VFS    |
   +----------+

I envision the "VFS Change Logger" as a (hopefully very thin) middleware that sits between the kernel's VFS interfaces and a real filesystem, like ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the underlying filesystem driver, but it will make note of certain types of calls. I have these in mind so far, but there are likely others:

   open( ... , "w"), write(fd), unlink( ... ), mkdir( ... ), rmdir( ... )

The "VFS Change Logger" is then responsible for periodically reporting the names of the paths (files and directories) that have changed to a named pipe (a FIFO) that it places in the root of the mounted filesystem it is managing. So, when rsyncfs is managing a filesystem with an empty root directory, one should expect to see a named pipe called "chglog.fifo", and if one cats that named pipe she will see a constant stream of pathnames that the "VFS Change Logger" has noted changes to.

The actual replication happens in user-land with rsync as the transport. I think rsync will have to be tweaked a little to make this work, but given all the features already in rsync I don't think this will be a big deal. I envision an rsync running on Host A like:

   # rsync --constant --from0 --files-from-fifo=/vol/0/chglog.fifo ...

that will be communicating with an "rsync --constant ..." on the other end. The --constant flag is my way of stating that both rsyncs should become daemons and plan to "constantly" exchange syncing information until killed -- that is, this is a "constant" rsync, not just one run.

I believe that this could be a highly efficient method of keeping slave filesystems in sync, in very-near-real-time, while leveraging all of the good work already done and ongoing in rsync.
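[For illustration only: the user-land half of this proposal could be prototyped without the proposed --constant or --files-from-fifo flags. Below is a rough, untested sketch of a feeder that reads NUL-terminated pathnames from the (hypothetical) chglog.fifo, batches them, and hands each batch to a stock rsync through its existing --files-from=- and --from0 options. The FIFO path, mirror host name, and batch size are all made up.]

/*
 * feeder.c -- a rough, untested sketch of the user-land half of "rsyncfs".
 *
 * Assumptions (nothing here exists today): a kernel-side change logger
 * writes NUL-terminated pathnames, relative to the volume root, into
 * /vol/0/chglog.fifo.  Instead of the proposed --constant and
 * --files-from-fifo flags, this just batches pathnames and hands each
 * batch to a stock rsync through its existing --files-from=- and
 * --from0 options.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define FIFO_PATH "/vol/0/chglog.fifo"
#define RSYNC_CMD "rsync -a --from0 --files-from=- /vol/0/ mirrorhost:/vol/0/"
#define BATCH_MAX (64 * 1024)   /* start an rsync once roughly this much is queued */

/* Hand one batch of NUL-terminated pathnames to rsync on its stdin. */
static void flush_batch(const char *batch, size_t len)
{
    if (len == 0)
        return;
    FILE *rs = popen(RSYNC_CMD, "w");
    if (rs == NULL) {
        perror("popen rsync");
        return;
    }
    fwrite(batch, 1, len, rs);
    pclose(rs);                 /* waits for this rsync run to finish */
}

/* Flush only complete (NUL-terminated) names; return bytes kept back. */
static size_t flush_complete(char *batch, size_t used)
{
    size_t complete = used;
    while (complete > 0 && batch[complete - 1] != '\0')
        complete--;
    flush_batch(batch, complete);
    memmove(batch, batch + complete, used - complete);
    return used - complete;
}

int main(void)
{
    static char batch[BATCH_MAX];
    size_t used = 0;

    for (;;) {
        /* open() blocks until the change logger opens its end of the FIFO. */
        int fd = open(FIFO_PATH, O_RDONLY);
        if (fd < 0) {
            perror("open fifo");
            sleep(5);
            continue;
        }

        ssize_t n;
        while ((n = read(fd, batch + used, sizeof(batch) - used)) > 0) {
            used += (size_t)n;
            if (used >= sizeof(batch) / 2)
                used = flush_complete(batch, used);
        }
        close(fd);              /* writer went away; flush and reopen */
        used = flush_complete(batch, used);
    }
}

[A long-lived daemon pair, as described above, would obviously be more efficient than one rsync invocation per batch, and repeated names for a hot file would want de-duplicating, but nothing on the user-land side appears to require kernel changes.]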
I also believe that two-way peering (more similar to PeerFS) might be possible, though I'm not as confident in this one yet. Consider:

     Host A            Host B          Legend:  <M> - master
   +----------+      +----------+               <S> - slave
   |  Block   |      |  Block   |
   |  Device  |      |  Device  |      * Note, both boxes have
   +----------+      +----------+        a Master and a Slave
        ^^                ^^             rsync daemon running,
   +----------+      +----------+        each supporting one half
   | fs/ext3  |      | fs/ext3  |        of the full-duplex, or
   +----------+      +----------+        bidirectional, syncing.
        ^^                ^^
   +----------+      +----------+
   |VFS Change|      |VFS Change|
   |  Logger  |      |  Logger  |
   +----------+      +----------+
        ^^                ^^
   +----------+      +----------+
   |  kernel  |      |  kernel  |
   |   VFS    |      |   VFS    |
   +----------+      +----------+

Now consider four rsync daemons, two on each host, establishing two-way syncing (just add another pair in the other direction). Without any "help" the communication would be: M-rsync-A tells S-rsync-B to update /tmp/foo and it does, which modifies B's filesystem, so M-rsync-B tells S-rsync-A to update /tmp/foo, but it is determined that /tmp/foo matches, so nothing is done and things stop right there. In *theory* that sounds doable, but I smell race conditions in there somewhere.

I don't feel capable of coding this myself or I would probably be sending code snippets along with this email. I am hoping that someone with much more kernel expertise than I might read this email and comment on its practicality, workability, difficulty, or maybe even be inspired to give it a go.

I would appreciate feedback from anyone who read this far, positive or negative. I also apologize for such a lengthy email, but I wanted to share this idea with the rsync community; it just took lots of space to convey it...

Sincerely,

--
Lester Hightower
10East Corp.
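[One possible form of the "help" mentioned above for two-way peering is a simple echo filter: whatever feeds chglog.fifo entries to the outgoing (master-side) rsync first drops paths that the incoming (slave-side) rsync has just written. The sketch below is purely illustrative and untested; it assumes some way of learning which paths the incoming transfer wrote (for example by parsing its verbose output), and it does not remove the race worried about above -- a genuine local change made during the quiet window would be wrongly suppressed.]

/*
 * echo_filter.c -- sketch of one possible loop-prevention aid for two-way
 * peering.  Everything here is hypothetical; table size, quiet window,
 * and how applied paths are discovered are all made up for illustration.
 */
#include <string.h>
#include <time.h>

#define TABLE_SIZE 1024
#define QUIET_SECS 5            /* treat re-notifications within 5s as echoes */

struct applied {
    char   path[1024];
    time_t when;
};

static struct applied table[TABLE_SIZE];
static int next_slot;

/* Call this when the incoming (slave-side) rsync reports it wrote `path`. */
void note_applied(const char *path)
{
    strncpy(table[next_slot].path, path, sizeof(table[next_slot].path) - 1);
    table[next_slot].path[sizeof(table[next_slot].path) - 1] = '\0';
    table[next_slot].when = time(NULL);
    next_slot = (next_slot + 1) % TABLE_SIZE;
}

/* Call this for each path read from chglog.fifo before handing it to the
 * outgoing (master-side) rsync.  Returns 1 if the change is probably just
 * an echo of something the peer sent us and should be skipped. */
int is_echo(const char *path)
{
    time_t now = time(NULL);
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].when != 0 &&
            now - table[i].when <= QUIET_SECS &&
            strcmp(table[i].path, path) == 0)
            return 1;
    }
    return 0;
}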
Clint Byrum
2005-Apr-13 05:10 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
Lester Hightower wrote:

> I envision the "VFS Change Logger" as a (hopefully very thin) middleware
> that sits between the kernel's VFS interfaces and a real filesystem, like
> ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the
> underlying filesystem driver, but it will make note of certain types of
> calls. I have these in mind so far, but there are likely others:
>
>    open( ... , "w"), write(fd), unlink( ... ), mkdir( ... ), rmdir( ... )

I like your ideas. There's obviously a need for this type of stuff.

I do think that your ideas come somewhat close to the InterMezzo filesystem for Linux, which was at one time in the mainline kernel but was dropped some time ago because it wasn't being maintained. Their approach, I think, is too focused on standards and openness rather than on making something that works. Maybe your idea will have more focus. :)

http://www.inter-mezzo.org/
Craig Barratt
2005-Apr-13 06:55 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
Interesting ideas.

> I envision the "VFS Change Logger" as a (hopefully very thin) middleware
> that sits between the kernel's VFS interfaces and a real filesystem, like
> ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the
> underlying filesystem driver, but it will make note of certain types of
> calls...

If I understand your description correctly, inotify does something close to this (although I'm not sure where it sits relative to the VFS); see:

   http://www.kernel.org/pub/linux/kernel/people/rml/inotify/

It provides the information via /dev/inotify based on ioctl requests. I vaguely recall that inotify doesn't report hardlinks created via link() (at least based on looking at the example utility).

Inotify will drop events if the application doesn't read them fast enough. So a major part of the design is how to deal with that case, and of course the related problem of how to handle a cold start (maybe just run rsync, although on a live file system it is hard to know how much has changed since rsync checked/updated each file/directory). Perhaps you would have a background program that slowly reads (or re-writes the first byte of) each file, so that over time all the files get checked (although file deletions won't be mirrored).

Craig
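[For anyone who wants to experiment, a minimal, untested watcher sketch is below. Note that it uses the inotify_init()/inotify_add_watch() syscall interface that later went into the mainline kernel, not the /dev/inotify ioctl interface described above, and it watches only a single directory. The IN_Q_OVERFLOW case is exactly the dropped-events problem discussed in this message.]

/*
 * inotify_watch.c -- illustrative sketch only: print one line per event
 * for a single watched directory using the inotify syscall API.
 * Does nothing about the overflow case beyond announcing it.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }

    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    int wd = inotify_add_watch(fd, argv[1],
                               IN_CREATE | IN_MODIFY | IN_DELETE | IN_MOVE);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    /* Buffer must be aligned for struct inotify_event. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));

    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0) { perror("read"); break; }

        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & IN_Q_OVERFLOW)
                printf("EVENT QUEUE OVERFLOWED -- a full rescan is needed\n");
            else
                printf("%s/%s mask=0x%08x\n",
                       argv[1], ev->len ? ev->name : "",
                       (unsigned)ev->mask);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}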
Jan-Benedict Glaw
2005-Apr-13 07:55 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
On Tue, 2005-04-12 23:57:37 -0400, Lester Hightower <hightowe-rsync-list@10east.com> wrote in message <Pine.LNX.4.58.0504122319420.6482@les5.10east.com>:

> I envision the "VFS Change Logger" as a (hopefully very thin) middleware
> that sits between the kernel's VFS interfaces and a real filesystem, like
> ext3, reiser, etc. The "VFS Change Logger" will pass VFS calls to the
> underlying filesystem driver, but it will make note of certain types of
> calls. I have these in mind so far, but there are likely others:
>
>    open( ... , "w"), write(fd), unlink( ... ), mkdir( ... ), rmdir( ... )

I implemented such a beast in terms of an LD_PRELOAD library that intercepts these calls (as well as a whole lot of others that may update time information). Unfortunately, this was proprietary and I don't have access to the sources. However, I've done that, the idea is there, and it can be done again.

This only works, though, if there's a limited number of applications accessing the volume. So it won't work well for replicating /home, but it would work quite well for replicating a volume that's only accessed by an FTP server or some other kind of file server.

The tricky part, however, is fighting against glibc's internal data representation in this case. Glibc basically uses the "latest and greatest" data representation needed at the time of writing, and this interface changes from time to time. So you either have to closely follow glibc development, or you need to nail down one version that actually "works"...

A different approach would be to use ptrace for sniffing at the syscall layer, but this is troublesome right after fork(), IIRC.

MfG, JBG

--
Jan-Benedict Glaw    jbglaw@lug-owl.de    +49-172-7608481
"A free opinion in a free head, for a free state full of free citizens"
 -- against censorship on the Internet, against the war in Iraq
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));
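[Since the original library isn't available, here is a from-scratch, untested sketch of what the LD_PRELOAD approach can look like. It wraps only open() and unlink() and appends one line per event to a log file (a stand-in for the chglog.fifo idea). A usable version would also have to wrap creat(), rename(), truncate(), the 64-bit variants, fopen(), and friends, which is where the glibc issues mentioned above start to bite. The log-file path is made up.]

/*
 * chglog_preload.c -- sketch of an LD_PRELOAD change logger.
 *
 * Build:  gcc -shared -fPIC -o chglog_preload.so chglog_preload.c -ldl
 * Use:    LD_PRELOAD=./chglog_preload.so ftpd ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

#define LOGFILE "/var/tmp/chglog.txt"   /* hypothetical; stands in for the FIFO */

/* Append "op path" to the log using the real open(), avoiding recursion
 * into our own wrapper and avoiding FILE* buffering. */
static void log_path(const char *op, const char *path)
{
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    int fd = real_open(LOGFILE, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return;
    char line[1200];
    int n = snprintf(line, sizeof(line), "%s %s\n", op, path);
    if (n > 0)
        write(fd, line, (size_t)n);
    close(fd);
}

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    /* Only note opens that can modify the file. */
    if (flags & (O_WRONLY | O_RDWR | O_CREAT))
        log_path("open-for-write", path);

    return (flags & O_CREAT) ? real_open(path, flags, mode)
                             : real_open(path, flags);
}

int unlink(const char *path)
{
    static int (*real_unlink)(const char *);
    if (!real_unlink)
        real_unlink = (int (*)(const char *))dlsym(RTLD_NEXT, "unlink");
    log_path("unlink", path);
    return real_unlink(path);
}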
Justin Banks
2005-Apr-13 16:25 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
Lester Hightower wrote:

> PeerFS is a many-to-many replication system where all "peers" in the
> cluster are read/write. Constant Replicator is a one-to-many system where
> only one master is read/write, and every mirror is read-only. Replication
> communication between hosts in both systems is via TCP/IP.

(Disclosure: I wrote a lot/most of CR, and I work at Constant Data.)

This isn't strictly true, as CR allows bidirectional replication between two hosts, and we have many customers modifying data on destination systems. Bidirectional replication (assuming the replicated datasets overlap) is restricted to two systems, but replication can occur many-to-many as long as the replicated data is unique. In other words, the following is possible:

   Host A                Host B                Host C
   ------                ------                ------
   /data       <->       /data
   /stuff       ->       /junk
   /foo         <-       /bar
                         /baz         ->       /whatever
                         /bitbucket   <-       /garbage

All data on all hosts is read/write, but changes made to a data store that is a destination but not a source will not be replicated to other systems.

> The Constant Replicator design appeals to me because it more closely
> mirrors my needs and has fewer "scary black boxes" in the design.
> However, the more I have thought about what is really going on here, the
> more I am convinced that my rsyncfs idea is doable.

One of the benefits of CR is that it's filesystem agnostic as well as platform agnostic. You can replicate to and from Linux, Solaris, Windows, OS X, and AIX, although the realtime product is not available for all those platforms. CR also works with data modified by NFS clients.

For a Linux-only solution that only requires unidirectional one-to-one replication, you're on the right track ;)

-justinb

--
Justin Banks
Constant Data, Inc.
http://www.constantdata.com
Steve Bonds
2005-Apr-13 17:53 UTC
An idea: rsyncfs, an rsync-based real-time replicated filesystem
On 4/12/05, Lester Hightower wrote:

> The actual replication happens in user-land with rsync as the transport.
> I think rsync will have to be tweaked a little to make this work, but
> given all the features already in rsync I don't think this will be a big
> deal. I envision an rsync running on Host A like:
>
>    # rsync --constant --from0 --files-from-fifo=/vol/0/chglog.fifo ...
>
> that will be communicating with an "rsync --constant ..." on the other
> end. The --constant flag is my way of stating that both rsyncs should
> become daemons and plan to "constantly" exchange syncing information
> until killed -- that is, this is a "constant" rsync, not just one run.

Lester:

Something like this is very high on my list of products I wish I had. I frequently use rsync to replicate data on a near real-time basis. My biggest pain point here is replicating filesystems with many (millions of) small files. The time rsync spends traversing these directories is immense.

There have been discussions in the past of making an rsync that would replicate the contents of a raw device directly, saving the time spent checking each small file:

   http://lists.samba.org/archive/rsync/2002-August/003545.html
   http://lists.samba.org/archive/rsync/2003-October/007466.html

It seems that the consensus from the list at those times was that rsync is not the best utility for this, since it's designed to transfer many files rather than just one really big "file" (the contents of the device). Despite the fact that those discussions took place almost 18 months ago, I have seen no sign of the rsync-a-device utility. If it exists, this might be the solution to what you propose -- and it would work on more than Linux.

To achieve your goal with this proposed utility you would simply do something like this:

   + for each device
     ++ make a snapshot if your LVM supports it
     ++ transfer the diffs to the remote device
   + go back and do it all again

If the appropriate permissions were in place, this could be done entirely in user mode, which is a great advantage for portability.

As you touched on in your original message, knowing what has changed since the last run would be very helpful in reducing the amount of data that needs to be read on the source side. In my experience, sequential reads like this, even on large devices, don't take a huge amount of time compared with accessing large numbers of files. If there were only a few files on a mostly-empty volume, the performance difference would be more substantial. ;-)

Another thought to eliminate the kernel dependency is to combine the inode walk done by the "dump" utility with the rsync algorithm to reduce the file data transferred. The inode walk would be filesystem-specific, but could be done in user space using existing interfaces.

-- Steve
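[To make the device-level idea a little more concrete, here is a deliberately over-simplified, untested sketch. It is NOT the rsync algorithm -- no rolling checksums, no network protocol, only a weak per-block checksum -- and it leaves out both the LVM snapshot step and the actual transfer of the changed blocks. It only shows the sequential scan that detects which fixed-size blocks changed since the previous run. The device path, checksum-file path, and block size are all made up.]

/*
 * devdiff.c -- illustrative sketch: scan a block device sequentially,
 * compare per-block checksums against the ones saved on the previous
 * run, and report which blocks changed.
 *
 *   usage: devdiff /dev/vg0/snap0 /var/tmp/snap0.sums
 */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE (64 * 1024)

/* Toy Adler-32-style checksum; a real tool would use something stronger. */
static uint32_t weak_sum(const unsigned char *p, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + p[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <sum-file>\n", argv[0]);
        return 1;
    }

    FILE *dev = fopen(argv[1], "rb");
    FILE *old = fopen(argv[2], "rb");          /* may not exist on first run */
    char  tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.new", argv[2]);
    FILE *cur = fopen(tmp, "wb");
    if (!dev || !cur) { perror("fopen"); return 1; }

    unsigned char buf[BLOCK_SIZE];
    size_t n;
    unsigned long blockno = 0, changed = 0;

    while ((n = fread(buf, 1, sizeof(buf), dev)) > 0) {
        uint32_t sum = weak_sum(buf, n), prev = 0;
        int have_prev = old && fread(&prev, sizeof(prev), 1, old) == 1;

        if (!have_prev || prev != sum) {
            changed++;
            /* A real tool would queue this block for transfer here. */
            printf("block %lu changed\n", blockno);
        }
        fwrite(&sum, sizeof(sum), 1, cur);     /* record baseline for next run */
        blockno++;
    }

    if (old) fclose(old);
    fclose(dev);
    fclose(cur);
    rename(tmp, argv[2]);                      /* new sums become the baseline */

    fprintf(stderr, "%lu of %lu blocks changed\n", changed, blockno);
    return 0;
}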