Hi there,

I've got a question that I'm sure has been addressed somewhere, so sorry
if I'm asking the same question twice, but here goes:

I've currently got two Linux machines running drbd (remote device mirror)
and it's working perfectly, but I'd love to use ZFS (* I <heart> ZFS *).
Alas, I don't seem to see any information on remote mirroring other than
a blog I've found about using NFS to export the device:

http://blogs.sun.com/roller/page/chrisg?entry=zfs_remote_replication
http://blogs.sun.com/roller/page/chrisg?entry=more_with_the_zfs_external

The pages describe an idea that seems somewhat fiddly, and I'd rather not
trust it in a production-type environment. Anyone have any more 'info'
for me?

P
--
Patrick
----------------------------------------
patrick <at> eefy <dot> net
On Mon, 2006-05-08 at 14:27 +0200, Patrick wrote:
> Hi there,

Hello.

> I've got a question that I'm sure has been addressed somewhere, so
> sorry if I'm asking the same question twice, but here goes:
>
> I've currently got two Linux machines running drbd (remote device
> mirror) and it's working perfectly, but I'd love to use ZFS (* I
> <heart> ZFS *). Alas, I don't seem to see any information on remote
> mirroring other than a blog I've found about using NFS to export the
> device:
>
> http://blogs.sun.com/roller/page/chrisg?entry=zfs_remote_replication
> http://blogs.sun.com/roller/page/chrisg?entry=more_with_the_zfs_external
>
> The pages describe an idea that seems somewhat fiddly, and I'd rather
> not trust it in a production-type environment. Anyone have any more
> 'info' for me?

Well, this can be a pretty deep topic for a Monday morning :-)

In my experience, the approach and solution for "remote mirroring"
really depend on two things:

1. Are you doing disaster recovery, versus mirroring diversity?
2. How far apart are the mirrored devices?

Without answering those questions first, you risk a suboptimal solution.

 -- richard
> Hello.

Howdy! :)

> Well, this can be a pretty deep topic for a Monday morning :-)

Well, Monday evening for some of us ;)

> In my experience, the approach and solution for "remote mirroring"
> really depend on two things:
> 1. Are you doing disaster recovery, versus mirroring diversity?

I'm not actually sure. I'm currently mirroring from one disk device to
the other over the network to cater for hardware failures and "software"
failures (such as a kernel panic); the idea was to have an 'offline'
machine that would hold a full copy of the data. How useful that would
be in the real world is still to be decided. Currently I've also got it
clipped into a few other bits like Heartbeat, so it'll do a full
failover and move, but I'm probably going to remove that because of the
'extra layer of complexity creating more complex problems'.

So I suppose that'd put me into the 'disaster recovery' class.

> 2. How far apart are the mirrored devices?

About 15cm-20cm (via crossover on separate interfaces, not network
osmosis). (v20z's, btw.)

> Without answering those questions first, you risk a suboptimal
> solution.

Anything I missed?

Patrick
On Mon, May 08, 2006 at 02:27:05PM +0200, Patrick wrote:
> I've currently got two Linux machines running drbd (remote device
> mirror) and it's working perfectly, but I'd love to use ZFS (* I
> <heart> ZFS *). Alas, I don't seem to see any information on remote
> mirroring other than a blog I've found about using NFS to export the
> device:
>
> http://blogs.sun.com/roller/page/chrisg?entry=zfs_remote_replication
> http://blogs.sun.com/roller/page/chrisg?entry=more_with_the_zfs_external
>
> The pages describe an idea that seems somewhat fiddly, and I'd rather
> not trust it in a production-type environment. Anyone have any more
> 'info' for me?

There are two basic methods of doing remote replication: remote
mirroring and asynchronous updates.

Remote mirroring can be done through iSCSI/zvols, though the failover
case is a little awkward (you'll be doing ZFS on top of zvols for local
access). It also implies synchronous operation for every write, which
would slow down local access. The remote data is also not available
(even read-only) until you perform a failover, because mirroring occurs
at the block level and the upper layers cannot stay in sync.

Asynchronous remote replication can be done today with 'zfs send' and
'zfs receive', though it needs some more work to be truly useful. It has
the property that it doesn't tax local activity, but your data will be
slightly out of sync (depending on how often you sync your data,
preferably every few minutes).

Among the things we're working on to make this easier:

- recursive snapshots ('zfs snapshot -r')
- recursive send ('zfs send -r')
- ability to send properties ('zfs send -p')
- read-only receive on the remote end (currently the fs has to be unmounted)
- ability to receive into the root dataset

Once these are implemented, it should be fairly easy to construct your
own cron job to do regular remote replication using any transport you'd
like (probably SSH). The tricky part comes when local churn outpaces
regular replication. Do you want to guarantee your time delta (by
slowing down local access somehow), or hope that the remote end will
catch up (disabling more attempts in the process)?

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
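[The basic send/receive mechanics Eric refers to work today and look
roughly like this - a minimal sketch; the pool name, snapshot labels,
and remote host are hypothetical:

  # Seed the replica once with a full stream:
  zfs snapshot tank/fs@monday
  zfs send tank/fs@monday | ssh replica zfs receive tank/fs

  # Later, ship only the changes since the last common snapshot:
  zfs snapshot tank/fs@tuesday
  zfs send -i tank/fs@monday tank/fs@tuesday | ssh replica zfs receive tank/fs

As Eric's list notes, the receiving filesystem currently has to stay
unmounted and unmodified between incremental receives, or the next
'zfs receive' will fail.]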
On Tue, 2006-05-09 at 00:49 +0200, Patrick wrote:
> > In my experience, the approach and solution for "remote mirroring"
> > really depend on two things:
> > 1. Are you doing disaster recovery, versus mirroring diversity?
>
> I'm not actually sure. I'm currently mirroring from one disk device
> to the other over the network to cater for hardware failures and
> "software" failures (such as a kernel panic); the idea was to have an
> 'offline' machine that would hold a full copy of the data. [...]
>
> So I suppose that'd put me into the 'disaster recovery' class.

No. You are describing a more common mirroring diversity setup.

> > 2. How far apart are the mirrored devices?
>
> About 15cm-20cm (via crossover on separate interfaces, not network
> osmosis). (v20z's, btw.)

Definitely not disaster recovery... if a tornado hit one box, it would
likely also hit the other. Also, in disaster recovery scenarios we often
consider a required time delay between committing data on the primary
versus the secondary, in order to protect against accidental data loss
(eg. rm *).

There are a couple of ways to do this, some of which aren't quite ready
for release and aren't part of the OpenSolaris tree yet.

The Sun StorageTek Availability Suite software provides block-device
level replication and should work out of the box with ZFS (I dunno for
sure, but since they operate at different levels in the stack, they
should interoperate OK).
http://www.sun.com/storagetek/management_software/data_protection/availability/

For a real cluster solution, which it sounds like you don't want, Sun
Cluster will have failover file system support for ZFS in an upcoming
release, much like what currently exists for other file systems (UFS,
QFS, VxFS).

It seems to me that drbd is a compromise between the previous two. I
would say that they are doing the easy parts, but putting off the hard
parts.

For a more point-in-time snapshot solution, you could use zfs
send/receive.

There has also been a discussion of the requirements for a multi-node
ZFS implementation. The window for comments may still be open; see the
ZFS discuss forum archive for the threads.

 -- richard
Eric Schrock wrote:

> ...
> Asynchronous remote replication can be done today with 'zfs send' and
> 'zfs receive', though it needs some more work to be truly useful. It has
> the property that it doesn't tax local activity, but your data will be
> slightly out of sync (depending on how often you sync your data,
> preferably every few minutes).

Is it possible to add "tail -f"-like properties to 'zfs send'?

I suppose what I'm thinking of for 'zfs send -f' would be to send down
all of the transactions that update a ZFS data set, both the metadata
and the data.

The catch here would be to start the 'zfs send -f' at the same time as
the filesystem came online so that there weren't any transactional gaps.

Thoughts?

Darren
On Tue, May 09, 2006 at 01:33:33PM -0700, Darren Reed wrote:
> Is it possible to add "tail -f"-like properties to 'zfs send'?
>
> I suppose what I'm thinking of for 'zfs send -f' would be to send
> down all of the transactions that update a ZFS data set, both the
> metadata and the data.

'zfs send' always sends all the changes, including metadata and data.

> The catch here would be to start the 'zfs send -f' at the same time
> as the filesystem came online so that there weren't any transactional
> gaps.

You can always simply run 'zfs snapshot; zfs send -i ... | ssh ...' in a
loop. This is an implementation of best-effort remote replication.

Perhaps you're looking for a more real-time remote replication. See:

  5036182 want remote replication (intent-log based)

Another possible remote replication implementation would allow the
administrator to put a bound on how far the remote side can be out of
date (eg. by an amount of time, or an amount of modified data). This
could be implemented by using 'zfs send -i', with some hooks to stall
changes to the filesystem if the 'zfs send -i' gets too far behind.

--matt
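[Concretely, the best-effort loop Matt describes might look something
like this sketch - the filesystem, host, and five-minute interval are
assumptions, and real use would want error handling:

  #!/bin/sh
  # Best-effort replication loop: snapshot, send the delta, wait, repeat.
  # Assumes tank/fs@0 already exists on host "replica", seeded with a
  # full 'zfs send tank/fs@0 | ssh replica zfs receive tank/fs'.
  i=0
  while :; do
          j=`expr $i + 1`
          zfs snapshot tank/fs@$j
          zfs send -i tank/fs@$i tank/fs@$j | ssh replica zfs receive tank/fs
          zfs destroy tank/fs@$i        # the replica no longer needs @$i
          i=$j
          sleep 300                     # data is at most ~5 minutes stale
  done
]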
On Tue, May 09, 2006 at 01:33:33PM -0700, Darren Reed wrote:
> Is it possible to add "tail -f"-like properties to 'zfs send'?
>
> I suppose what I'm thinking of for 'zfs send -f' would be to send
> down all of the transactions that update a ZFS data set, both the
> metadata and the data.
>
> The catch here would be to start the 'zfs send -f' at the same time
> as the filesystem came online so that there weren't any transactional
> gaps.
>
> Thoughts?

+1

Add to this some churn/replication throttling, and you may not want just
a command-line interface but a library as well.

E.g., if the stdout/remote connection of 'zfs send -f' blocked for long
or broke, then ZFS should snapshot at the latest TXG and hold on to that
snapshot until the output could drain and/or the connection be restored,
then resume by sending the incremental from the current TXG to that
snapshot...

Nico
--
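[Until something like that exists in the kernel, the hold-and-retry
behaviour Nico sketches can be approximated in userland with today's
snapshot-based send/receive, since a snapshot pins its data until the
stream drains, and a failed 'zfs receive' leaves the destination
unchanged, so retrying is safe. A rough sketch; all names hypothetical:

  #!/bin/sh
  # Keep the newest snapshot until the replica acknowledges it, so a
  # broken connection just means retrying the same incremental stream.
  # LAST is the most recent snapshot known to exist on both sides.
  LAST=rep0
  NEXT=rep1
  zfs snapshot tank/fs@$NEXT
  until zfs send -i tank/fs@$LAST tank/fs@$NEXT | \
        ssh replica zfs receive tank/fs; do
          sleep 60                # output blocked or link down; retry
  done
  zfs destroy tank/fs@$LAST       # replica has @$NEXT; @$LAST can go
]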
Matthew Ahrens wrote:

> On Tue, May 09, 2006 at 01:33:33PM -0700, Darren Reed wrote:
> > Is it possible to add "tail -f"-like properties to 'zfs send'?
> >
> > I suppose what I'm thinking of for 'zfs send -f' would be to send
> > down all of the transactions that update a ZFS data set, both the
> > metadata and the data.
>
> 'zfs send' always sends all the changes, including metadata and data.
>
> > The catch here would be to start the 'zfs send -f' at the same time
> > as the filesystem came online so that there weren't any transactional
> > gaps.
>
> You can always simply run 'zfs snapshot; zfs send -i ... | ssh ...' in
> a loop. This is an implementation of best-effort remote replication.
>
> Perhaps you're looking for a more real-time remote replication. See:
>
>   5036182 want remote replication (intent-log based)

Yes, I think this is what I was thinking of. If I could add a vote to it
via opensolaris.org, I would :)

Would it be possible to specify more than one remote destination for a
single replication? Or could you chain replication together, so that I
have:

hosta# zfs send -f | rsh hostb zfs receive -f
hostb# zfs send -f | rsh hostc zfs receive -f

The second would feed off the intent-log updates from the input of the
first, allowing for cascaded remote replication.

> Another possible remote replication implementation would allow the
> administrator to put a bound on how far the remote side can be out of
> date (eg. by an amount of time, or an amount of modified data). This
> could be implemented by using 'zfs send -i', with some hooks to stall
> changes to the filesystem if the 'zfs send -i' gets too far behind.

That's a great idea too :)

Darren
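[Until an intent-log-based 'zfs send -f' exists, cascading of this sort
can be approximated with the snapshot-based tools, because a received
snapshot exists on the middle host and can itself be sent onward. A
sketch; the hosts, pool, and snapshot names are hypothetical:

  hosta# zfs snapshot tank/fs@rep1
  hosta# zfs send -i tank/fs@rep0 tank/fs@rep1 | ssh hostb zfs receive tank/fs

  # hostb now has tank/fs@rep1 and can forward the same increment:
  hostb# zfs send -i tank/fs@rep0 tank/fs@rep1 | ssh hostc zfs receive tank/fs
]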
On Tue, 9 May 2006, Nicolas Williams wrote:

> Add to this some churn/replication throttling, and you may not want
> just a command-line interface but a library as well.
>
> E.g., if the stdout/remote connection of 'zfs send -f' blocked for
> long or broke, then ZFS should snapshot at the latest TXG and hold on
> to that snapshot until the output could drain and/or the connection be
> restored, then resume by sending the incremental from the current TXG
> to that snapshot...

While I agree that zfs send is incredibly useful, after reading this
post I'm asking myself:

a) This already sounds like we're descending the slippery slope of
'checkpointing' - which is an incredibly hard problem to solve and
involves considerable hardware/software resources. The only (arguably)
successful implementation of checkpointing that I know about is the
Burroughs B7700 stack-based mainframe - where every process is a stack,
and checkpointing consisted of taking a snapshot of the stack that
represents the processes and moving it to other (mirror) hardware. Much
of this is implemented in hardware to offset the excessively high "cost"
of such operations.

b) You can never successfully checkpoint an application via data
replication. Why? Because, at some point, you're trying to take a
snapshot of a process (or related processes) that modifies multiple
files representing inter-related data. That is what we have relational
databases for, and the concept of:

  begin_transaction
    do blah op a
    do blah op b
    do blah op c
  end_transaction

If anything goes wrong with operation a, b or c, you want to back out
the entire transaction. If remote data replication could be implemented
successfully, you would not need begin_transaction ... end_transaction
semantics, or (to spend the $s on) an RDBMS. Stated in different terms:
if remote replication resolved the issue of maintaining application
state, then one could simply replicate the underlying files that
represent an Oracle or MySQL database and be done with application/site
failover. Buzzzzz ... loser. Not possible.

The real issue is: where do you draw the line? And how do you manage
user expectations if the user is convinced that, by mirroring the active
filesystem, they have achieved site diversity/failover?

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
On Tue, May 09, 2006 at 04:56:05PM -0500, Al Hopper wrote:
> While I agree that zfs send is incredibly useful, after reading this
> post I'm asking myself:
>
> a) This already sounds like we're descending the slippery slope of
> 'checkpointing' - which is an incredibly hard problem to solve and
[...]

This is replication, not checkpointing.

> b) You can never successfully checkpoint an application via data
> replication. Why? Because, at some point, you're trying to take a
> snapshot of a process (or related processes) that modifies multiple
> files representing inter-related data. That is what we have relational
> databases for, and the concept of:

But if zfs send/receive can also send snapshots - that is, if I create a
snapshot on a filesystem under replication, the same snapshot should
show up at the replica - then we place the checkpointing burden on the
application/administrator.

> The real issue is: where do you draw the line?

I think live replication could be incredibly useful, lags and all,
because not everything you might replicate involves complex state, much
of what does supports journalling/rollback anyway, and you can do your
own snapshotting to deal with checkpointing. In that case live
replication merely spreads the cost of replication around, instead of
making it bursty (though it may be less efficient, since some of the
churn will be redundant).

> And how do you manage user expectations if the user is convinced that,
> by mirroring the active filesystem, they have achieved site
> diversity/failover?

How is remote replication in this regard different from local mirroring?

Nico
--