Matt Beebe
2008-Sep-11 17:01 UTC
[zfs-discuss] Will ZFS stay consistent with AVS/ZFS and async replication
When using AVS's "Async replication with memory queue", am I guaranteed
a consistent ZFS on the distant end? The assumed failure case is that
replication broke, and now I'm trying to promote the secondary replica
with what might be stale data. Recognizing in advance that some of the
data would (obviously) be stale, my concern is whether or not ZFS stayed
consistent, or does AVS know how to "bundle" ZFS's atomic writes
properly?

-Matt
Miles Nordin
2008-Sep-11 18:52 UTC
[zfs-discuss] Will ZFS stay consistent with AVS/ZFS and async replication
>>>>> "mb" == Matt Beebe <matthew.beebe at high-eng.com> writes:mb> When using AVS''s "Async replication with memory queue", am I mb> guaranteed a consistent ZFS on the distant end? The assumed mb> failure case is that the replication broke, and now I''m trying mb> to promote the secondary replicate with what might be stale mb> data. Recognizing in advance that some of the data would be mb> (obviously) stale, mb> my concern is whether or not ZFS stayed consistent, or does mb> AVS know how to "bundle" ZFS''s atomic writes properly? Assuming the ZFS claims of ``always consistent on disk'''' are true (or are fixed to be true), all that''s required is to write the updates in time order. simoncr was saying in the thread that Maurice quoted: http://www.opensolaris.org/jive/thread.jspa?threadID=68881&tstart=30 that during a partial-resync after a loss of connectivity AVS writes in LBA order while DRBD writes in time order. The thread was about resyncing and restoring replication, not about broken async replication. The DRBD virtue here is if you start a resync and want to abandon it---if the resync took a long time, or the network failed permanently half way through resync---something like that. With DRBD it''s possible to give up, discard the unsync''d data, and bring up the cluster on the partially-updated sync-target. With AVS and LBA-order resync, you have the ``give up'''' option only before you begin the resync: the proposed sync target doesn''t have the latest data on it, but it''s mountable. You lose some protection by agreeing to start a sync: after you begin, the sync target is totally inconsistent and unmountable until the sync completes successfully. so, if the sync source node were destroyed or a crappy network connection went fully down during the resync, you lose everything! DRBD''s way sounds like a clear and very simple win at first, but makes me ask: 1. DRBD cannot _really_ write in time order because (a) it would mean making a write barrier between each sector and (b) there isn''t a fixed time order to begin with because block layers and even some disks allow multiple outstanding commands. Does he mean DRBD stores the write barriers in its dirty-log and implements them during resync? In this case, the target will NOT be a point-in-time copy of a past source volume, it''ll just be ``correct'''' w.r.t. the barrier rules. I could imagine this working in a perfect world,...or, at least, a well-tested well-integrated world. In our world, that strategy could make for an interesting test of filesystem bugs w.r.t. write barriers---are they truly issuing all the barriers needed for formal correctness, or are they unknowingly dependent on the 95th-percentile habits of real-world disks? What if you have some mistake that is blocking write barriers entirely (like LVM2/devicemapper)---on real disks it might just cause some database corruption, but DRBD implementing this rule precisely could imagineably degrade to the AVS case, and write two days of stale data in LBA order because it hasn''t seen a write barrier in two days! 2. on DRBD''s desired feature list is: to replicate sets of disks rather than individual disks, keeping them all in sync. ZFS probably tends to: (a) write Green blocks (b) issue barriers to all disks in a vdev (c) write Orange blocks (d) wait until the last disk has acknowledged its barrier (e) write Red blocks After this pattern it''s true pool-wide (not disk-wide) that no Red blocks will be written on any disk unless all Green blocks have been written to all disks. 
AIUI, DRBD can't preserve this right now. It resynchronizes disks
independently, not in sets.

Getting back to your question, I'd guess that running in async mode is
like you are constantly resynchronizing, and an ordinary cluster
failover in async mode is equivalent to an interrupted resync.

so, AVS doesn't implement (1) during a regular resync. But maybe for a
cluster that's online in async mode it DOES implement (1)?

HOWEVER, even if AVS implemented a (1)-like DRBD policy when it's in
``async'' mode (I don't know that it does), I can't imagine that it
would manage (2) correctly. Does AVS have any concept of ``async disk
sets'', where write barriers have a meaning across disks? I can't
imagine it existing without a configuration knob for it. And ZFS needs
(2).

I would expect AVS ``sync'' mode to provide (1) and (2), so the
question is only about ``async'' mode failovers.

so... based on my reasoning, it's UNSAFE to use AVS in async mode for
ZFS replication on any pool which needs more than 1 device to have
``sufficient replicas''. A single device would meet that requirement,
and so would a pool containing a single mirror vdev with two devices.

I've no particular knowledge of AVS at all though, besides what we've
all read here.
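To make that last paragraph concrete, a sketch of pool shapes the
argument above would treat as safe or unsafe under per-device async
replication (device names are placeholders; each zpool create line is a
separate alternative):

    # shapes where any single consistently-replicated device holds a
    # complete copy, so per-device write ordering is enough:
    zpool create tank c1t0d0                       # single-device pool
    zpool create tank mirror c1t0d0 c1t1d0         # one two-way mirror vdev

    # shapes that need more than one device for ``sufficient replicas'',
    # and so depend on write ordering being preserved across several
    # replicated devices at once:
    zpool create tank c1t0d0 c1t1d0                # two top-level vdevs
    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0   # raidz vdev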
Jim Dunham
2008-Sep-12 21:31 UTC
[zfs-discuss] Will ZFS stay consistent with AVS/ZFS and async replication
Miles,

> With AVS and LBA-order resync, you have the ``give up'' option only
> before you begin the resync: the proposed sync target doesn't have
> the latest data on it, but it's mountable. You lose some protection
> by agreeing to start a sync: after you begin, the sync target is
> totally inconsistent and unmountable until the sync completes
> successfully. so, if the sync source node were destroyed or a crappy
> network connection went fully down during the resync, you lose
> everything!

To address this issue there is a feature called ndr_ii. This is an
automatic snapshot taken before resynchronization starts, so that on
the remote node there is always a write-order consistent volume
available. If replication stops, is taking too long, etc., the snapshot
can be restored, so that one does not lose everything.

> HOWEVER, even if AVS implemented a (1)-like DRBD policy when it's in
> ``async'' mode (I don't know that it does), I can't imagine that it
> would manage (2) correctly. Does AVS have any concept of ``async disk
> sets'', where write barriers have a meaning across disks?

AVS has the concept of I/O consistency groups, where all disks of a
multi-volume filesystem (ZFS, QFS) or database (Oracle, Sybase) are
kept write-order consistent when using either sync or async
replication.

> so... based on my reasoning, it's UNSAFE to use AVS in async mode for
> ZFS replication on any pool which needs more than 1 device to have
> ``sufficient replicas''.
>
> I've no particular knowledge of AVS at all though, besides what we've
> all read here.

I can surely help with this: http://docs.sun.com/app/docs?p=coll%2FAVS4.0

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
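As a concrete illustration of the consistency groups Jim describes, here
is a hedged sketch of enabling two SNDR sets (one per pool device) into
a single group. The hostnames, device paths, bitmap volumes and the
group name "tank" are placeholders, so check sndradm(1M) for the
authoritative set syntax:

    # primary node: one SNDR set per pool device, all in group "tank",
    # so AVS keeps write ordering across the whole group rather than
    # per individual disk.
    sndradm -n -e primhost /dev/rdsk/c1t0d0s0 /dev/rdsk/c2t0d0s0 \
               sechost /dev/rdsk/c1t0d0s0 /dev/rdsk/c2t0d0s0 ip async g tank
    sndradm -n -e primhost /dev/rdsk/c1t1d0s0 /dev/rdsk/c2t1d0s0 \
               sechost /dev/rdsk/c1t1d0s0 /dev/rdsk/c2t1d0s0 ip async g tank

With this in place, group operations such as logging mode or a resync
act on both sets together rather than on each disk independently.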
>>>>> "jd" == Jim Dunham <James.Dunham at Sun.COM> writes:jd> If at the time the SNDR replica is deleted the set was jd> actively replicating, along with ZFS actively writing to the jd> ZFS storage pool, I/O consistency will be lost, leaving ZFS jd> storage pool in an indeterministic state on the remote node. jd> To address this issue, prior to deleting the replicas, the jd> replica should be placed into logging mode first. What if you stop the replication by breaking the network connection between primary and replica? consistent or inconsistent? it sounds fishy, like ``we''re always-consistent-on-disk with ZFS, but please use ''zpool offline'' to avoid disastrous pool corruption.'''' jd> ndr_ii. This is an automatic snapshot taken before jd> resynchronization starts, yeah that sounds fine, possibly better than DRBD in one way because it might allow the resync to go faster. From the PDF''s it sounds like async replication isn''t done the same way as the resync, it''s done safely, and that it''s even possible for async replication to accumulate hours of backlog in a ``disk queue'''' without losing write ordering so long as you use the ``blocking mode'''' variant of async. ii might also be good for debugging a corrupt ZFS, so you can tinker with it but still roll back to the original corrupt copy. I''ll read about it---I''m guessing I will need to prepare ahead of time if I want ii available in the toolbox after a disaster. jd> AVS has the concept of I/O consistency groups, where all disks jd> of a multi-volume filesystem (ZFS, QFS) or database (Oracle, jd> Sybase) are kept write-order consistent when using either sync jd> or async replication. Awesome, so long as people know to use it. so I guess that''s the answer for the OP: use consistency groups! The one thing I worry about is, before, AVS was used between RAID and filesystem, which is impossible now because that inter-layer area n olonger exists. If you put the individual device members of a redundant zpool vdev into an AVS consistency group, what will AVS do when one of the devices fails? Does it continue replicating the working devices and ignore the failed one? This would sacrifice redundancy at the DR site. UFS-AVS-RAID would not do that in the same situation. Or hide the failed device from ZFS and slow things down by sending all read/writes of the failed device to the remote mirror? This would slwo down the primary site. UFS-AVS-RAID would not do that in the same situation. The latter ZFS-AVS behavior might be rescueable, if ZFS had the statistical read-preference feature. but writes would still be massively slowed with this scenario, while in UFS-AVS-RAID they would not be. To get back the level of control one used to have for writes, you''d need a different zpool-level way to achieve the intent of the AVS sync/async option. Maybe just a slog which is not AVS-replicated would be enough, modulo other ZFS fixes for hiding slow devices. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080916/b42c62ae/attachment.bin>
On Sep 16, 2008, at 5:39 PM, Miles Nordin wrote:

> What if you stop the replication by breaking the network connection
> between primary and replica? consistent or inconsistent?

Consistent.

> it sounds fishy, like ``we're always-consistent-on-disk with ZFS, but
> please use 'zpool offline' to avoid disastrous pool corruption.''

This is not the case at all. Maintaining I/O consistency of all volumes
in a single I/O consistency group is an attribute of replication. The
instant an SNDR replica is deleted, that volume is no longer being
replicated, and it becomes inconsistent with all other write-ordered
volumes. By placing all volumes in the I/O consistency group in logging
mode, not 'zpool offline', and then deleting the replica, there is no
means for any of the remote volumes to become I/O inconsistent.

Yes, one will note that there is a group disable command
"sndradm -g <group-name> -d", but it was implemented for ease of
administration, not for performing a write-order coordinated disable.

> From the PDFs it sounds like async replication isn't done the same
> way as the resync, it's done safely, and that it's even possible for
> async replication to accumulate hours of backlog in a ``disk queue''
> without losing write ordering so long as you use the ``blocking
> mode'' variant of async.

Correct reading of the documentation.

> Awesome, so long as people know to use it. so I guess that's the
> answer for the OP: use consistency groups!

I use the name of the ZFS storage pool as the name of the SNDR I/O
consistency group.

> If you put the individual device members of a redundant zpool vdev
> into an AVS consistency group, what will AVS do when one of the
> devices fails?

Nothing, as it is ZFS that reacts to the failed device.

> Does it continue replicating the working devices and ignore the
> failed one?

In this scenario ZFS knows the device failed, which means ZFS will stop
writing to the disk, and thus to the replica.

> To get back the level of control one used to have for writes, you'd
> need a different zpool-level way to achieve the intent of the AVS
> sync/async option. Maybe just a slog which is not AVS-replicated
> would be enough, modulo other ZFS fixes for hiding slow devices.

ZFS-AVS is not UFS-AVS-RAID, and although one can foresee some
downsides to replicating ZFS with AVS, there are some big wins:

- Place SNDR in logging mode, zpool scrub the secondary volumes for
  consistency, then resume replication.
- Compressed ZFS storage pools result in compressed replication.
- Encrypted ZFS storage pools result in encrypted replication.

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
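A hedged sketch of the scrub-and-resume cycle in the first item above
(the pool and group name "tank" is a placeholder; the flags are one
reading of the AVS docs, so verify against sndradm(1M) and zpool(1M)):

    # primary: pause replication at a write-order consistent point
    sndradm -n -g tank -l
    # secondary: the replicated devices now hold an importable pool image
    zpool import tank
    zpool scrub tank      # verify checksums on the secondary copies
    zpool export tank     # release the devices before replication resumes
    # primary: resume with an update resynchronization of the group
    sndradm -n -g tank -u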