Matt Beebe
2008-Sep-11 17:01 UTC
[zfs-discuss] Will ZFS stay consistent with AVS/ZFS and async replication
When using AVS's "Async replication with memory queue", am I guaranteed
a consistent ZFS on the distant end? The assumed failure case is that
replication broke, and now I'm trying to promote the secondary replica
with what might be stale data. Recognizing in advance that some of the
data would (obviously) be stale, my concern is whether or not ZFS stayed
consistent, or does AVS know how to "bundle" ZFS's atomic writes
properly?

-Matt
Miles Nordin
2008-Sep-11 18:52 UTC
[zfs-discuss] Will ZFS stay consistent with AVS/ZFS and async replication
>>>>> "mb" == Matt Beebe <matthew.beebe at high-eng.com> writes:mb> When using AVS''s "Async replication with memory queue", am I mb> guaranteed a consistent ZFS on the distant end? The assumed mb> failure case is that the replication broke, and now I''m trying mb> to promote the secondary replicate with what might be stale mb> data. Recognizing in advance that some of the data would be mb> (obviously) stale, mb> my concern is whether or not ZFS stayed consistent, or does mb> AVS know how to "bundle" ZFS''s atomic writes properly? Assuming the ZFS claims of ``always consistent on disk'''' are true (or are fixed to be true), all that''s required is to write the updates in time order. simoncr was saying in the thread that Maurice quoted: http://www.opensolaris.org/jive/thread.jspa?threadID=68881&tstart=30 that during a partial-resync after a loss of connectivity AVS writes in LBA order while DRBD writes in time order. The thread was about resyncing and restoring replication, not about broken async replication. The DRBD virtue here is if you start a resync and want to abandon it---if the resync took a long time, or the network failed permanently half way through resync---something like that. With DRBD it''s possible to give up, discard the unsync''d data, and bring up the cluster on the partially-updated sync-target. With AVS and LBA-order resync, you have the ``give up'''' option only before you begin the resync: the proposed sync target doesn''t have the latest data on it, but it''s mountable. You lose some protection by agreeing to start a sync: after you begin, the sync target is totally inconsistent and unmountable until the sync completes successfully. so, if the sync source node were destroyed or a crappy network connection went fully down during the resync, you lose everything! DRBD''s way sounds like a clear and very simple win at first, but makes me ask: 1. DRBD cannot _really_ write in time order because (a) it would mean making a write barrier between each sector and (b) there isn''t a fixed time order to begin with because block layers and even some disks allow multiple outstanding commands. Does he mean DRBD stores the write barriers in its dirty-log and implements them during resync? In this case, the target will NOT be a point-in-time copy of a past source volume, it''ll just be ``correct'''' w.r.t. the barrier rules. I could imagine this working in a perfect world,...or, at least, a well-tested well-integrated world. In our world, that strategy could make for an interesting test of filesystem bugs w.r.t. write barriers---are they truly issuing all the barriers needed for formal correctness, or are they unknowingly dependent on the 95th-percentile habits of real-world disks? What if you have some mistake that is blocking write barriers entirely (like LVM2/devicemapper)---on real disks it might just cause some database corruption, but DRBD implementing this rule precisely could imagineably degrade to the AVS case, and write two days of stale data in LBA order because it hasn''t seen a write barrier in two days! 2. on DRBD''s desired feature list is: to replicate sets of disks rather than individual disks, keeping them all in sync. ZFS probably tends to: (a) write Green blocks (b) issue barriers to all disks in a vdev (c) write Orange blocks (d) wait until the last disk has acknowledged its barrier (e) write Red blocks After this pattern it''s true pool-wide (not disk-wide) that no Red blocks will be written on any disk unless all Green blocks have been written to all disks. 
AIUI, DRBD can't preserve this right now. It resynchronizes disks
independently, not in sets.

Getting back to your question, I'd guess that running in async mode is
like you are constantly resynchronizing, and an ordinary cluster
failover in async mode is equivalent to an interrupted resync.

so, AVS doesn't implement (1) during a regular resync. But maybe for a
cluster that's online in async mode it DOES implement (1)?

HOWEVER, even if AVS implemented a (1)-like DRBD policy when it's in
``async'' mode (I don't know that it does), I can't imagine that it
would manage (2) correctly. Does AVS have any concept of ``async disk
sets'', where write barriers have a meaning across disks? I can't
imagine it existing without a configuration knob for it. And ZFS needs
(2).

I would expect AVS ``sync'' mode to provide (1) and (2), so the
question is only about ``async'' mode failovers.

so... based on my reasoning, it's UNSAFE to use AVS in async mode for
ZFS replication on any pool which needs more than 1 device to have
``sufficient replicas''. A single device would meet that requirement,
and so would a pool containing a single mirror vdev with two devices.

I've no particular knowledge of AVS at all though, besides what we've
all read here.
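To make that last paragraph concrete, a sketch of pool shapes the
argument above would treat as safe or unsafe under per-device async
replication (device names are placeholders; each zpool create line is a
separate alternative):

    # shapes where any single consistently-replicated device holds a
    # complete copy, so per-device write ordering is enough:
    zpool create tank c1t0d0                       # single-device pool
    zpool create tank mirror c1t0d0 c1t1d0         # one two-way mirror vdev

    # shapes that need more than one device for ``sufficient replicas'',
    # and so depend on write ordering being preserved across several
    # replicated devices at once:
    zpool create tank c1t0d0 c1t1d0                # two top-level vdevs
    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0   # raidz vdev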
Jim Dunham
2008-Sep-12 21:31 UTC
[zfs-discuss] Will ZFS stay consistent with AVS/ZFS and async replication
Miles,

> With AVS and LBA-order resync, you have the ``give up'' option only
> before you begin the resync: the proposed sync target doesn't have
> the latest data on it, but it's mountable. You lose some protection
> by agreeing to start a sync: after you begin, the sync target is
> totally inconsistent and unmountable until the sync completes
> successfully. so, if the sync source node were destroyed or a crappy
> network connection went fully down during the resync, you lose
> everything!

To address this issue there is a feature called ndr_ii. This is an
automatic snapshot taken before resynchronization starts, so that on
the remote node there is always a write-order consistent volume
available. If replication stops, is taking too long, etc., the snapshot
can be restored, so that one does not lose everything.

> HOWEVER, even if AVS implemented a (1)-like DRBD policy when it's in
> ``async'' mode (I don't know that it does), I can't imagine that it
> would manage (2) correctly. Does AVS have any concept of ``async disk
> sets'', where write barriers have a meaning across disks?

AVS has the concept of I/O consistency groups, where all disks of a
multi-volume filesystem (ZFS, QFS) or database (Oracle, Sybase) are
kept write-order consistent when using either sync or async
replication.

> so... based on my reasoning, it's UNSAFE to use AVS in async mode for
> ZFS replication on any pool which needs more than 1 device to have
> ``sufficient replicas''.
>
> I've no particular knowledge of AVS at all though, besides what we've
> all read here.

I can surely help with this: http://docs.sun.com/app/docs?p=coll%2FAVS4.0

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
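As a concrete illustration of the consistency groups Jim describes, here
is a hedged sketch of enabling two SNDR sets (one per pool device) into
a single group. The hostnames, device paths, bitmap volumes and the
group name "tank" are placeholders, so check sndradm(1M) for the
authoritative set syntax:

    # primary node: one SNDR set per pool device, all in group "tank",
    # so AVS keeps write ordering across the whole group rather than
    # per individual disk.
    sndradm -n -e primhost /dev/rdsk/c1t0d0s0 /dev/rdsk/c2t0d0s0 \
               sechost /dev/rdsk/c1t0d0s0 /dev/rdsk/c2t0d0s0 ip async g tank
    sndradm -n -e primhost /dev/rdsk/c1t1d0s0 /dev/rdsk/c2t1d0s0 \
               sechost /dev/rdsk/c1t1d0s0 /dev/rdsk/c2t1d0s0 ip async g tank

With this in place, group operations such as logging mode or a resync
act on both sets together rather than on each disk independently.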
>>>>> "jd" == Jim Dunham <James.Dunham at Sun.COM> writes:jd> If at the time the SNDR replica is deleted the set was jd> actively replicating, along with ZFS actively writing to the jd> ZFS storage pool, I/O consistency will be lost, leaving ZFS jd> storage pool in an indeterministic state on the remote node. jd> To address this issue, prior to deleting the replicas, the jd> replica should be placed into logging mode first. What if you stop the replication by breaking the network connection between primary and replica? consistent or inconsistent? it sounds fishy, like ``we''re always-consistent-on-disk with ZFS, but please use ''zpool offline'' to avoid disastrous pool corruption.'''' jd> ndr_ii. This is an automatic snapshot taken before jd> resynchronization starts, yeah that sounds fine, possibly better than DRBD in one way because it might allow the resync to go faster. From the PDF''s it sounds like async replication isn''t done the same way as the resync, it''s done safely, and that it''s even possible for async replication to accumulate hours of backlog in a ``disk queue'''' without losing write ordering so long as you use the ``blocking mode'''' variant of async. ii might also be good for debugging a corrupt ZFS, so you can tinker with it but still roll back to the original corrupt copy. I''ll read about it---I''m guessing I will need to prepare ahead of time if I want ii available in the toolbox after a disaster. jd> AVS has the concept of I/O consistency groups, where all disks jd> of a multi-volume filesystem (ZFS, QFS) or database (Oracle, jd> Sybase) are kept write-order consistent when using either sync jd> or async replication. Awesome, so long as people know to use it. so I guess that''s the answer for the OP: use consistency groups! The one thing I worry about is, before, AVS was used between RAID and filesystem, which is impossible now because that inter-layer area n olonger exists. If you put the individual device members of a redundant zpool vdev into an AVS consistency group, what will AVS do when one of the devices fails? Does it continue replicating the working devices and ignore the failed one? This would sacrifice redundancy at the DR site. UFS-AVS-RAID would not do that in the same situation. Or hide the failed device from ZFS and slow things down by sending all read/writes of the failed device to the remote mirror? This would slwo down the primary site. UFS-AVS-RAID would not do that in the same situation. The latter ZFS-AVS behavior might be rescueable, if ZFS had the statistical read-preference feature. but writes would still be massively slowed with this scenario, while in UFS-AVS-RAID they would not be. To get back the level of control one used to have for writes, you''d need a different zpool-level way to achieve the intent of the AVS sync/async option. Maybe just a slog which is not AVS-replicated would be enough, modulo other ZFS fixes for hiding slow devices. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080916/b42c62ae/attachment.bin>
On Sep 16, 2008, at 5:39 PM, Miles Nordin wrote:

> What if you stop the replication by breaking the network connection
> between primary and replica? consistent or inconsistent?

Consistent.

> it sounds fishy, like ``we're always-consistent-on-disk with ZFS, but
> please use 'zpool offline' to avoid disastrous pool corruption.''

This is not the case at all. Maintaining I/O consistency of all volumes
in a single I/O consistency group is an attribute of replication. The
instant an SNDR replica is deleted, that volume is no longer being
replicated, and it becomes inconsistent with all other write-ordered
volumes. By placing all volumes in the I/O consistency group in logging
mode, not 'zpool offline', and then deleting the replica, there is no
means for any of the remote volumes to become I/O inconsistent.

Yes, one will note that there is a group disable command
"sndradm -g <group-name> -d", but it was implemented for ease of
administration, not for performing a write-order coordinated disable.

> From the PDFs it sounds like async replication isn't done the same
> way as the resync, it's done safely, and that it's even possible for
> async replication to accumulate hours of backlog in a ``disk queue''
> without losing write ordering so long as you use the ``blocking
> mode'' variant of async.

Correct reading of the documentation.

> Awesome, so long as people know to use it. so I guess that's the
> answer for the OP: use consistency groups!

I use the name of the ZFS storage pool as the name of the SNDR I/O
consistency group.

> If you put the individual device members of a redundant zpool vdev
> into an AVS consistency group, what will AVS do when one of the
> devices fails?

Nothing, as it is ZFS that reacts to the failed device.

> Does it continue replicating the working devices and ignore the
> failed one?

In this scenario ZFS knows the device failed, which means ZFS will stop
writing to the disk, and thus to the replica.

> To get back the level of control one used to have for writes, you'd
> need a different zpool-level way to achieve the intent of the AVS
> sync/async option. Maybe just a slog which is not AVS-replicated
> would be enough, modulo other ZFS fixes for hiding slow devices.

ZFS-AVS is not UFS-AVS-RAID, and although one can foresee some
downsides to replicating ZFS with AVS, there are some big wins:

- Place SNDR in logging mode, zpool scrub the secondary volumes for
  consistency, then resume replication.
- Compressed ZFS storage pools result in compressed replication.
- Encrypted ZFS storage pools result in encrypted replication.

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
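A hedged sketch of the scrub-and-resume cycle in the first item above
(the pool and group name "tank" is a placeholder; the flags are one
reading of the AVS docs, so verify against sndradm(1M) and zpool(1M)):

    # primary: pause replication at a write-order consistent point
    sndradm -n -g tank -l
    # secondary: the replicated devices now hold an importable pool image
    zpool import tank
    zpool scrub tank      # verify checksums on the secondary copies
    zpool export tank     # release the devices before replication resumes
    # primary: resume with an update resynchronization of the group
    sndradm -n -g tank -u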