Steve McKinty
2007-Dec-13 17:06 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.

(I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.

1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.

Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.

I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.

If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Thanks

Steve
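To make the worry in (2) concrete, here is a toy sketch (no real ZFS or array code; block numbers and contents are invented) of how a block-ordered catch-up can deliver an updated überblock to the secondary before the child blocks it references:

    # Toy model of the scenario in (2): a block-ordered catch-up can deliver
    # an updated uberblock before the child blocks it points to.

    primary   = {0: "uberblock v2 -> children at 500, 900",
                 500: "child block v2", 900: "child block v2"}
    secondary = {0: "uberblock v1 -> children at 500, 900",
                 500: "child block v1", 900: "child block v1"}

    dirty = sorted(primary)        # every block changed while the link was down

    def catch_up(fail_after: int):
        """Copy changed blocks in simple block order; stop early to simulate
        a disaster at the primary part-way through the resync."""
        for copied, block in enumerate(dirty):
            if copied == fail_after:
                return
            secondary[block] = primary[block]

    catch_up(fail_after=1)         # only block 0 made it across before the disaster
    print(secondary[0])            # uberblock v2 -> children at 500, 900
    print(secondary[500])          # child block v1  (stale: the tree is inconsistent)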
Richard Elling
2007-Dec-13 20:17 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve McKinty wrote:

> I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.
>
> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.

The uberblock contains a circular queue of updates. For all practical purposes, this is COW. The updates I measure are usually 1 block (or, to put it another way, I don't recall seeing more than 1 block being updated... I'd have to recheck my data).

> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

The checksum should catch it. To be safe, there are 4 copies of the uberblock.
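Roughly sketched (illustrative Python only, not ZFS source; the 128-slot figure and the pick-the-newest-valid-entry rule are my reading of the on-disk format, and only one of the four label copies is modelled):

    import hashlib

    SLOTS = 128   # uberblock slots per label array (assumed figure)

    def checksum(payload: bytes) -> bytes:
        return hashlib.sha256(payload).digest()

    # One label's ring; the real pool keeps 4 such copies.  Entries are
    # (txg, payload-as-written, checksum-of-intended-payload), written
    # round-robin so a torn write only ever clobbers the newest slot.
    ring = [None] * SLOTS

    def commit(txg: int, payload: bytes, torn: bool = False):
        stored = checksum(payload)
        if torn:                       # simulate a partial/interrupted write
            payload = payload[: len(payload) // 2] + b"?" * (len(payload) - len(payload) // 2)
        ring[txg % SLOTS] = (txg, payload, stored)

    def active_uberblock():
        """Pick the highest-txg entry whose self-checksum still validates."""
        good = [e for e in ring if e and checksum(e[1]) == e[2]]
        return max(good, key=lambda e: e[0], default=None)

    commit(100, b"root block pointer for txg 100")
    commit(101, b"root block pointer for txg 101", torn=True)  # replica torn mid-write
    txg, payload, _ = active_uberblock()
    print(txg, payload)    # 100: falls back to the intact previous entry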
> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

I think all of these reactions to the double-failure mode are possible. The version of ZFS used will also have an impact, as the later versions are more resilient. I think that in most cases, only the affected files will be impacted. A zpool scrub will ensure that everything is consistent and mark those files which fail to checksum properly.

> Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

...databases too... It might be easier to analyze this from the perspective of the transaction group than an individual file. Since ZFS is COW, you may have a state where a transaction group is incomplete, but the previous data state should be consistent.

> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

I don't see how snapshots would help. The inherent transaction group commits should be sufficient. Or, to look at this another way, a snapshot is really just a metadata change. I am more worried about how the storage admin sets up the LUN groups. The human factor can really ruin my day...

-- richard
can you guess?
2007-Dec-13 21:16 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Great questions.

> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

ZFS already has to deal with potential uberblock partial writes if it contains multiple disk sectors (and it might be prudent even if it doesn't, as Richard's response seems to suggest). Common ways of dealing with this problem include dumping it into the log (in which case the log, with its own internal recovery procedure, becomes the real root of all evil) or cycling around at least two locations per mirror copy (Richard's response suggests that there are considerably more, and that perhaps each one is written in quadruplicate) such that the previous uberblock would still be available if the new write tanked.

ZFS-style snapshots complicate both approaches unless special provisions are taken - e.g., copying the current uberblock on each snapshot and hanging a list of these snapshot uberblock addresses off the current uberblock, though even that might run into interesting complications under the scenario which you describe below. Just using the 'queue' that Richard describes to accumulate snapshot uberblocks would limit the number of concurrent snapshots to less than the size of that queue.

In any event, as long as writes to the secondary copy don't continue after a write failure of the kind that you describe has occurred (save for the kind of catch-up procedure that you mention later), ZFS's internal facilities should not be confused by encountering a partial uberblock update at the secondary, any more than they'd be confused by encountering it on an unreplicated system after restart.

> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?
My inclination is to say "By repopulating your environment from backups": it is not reasonable to expect *any* file system to operate correctly, or to attempt any kind of comprehensive recovery (other than via something like fsck, with no guarantee of how much you'll get back), when the underlying hardware transparently reorders updates which the file system has explicitly ordered when it presented them.

But you may well be correct in suspecting that there's more potential for data loss should this occur in a ZFS environment than in update-in-place environments, where only portions of the tree structure that were explicitly changed during the connection hiatus would likely be affected by such a recovery interruption (though even there, if a directory changed enough to change its block structure on disk you could be in more trouble).

> Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.
>
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback.

You're talking about an HDS- or EMC-level snapshot, right?

> This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

It strikes me as irresponsible for a high-end storage product such as you describe neither to order its recovery in the same manner that it orders its normal operation nor to protect that recovery such that it can be virtually guaranteed to complete successfully (e.g., by taking a destination snapshot as you suggest, or by first copying and mirroring the entire set of update blocks to the destination). Are you *sure* they don't?

- bill
Ricardo M. Correia
2007-Dec-14 03:31 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve McKinty wrote:

> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

Yes, ZFS uberblocks are self-checksummed with SHA-256, and when opening the pool it uses the latest valid uberblock that it can find. So that is not a problem.

> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

I believe your understanding is correct. If you expect such a double failure, you cannot rely on being able to recover your pool at the secondary site.

The newest uberblocks would be among the first blocks to be replicated (2 of the uberblock arrays are situated at the start of the vdev), and your whole block tree might be inaccessible if the latest Meta Object Set blocks were not also replicated. You might be lucky and be able to mount your filesystems, because ZFS keeps 3 separate copies of the most important metadata and it tries to keep the copies about 1/8th of the disk apart from each other, but even then I wouldn't count on it.

If ZFS can't open the pool due to this kind of corruption, you would get the following message:

status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.

At this point, you could try zeroing out the first 2 uberblock arrays so that ZFS tries using an older uberblock from the last 2 arrays, but this might not work. As the message says, the only reliable way to recover from this is restoring your pool from backups.
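For illustration, a rough sketch of why the uberblock arrays are among the first (and last) blocks a block-ordered resync touches. This is not ZFS code; the 256 KiB per-label size is my assumption from the on-disk format description, and alignment details are ignored:

    LABEL_SIZE = 256 * 1024    # assumed per-label size; alignment details ignored

    def label_offsets(vdev_size: int) -> dict:
        """Approximate byte offsets of the four label copies: two at the
        front of the device and two at the very end."""
        return {
            "L0": 0,
            "L1": LABEL_SIZE,
            "L2": vdev_size - 2 * LABEL_SIZE,
            "L3": vdev_size - 1 * LABEL_SIZE,
        }

    # A simple block-ordered resync walks the LUN from offset 0 upward, so
    # L0/L1 (and the uberblock arrays inside them) arrive at the secondary
    # long before most of the data and indirect blocks they point to.
    for name, offset in label_offsets(146 * 1024**3).items():   # e.g. a 146 GiB LUN
        print(f"{name} starts at byte {offset}")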
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

If the replication process was interrupted for a sufficiently long time and disaster strikes at the primary site *during resync*, I don't think snapshots would save you even if you had taken them at the right time. Snapshots might increase your chances of recovery (by making ZFS not free and reuse blocks), but AFAIK there wouldn't be any guarantee that you'd be able to recover anything whatsoever, since the most important pool metadata is not part of the snapshots.

Regards,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Jim Dunham
2007-Dec-14 19:27 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve,

> I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.
>
> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?
>
> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.

For most implementations of resynchronization, not only are changes resilvered on a block-order basis, the resynchronization is also done in a single pass over the volume(s). To address the fact that resynchronization happens while additional changes are also being replicated, the concept of a resynchronization point is kept. As this resynchronization point traverses the volume from beginning to end, I/Os occurring at or before this point need to be replicated inline, whereas I/Os occurring after this point need to be marked so that they will be replicated later in block order. You are quite correct in that the data is not consistent.
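A toy sketch of that bookkeeping (illustrative Python only; this is not AVS, TrueCopy, or SRDF code, and the block numbers are made up):

    # Single-pass catch-up with a moving resynchronization point.

    dirty = {3, 7, 9}      # blocks changed while the link was down
    resync_point = -1      # everything at or below this offset has been re-sent

    def application_write(block: int) -> str:
        """A new write arriving while the catch-up sweep is still running."""
        if block <= resync_point:
            return "replicated inline (write order preserved behind the point)"
        dirty.add(block)
        return "deferred: will be re-sent later, in plain block order"

    def catch_up(volume_blocks: int):
        """One sweep from start to end; deferred blocks go out in block order."""
        global resync_point
        for block in range(volume_blocks):
            resync_point = block
            if block in dirty:
                dirty.discard(block)
                print(f"re-sent block {block} (out of its original write order)")

    print(application_write(5))    # ahead of the sweep -> deferred
    catch_up(12)
    print(application_write(5))    # behind the sweep -> replicated inline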
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

The state of the partially-resynchronized LUNs is much worse than you might realize. During active resynchronization, the remote volume contains a mixture of prior write-order-consistent data, resilvered block-order data, plus newly replicated data. Essentially the partially-resynchronized LUNs are totally inconsistent until such a time as the single pass over all data is 100% complete.

For some, but not all, replication software, if the 'catch-up' resynchronization fails, read access to the LUNs should be prevented, or at least read access while the LUNs are configured as remote mirrors. Availability Suite's Remote Mirror software (SNDR) marks such volumes as "need synchronization" and fails all application read and write I/Os.

> Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.
>
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Since Availability Suite is both Remote Mirroring and Point-in-Time Copy software, it can be configured to automatically take a snapshot prior to re-synchronization, and to automatically delete the snapshot if the resync completes successfully. The use of I/O consistency groups assures not only that the replicas are write-order consistent during replication, but also that snapshots taken prior to re-synchronization are consistent too.

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
wk: 781.442.4042
http://blogs.sun.com/avs
http://www.opensolaris.org/os/project/avs/
http://www.opensolaris.org/os/project/iscsitgt/
http://www.opensolaris.org/os/community/storage/
Steve McKinty
2007-Dec-17 14:56 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Hi all,

Thanks for the replies. To clear up one issue: when I mentioned snapshots, I *was* talking about snapshots performed by the array and/or the replication software. In order to take a ZFS snapshot I would need access to the secondary copies of the data, and that is not generally possible while replication is configured.

Thanks, Jim, for the info on the AVS snapshot; I'll look at that. As for other products, I'm almost certain that TrueCopy resyncs in block order, but Universal Replicator's journal-based pull model may maintain write-ordering under such circumstances. I'm waiting for a reply from EMC about SRDF, but the information I have so far is that it, too, reverts to block ordering during resync.

It does look as if my concerns about ZFS are valid, though. It will require special handling.

Thanks

Steve