Steve McKinty
2007-Dec-13 17:06 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.

(I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.

1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.

Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.

I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.

If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Thanks

Steve
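To make the worry in (2) concrete, here is a toy sketch (no real ZFS or array code; block numbers and contents are invented) of how a block-ordered catch-up can deliver an updated überblock to the secondary before the child blocks it references:

    # Toy model of the scenario in (2): a block-ordered catch-up can deliver
    # an updated uberblock before the child blocks it points to.

    primary   = {0: "uberblock v2 -> children at 500, 900",
                 500: "child block v2", 900: "child block v2"}
    secondary = {0: "uberblock v1 -> children at 500, 900",
                 500: "child block v1", 900: "child block v1"}

    dirty = sorted(primary)        # every block changed while the link was down

    def catch_up(fail_after: int):
        """Copy changed blocks in simple block order; stop early to simulate
        a disaster at the primary part-way through the resync."""
        for copied, block in enumerate(dirty):
            if copied == fail_after:
                return
            secondary[block] = primary[block]

    catch_up(fail_after=1)         # only block 0 made it across before the disaster
    print(secondary[0])            # uberblock v2 -> children at 500, 900
    print(secondary[500])          # child block v1  (stale: the tree is inconsistent)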
Richard Elling
2007-Dec-13 20:17 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve McKinty wrote:

> I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.
>
> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.

The uberblock contains a circular queue of updates. For all practical purposes, this is COW. The updates I measure are usually 1 block (or, to put it another way, I don't recall seeing more than 1 block being updated... I'd have to recheck my data).

> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

The checksum should catch it. To be safe, there are 4 copies of the uberblock.
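Roughly sketched (illustrative Python only, not ZFS source; the 128-slot figure and the pick-the-newest-valid-entry rule are my reading of the on-disk format, and only one of the four label copies is modelled):

    import hashlib

    SLOTS = 128   # uberblock slots per label array (assumed figure)

    def checksum(payload: bytes) -> bytes:
        return hashlib.sha256(payload).digest()

    # One label's ring; the real pool keeps 4 such copies.  Entries are
    # (txg, payload-as-written, checksum-of-intended-payload), written
    # round-robin so a torn write only ever clobbers the newest slot.
    ring = [None] * SLOTS

    def commit(txg: int, payload: bytes, torn: bool = False):
        stored = checksum(payload)
        if torn:                       # simulate a partial/interrupted write
            payload = payload[: len(payload) // 2] + b"?" * (len(payload) - len(payload) // 2)
        ring[txg % SLOTS] = (txg, payload, stored)

    def active_uberblock():
        """Pick the highest-txg entry whose self-checksum still validates."""
        good = [e for e in ring if e and checksum(e[1]) == e[2]]
        return max(good, key=lambda e: e[0], default=None)

    commit(100, b"root block pointer for txg 100")
    commit(101, b"root block pointer for txg 101", torn=True)  # replica torn mid-write
    txg, payload, _ = active_uberblock()
    print(txg, payload)    # 100: falls back to the intact previous entry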
> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

I think all of these reactions to the double-failure mode are possible. The version of ZFS used will also have an impact, as the later versions are more resilient. I think that in most cases, only the affected files will be impacted. A zpool scrub will ensure that everything is consistent and mark those files which fail to checksum properly.

> Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

...databases too... It might be easier to analyze this from the perspective of the transaction group than an individual file. Since ZFS is COW, you may have a state where a transaction group is incomplete, but the previous data state should be consistent.

> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

I don't see how snapshots would help. The inherent transaction group commits should be sufficient. Or, to look at this another way, a snapshot is really just a metadata change. I am more worried about how the storage admin sets up the LUN groups. The human factor can really ruin my day...

-- richard
can you guess?
2007-Dec-13 21:16 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Great questions.

> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

ZFS already has to deal with potential uberblock partial writes if it contains multiple disk sectors (and it might be prudent even if it doesn't, as Richard's response seems to suggest). Common ways of dealing with this problem include dumping it into the log (in which case the log, with its own internal recovery procedure, becomes the real root of all evil) or cycling around at least two locations per mirror copy (Richard's response suggests that there are considerably more, and that perhaps each one is written in quadruplicate) such that the previous uberblock would still be available if the new write tanked.

ZFS-style snapshots complicate both approaches unless special provisions are taken - e.g., copying the current uberblock on each snapshot and hanging a list of these snapshot uberblock addresses off the current uberblock, though even that might run into interesting complications under the scenario which you describe below. Just using the 'queue' that Richard describes to accumulate snapshot uberblocks would limit the number of concurrent snapshots to less than the size of that queue.

In any event, as long as writes to the secondary copy don't continue after a write failure of the kind that you describe has occurred (save for the kind of catch-up procedure that you mention later), ZFS's internal facilities should not be confused by encountering a partial uberblock update at the secondary, any more than they'd be confused by encountering it on an unreplicated system after restart.

> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?
My inclination is to say "By repopulating your environment from backups": it is not reasonable to expect *any* file system to operate correctly, or to attempt any kind of comprehensive recovery (other than via something like fsck, with no guarantee of how much you'll get back), when the underlying hardware transparently reorders updates which the file system has explicitly ordered when it presented them.

But you may well be correct in suspecting that there's more potential for data loss should this occur in a ZFS environment than in update-in-place environments, where only portions of the tree structure that were explicitly changed during the connection hiatus would likely be affected by such a recovery interruption (though even there, if a directory changed enough to change its block structure on disk you could be in more trouble).

> Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.
>
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback.

You're talking about an HDS- or EMC-level snapshot, right?

> This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

It strikes me as irresponsible for a high-end storage product such as you describe neither to order its recovery in the same manner that it orders its normal operation nor to protect that recovery such that it can be virtually guaranteed to complete successfully (e.g., by taking a destination snapshot as you suggest, or by first copying and mirroring the entire set of update blocks to the destination). Are you *sure* they don't?

- bill
Ricardo M. Correia
2007-Dec-14 03:31 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve McKinty wrote:

> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

Yes, ZFS uberblocks are self-checksummed with SHA-256, and when opening the pool it uses the latest valid uberblock that it can find. So that is not a problem.

> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

I believe your understanding is correct. If you expect such a double failure, you cannot rely on being able to recover your pool at the secondary site.

The newest uberblocks would be among the first blocks to be replicated (2 of the uberblock arrays are situated at the start of the vdev), and your whole block tree might be inaccessible if the latest Meta Object Set blocks were not also replicated. You might be lucky and be able to mount your filesystems, because ZFS keeps 3 separate copies of the most important metadata and it tries to keep the copies about 1/8th of the disk apart from each other, but even then I wouldn't count on it.

If ZFS can't open the pool due to this kind of corruption, you would get the following message:

status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.

At this point, you could try zeroing out the first 2 uberblock arrays so that ZFS tries using an older uberblock from the last 2 arrays, but this might not work. As the message says, the only reliable way to recover from this is restoring your pool from backups.
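For illustration, a rough sketch of why the uberblock arrays are among the first (and last) blocks a block-ordered resync touches. This is not ZFS code; the 256 KiB per-label size is my assumption from the on-disk format description, and alignment details are ignored:

    LABEL_SIZE = 256 * 1024    # assumed per-label size; alignment details ignored

    def label_offsets(vdev_size: int) -> dict:
        """Approximate byte offsets of the four label copies: two at the
        front of the device and two at the very end."""
        return {
            "L0": 0,
            "L1": LABEL_SIZE,
            "L2": vdev_size - 2 * LABEL_SIZE,
            "L3": vdev_size - 1 * LABEL_SIZE,
        }

    # A simple block-ordered resync walks the LUN from offset 0 upward, so
    # L0/L1 (and the uberblock arrays inside them) arrive at the secondary
    # long before most of the data and indirect blocks they point to.
    for name, offset in label_offsets(146 * 1024**3).items():   # e.g. a 146 GiB LUN
        print(f"{name} starts at byte {offset}")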
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

If the replication process was interrupted for a sufficiently long time and disaster strikes at the primary site *during resync*, I don't think snapshots would save you even if you had taken them at the right time. Snapshots might increase your chances of recovery (by making ZFS not free and reuse blocks), but AFAIK there wouldn't be any guarantee that you'd be able to recover anything whatsoever, since the most important pool metadata is not part of the snapshots.

Regards,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Jim Dunham
2007-Dec-14 19:27 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve,

> I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.
>
> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?
>
> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.

For most implementations of resynchronization, not only are changes resilvered on a block-order basis, the resynchronization is also done in a single pass over the volume(s). To address the fact that resynchronization happens while additional changes are also being replicated, the concept of a resynchronization point is kept. As this resynchronization point traverses the volume from beginning to end, I/Os occurring at or before this point need to be replicated inline, whereas I/Os occurring after this point need to be marked so that they will be replicated later in block order. You are quite correct in that the data is not consistent.
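A toy sketch of that bookkeeping (illustrative Python only; this is not AVS, TrueCopy, or SRDF code, and the block numbers are made up):

    # Single-pass catch-up with a moving resynchronization point.

    dirty = {3, 7, 9}      # blocks changed while the link was down
    resync_point = -1      # everything at or below this offset has been re-sent

    def application_write(block: int) -> str:
        """A new write arriving while the catch-up sweep is still running."""
        if block <= resync_point:
            return "replicated inline (write order preserved behind the point)"
        dirty.add(block)
        return "deferred: will be re-sent later, in plain block order"

    def catch_up(volume_blocks: int):
        """One sweep from start to end; deferred blocks go out in block order."""
        global resync_point
        for block in range(volume_blocks):
            resync_point = block
            if block in dirty:
                dirty.discard(block)
                print(f"re-sent block {block} (out of its original write order)")

    print(application_write(5))    # ahead of the sweep -> deferred
    catch_up(12)
    print(application_write(5))    # behind the sweep -> replicated inline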
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

The state of the partially-resynchronized LUNs is much worse than you might realize. During active resynchronization, the remote volume contains a mixture of prior write-order-consistent data, resilvered block-order data, plus newly replicated data. Essentially the partially-resynchronized LUNs are totally inconsistent until such a time as the single pass over all data is 100% complete.

For some, but not all, replication software, if the 'catch-up' resynchronization fails, read access to the LUNs should be prevented, or at least read access while the LUNs are configured as remote mirrors. Availability Suite's Remote Mirror software (SNDR) marks such volumes as "need synchronization" and fails all application read and write I/Os.

> Obviously all filesystems can suffer in this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.
>
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting the resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshots of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Since Availability Suite is both Remote Mirroring and Point-in-Time Copy software, it can be configured to automatically take a snapshot prior to re-synchronization, and to automatically delete the snapshot if the resync completes successfully. The use of I/O consistency groups assures not only that the replicas are write-order consistent during replication, but also that snapshots taken prior to re-synchronization are consistent too.

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
wk: 781.442.4042
http://blogs.sun.com/avs
http://www.opensolaris.org/os/project/avs/
http://www.opensolaris.org/os/project/iscsitgt/
http://www.opensolaris.org/os/community/storage/
Steve McKinty
2007-Dec-17 14:56 UTC
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Hi all,

Thanks for the replies. To clear up one issue: when I mentioned snapshots, I *was* talking about snapshots performed by the array and/or the replication software. In order to take a ZFS snapshot I would need access to the secondary copies of the data, and that is not generally possible while replication is configured.

Thanks, Jim, for the info on the AVS snapshot; I'll look at that. As for other products, I'm almost certain that TrueCopy resyncs in block order, but Universal Replicator's journal-based pull model may maintain write-ordering under such circumstances. I'm waiting for a reply from EMC about SRDF, but the information I have so far is that it, too, reverts to block ordering during resync.

It does look as if my concerns about ZFS are valid, though. It will require special handling.

Thanks

Steve