thr3ads.net - zfs discuss - [zfs-discuss] ZFS Replication Question [Oct 2008]

Paul Pilcher

2008-Oct-09 11:33 UTC

[zfs-discuss] ZFS Replication Question

All;

I have a question about ZFS and how it protects data integrity in the 
context of a replication scenario.

First, ZFS is designed such that all data on disk is in a consistent 
state.  Likewise, all data in a ZFS snapshot on disk is in a consistent 
state.  Further, ZFS, by virtue of its 256 bit checksums is capable of 
finding and repairing data corruption should it occur.

In the case of a snapshot send/receive, what happens if the snap is 
corrupted when sent to a remote system?  Would ZFS identify this?  If it 
does identify the corruption what is done to recover from this type of 
error?

Thanks;

Paul

Richard Elling

2008-Oct-09 16:17 UTC

Paul Pilcher wrote:
> All;
>
> I have a question about ZFS and how it protects data integrity in the 
> context of a replication scenario.
>
> First, ZFS is designed such that all data on disk is in a consistent 
> state.  Likewise, all data in a ZFS snapshot on disk is in a consistent 
> state.  Further, ZFS, by virtue of its 256 bit checksums is capable of 
> finding and repairing data corruption should it occur.
>
> In the case of a snapshot send/receive, what happens if the snap is 
> corrupted when sent to a remote system?  Would ZFS identify this?  
Yes.
> If it 
> does identify the corruption what is done to recover from this type of 
> error?
>   
Nothing.  See RFE CR 6736837, improve send/receive fault tolerance
http://bugs.opensolaris.org/view_bug.do?bug_id=6736837

As it stands today, you will need to resend.
 -- richard
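
[Editor's sketch] Since the only recovery from a corrupted stream is to resend, a replication job can simply rerun the whole send/receive until the receiving side accepts it. The following is a minimal sketch, not anything from the thread: the snapshot, host, and dataset names are hypothetical, the retry count is arbitrary, and it assumes a failed `zfs receive` exits non-zero so that the pipeline's exit status can be trusted.

```python
import shlex
import subprocess
from typing import Callable


def build_send_receive(snapshot: str, remote_host: str, target: str) -> str:
    """Return a shell pipeline that streams one snapshot to a remote dataset."""
    return "zfs send {} | ssh {} zfs receive {}".format(
        shlex.quote(snapshot), shlex.quote(remote_host), shlex.quote(target)
    )


def run_pipeline(command: str) -> bool:
    """Run the pipeline; True when the last command in it exits with 0."""
    return subprocess.run(command, shell=True).returncode == 0


def replicate_with_retry(
    command: str,
    attempts: int = 3,
    run_once: Callable[[str], bool] = run_pipeline,
) -> int:
    """Resend the full stream until it is accepted; return the attempt number."""
    for attempt in range(1, attempts + 1):
        if run_once(command):
            return attempt
    raise RuntimeError("stream rejected %d times, giving up" % attempts)
```

With real pools one would call `replicate_with_retry(build_send_receive("tank/data@snap1", "backuphost", "backup/data"))`; those three names are placeholders. Each retry sends the complete stream again, because nothing here resumes a partial transfer.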

Jim Dunham

2008-Oct-10 23:31 UTC

Paul,
> I have a question about ZFS and how it protects data integrity in the
> context of a replication scenario.
>
> First, ZFS is designed such that all data on disk is in a consistent
> state.  Likewise, all data in a ZFS snapshot on disk is in a consistent
> state.  Further, ZFS, by virtue of its 256 bit checksums is capable of
> finding and repairing data corruption should it occur.
>
> In the case of a snapshot send/receive, what happens if the snap is
> corrupted when sent to a remote system?  Would ZFS identify this?
> If it does identify the corruption what is done to recover from this
> type of error?
With host-based or controller-based replication of the physical
volumes in a storage pool, recovery from ZFS-detected corruption is
possible.

For host-based replication see: http://www.opensolaris.org/os/project/avs/

1).	Start by provisioning replication of an entire ZFS storage pool
2).	Wait until the replicated storage pool is fully synchronized
3).	Pause replication
4).	zpool import the replicated storage pool on the secondary node
5).	Perform a zpool scrub of the replicated storage pool
6).	zpool export the replicated storage pool
7).	If no scrub errors were detected, resume replication and repeat
starting at step 2, at an interval of one's choosing.
8).	Otherwise, use dd (or some other tool) to write over all secondary
node blocks that are reported as scrub errors
	CAUTION: Do not zpool import this storage pool until step 4 has been
reached again.
9).	Resume replication
10).	Repeat at step 2, verifying that scrub errors are now resolved.
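
[Editor's sketch] Steps 4 through 6 above lend themselves to a small script on the secondary node. The sketch below is an assumption-laden illustration rather than a tested procedure: the pool name is a placeholder, pausing and resuming replication (steps 3 and 9) are left to whatever replication tooling is in use, and the text matching on `zpool status` output relies on the strings "scrub in progress" and "errors: No known data errors", whose exact wording may differ between ZFS releases.

```python
import subprocess
import time


def scrub_in_progress(status_text: str) -> bool:
    """True while `zpool status` still reports a running scrub."""
    return "scrub in progress" in status_text


def pool_is_clean(status_text: str) -> bool:
    """True when the status output carries the 'no known errors' line."""
    return any(
        line.strip() == "errors: No known data errors"
        for line in status_text.splitlines()
    )


def verify_replica(pool: str, run=subprocess.run, poll_seconds: float = 60) -> bool:
    """Import the replica, scrub it, report the result, and always export it."""
    run(["zpool", "import", pool], check=True)           # step 4
    try:
        run(["zpool", "scrub", pool], check=True)        # step 5
        while True:
            status = run(
                ["zpool", "status", "-v", pool],
                check=True, capture_output=True, text=True,
            ).stdout
            if not scrub_in_progress(status):
                break
            time.sleep(poll_seconds)
        return pool_is_clean(status)
    finally:
        run(["zpool", "export", pool], check=True)       # step 6
```

The `finally` clause is there so the pool is exported even if the scrub or a status call fails, which keeps the replica from staying imported when replication is resumed.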

The assumption here is that the zpool scrub errors are some form
of end-to-end replication error, not physical media errors on the
secondary node. If they are media errors, repeating the above
procedure would keep reporting the same defective blocks. It is
also possible that at step 8, dd or some other tool would report
errors if the media is really bad.

One may be concerned about using dd at step 8, as it seems a
little unstructured. Note that all write I/Os are scoreboarded, even
ones that may have been performed incorrectly. So if one makes a
mistake at this step, replication will fix it!
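
[Editor's sketch] The reason step 8 works can be shown with a toy model of a scoreboard. This is only an illustration of the idea and not how the replication software is implemented: the class and method names are invented, blocks are plain byte strings, and the resync is assumed to copy exactly the blocks marked dirty from the primary to the secondary.

```python
class ReplicatedVolume:
    """Toy primary/secondary pair with a scoreboard of dirty block numbers."""

    def __init__(self, blocks):
        self.primary = list(blocks)
        self.secondary = list(blocks)
        self.dirty = set()  # the scoreboard

    def corrupt_in_transit(self, index, data):
        """A replication error: the secondary changes, the scoreboard does not."""
        self.secondary[index] = data

    def write_secondary(self, index, data):
        """A write on the secondary (the dd of step 8) is always scoreboarded."""
        self.secondary[index] = data
        self.dirty.add(index)

    def update_resync(self):
        """Copy every scoreboarded block from the primary; return what was copied."""
        copied = sorted(self.dirty)
        for index in copied:
            self.secondary[index] = self.primary[index]
        self.dirty.clear()
        return copied
```

In this model a silently corrupted block survives a resync because nothing marked it, while any overwrite of that block, right or wrong, marks it and causes the primary's copy to be resent.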

Finally, this last point brings up a good thing about using this
method to fix zpool scrub errors: one can test it against a
perfectly good replicated ZFS storage pool. Of course, using a
non-production storage pool instead of production data is also a
good first step.
> Thanks;
>
> Paul
Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
