I've been playing with replication of a ZFS zpool using the recently released AVS. I'm pleased with things, but just replicating the data is only part of the problem. The big question is: can I have a zpool open in 2 places?

What I really want is a zpool on node1 open and writable (production storage) and replicated to node2, where it's open for read-only access (standby storage).

This is an old problem, and I'm not sure it's remotely possible. It's bad enough with UFS, but ZFS maintains a hell of a lot more metadata. How is node2 supposed to know that a snapshot has been created, for instance? With UFS you can at least get around some of these problems using directio, but that's not an option with a zpool.

I know this is a fairly remedial issue to bring up... but if I think about what I want Thumper-to-Thumper replication to look like, I want 2 usable storage systems. As I see it now, the secondary storage (node2) is useless until you break replication and import the pool, do your thing, and then re-sync storage to re-enable replication.

Am I missing something? I'm hoping there is an option I'm not aware of.

benr.
Ben,

> I've been playing with replication of a ZFS zpool using the recently released AVS.
> I'm pleased with things, but just replicating the data is only part of the problem.
> The big question is: can I have a zpool open in 2 places?

No. The ability to have a zpool open in two places would require "shared ZFS". The semantics of remote replication can be viewed as similar to those of two Solaris hosts looking at the same SAN or dual-ported storage. Today, ZFS detects this with both SNDR and shared storage, as part of "zpool import", warning that the pool is active elsewhere.

> What I really want is a zpool on node1 open and writable (production storage) and
> replicated to node2, where it's open for read-only access (standby storage).

The best you can do for this is to use the II portion of Availability Suite to take a snapshot of the active SNDR replica on the remote node, getting a snapshot of the ZFS filesystem being replicated. In this case, ZFS on the remote node reads from a static, point-in-time consistent copy, instead of seeing replicated disk blocks changing underneath the zpool it is reading from.

> This is an old problem, and I'm not sure it's remotely possible. It's bad enough with
> UFS, but ZFS maintains a hell of a lot more metadata. How is node2 supposed to know
> that a snapshot has been created, for instance? With UFS you can at least get around
> some of these problems using directio, but that's not an option with a zpool.
>
> I know this is a fairly remedial issue to bring up... but if I think about what I want
> Thumper-to-Thumper replication to look like, I want 2 usable storage systems. As I see
> it now, the secondary storage (node2) is useless until you break replication and import
> the pool, do your thing, and then re-sync storage to re-enable replication.
>
> Am I missing something? I'm hoping there is an option I'm not aware of.

No. Also, just to be clear, after you "... do your thing, and then re-sync storage ...", the re-sync either keeps all of the data on the SNDR primary OR keeps all of the data on the SNDR secondary. There is no means to combine writes that occurred in two separate ZFS filesystems back into one filesystem. The remote ZFS filesystem is essentially a clone of the original filesystem, and once a write I/O occurs to either side, the two filesystems take on a life of their own. Of course this is not unique to the ZFS filesystem, as the same is true for all others, and this underlying storage behavior is not unique to SNDR, as it happens with other host-based and controller-based replication as well.

Jim
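P.S. For anyone who wants to try the II route, here is the rough shape of it on the remote (SNDR secondary) node. This is only a sketch: the device names, the pool name "tank", and the private import directory are placeholders of mine, not a tested recipe, so check iiadm(1M) on your release for exact syntax.

  # one-time setup on node2: enable an independent II set
  # (arguments: master = the SNDR secondary volume, then shadow volume, then bitmap volume)
  iiadm -e ind /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t3d0s0

  # whenever a fresh, consistent view is wanted on node2
  iiadm -u s /dev/rdsk/c1t2d0s0    # update the shadow from the master
  iiadm -w /dev/rdsk/c1t2d0s0      # wait for the copy to quiesce

  # point "zpool import" at a directory holding only the shadow device, so ZFS
  # does not also find the live SNDR secondary carrying the same pool GUID
  mkdir -p /shadowdev
  ln -s /dev/dsk/c1t2d0s0 /shadowdev/c1t2d0s0
  zpool import -d /shadowdev -f tank

Since the shadow is crash-consistent rather than cleanly exported, the import still needs "-f", and the pool should be exported again before the next iiadm update of the shadow.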
Hello Ben,

Monday, February 5, 2007, 9:17:01 AM, you wrote:

BR> I've been playing with replication of a ZFS zpool using the
BR> recently released AVS. I'm pleased with things, but just
BR> replicating the data is only part of the problem. The big
BR> question is: can I have a zpool open in 2 places?

BR> What I really want is a zpool on node1 open and writable
BR> (production storage) and replicated to node2, where it's open for
BR> read-only access (standby storage).

BR> This is an old problem, and I'm not sure it's remotely possible. It's
BR> bad enough with UFS, but ZFS maintains a hell of a lot more
BR> metadata. How is node2 supposed to know that a snapshot has been
BR> created, for instance? With UFS you can at least get around some of
BR> these problems using directio, but that's not an option with a zpool.

BR> I know this is a fairly remedial issue to bring up... but if I
BR> think about what I want Thumper-to-Thumper replication to look
BR> like, I want 2 usable storage systems. As I see it now, the
BR> secondary storage (node2) is useless until you break replication
BR> and import the pool, do your thing, and then re-sync storage to re-enable replication.

BR> Am I missing something? I'm hoping there is an option I'm not aware of.

You can't mount read-write on one node and read-only on another (not to mention that ZFS doesn't let you import a pool read-only right now). You can mount the same filesystem read-only on both nodes with something like UFS, but not with ZFS (no read-only import).

I believe what you really need is a "continuous zfs send" feature. We are developing something like this right now. I expect we can give more details really soon now.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Robert,

> You can't mount read-write on one node and read-only on another (not to mention
> that ZFS doesn't let you import a pool read-only right now). You can mount the
> same filesystem read-only on both nodes with something like UFS, but not with
> ZFS (no read-only import).

One cannot just mount a filesystem read-only if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, except of course shared-reader QFS, this will fail given time.

Even if one has the means to mount a filesystem with DIRECTIO (no caching) and READ-ONLY (no writes), that does not prevent the filesystem from looking at the contents of block "A" and then acting on block "B". The reason is that during replication, at time T1, both blocks "A" and "B" could be written and be consistent with each other. Next the filesystem reads block "A". Now replication at time T2 updates blocks "A" and "B", again consistent with each other. Next the filesystem reads block "B" and panics due to an inconsistency only it sees between old "A" and new "B". I know this for a fact, since a forced "zpool import -f <name>" is a common instance of this exact failure, most likely due to checksum failures between metadata blocks "A" and "B".

Of course, using an instantly accessible II snapshot of an SNDR secondary volume would work just fine, since the data being read is then point-in-time consistent and static.

- Jim
Jim Dunham wrote:
> One cannot just mount a filesystem read-only if SNDR or any other host-based or
> controller-based replication is underneath. For all filesystems that I know of,
> except of course shared-reader QFS, this will fail given time.
>
> Even if one has the means to mount a filesystem with DIRECTIO (no caching) and
> READ-ONLY (no writes), that does not prevent the filesystem from looking at the
> contents of block "A" and then acting on block "B". The reason is that during
> replication, at time T1, both blocks "A" and "B" could be written and be consistent
> with each other. Next the filesystem reads block "A". Now replication at time T2
> updates blocks "A" and "B", again consistent with each other. Next the filesystem
> reads block "B" and panics due to an inconsistency only it sees between old "A" and
> new "B". I know this for a fact, since a forced "zpool import -f <name>" is a common
> instance of this exact failure, most likely due to checksum failures between metadata
> blocks "A" and "B".

Ya, that bit me last night.
'zpool import' shows the pool fine, but when you force the import you panic:

Feb 5 07:14:10 uma ^Mpanic[cpu0]/thread=fffffe8001072c80:
Feb 5 07:14:10 uma genunix: [ID 809409 kern.notice] ZFS: I/O failure (write on <unknown> off 0: zio fffffe80c54ed380 [L0 unallocated] 400L/200P DVA[0]=<0:360000000:200> DVA[1]=<0:9c0003800:200> DVA[2]=<0:20004e00:200> fletcher4 lzjb LE contiguous birth=57416 fill=0 cksum=de2e56ffd:5591b77b74b:1101a91d58dfc:252efdf22532d0): error 5
Feb 5 07:14:11 uma unix: [ID 100000 kern.notice]
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072a40 zfs:zio_done+140 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072a60 zfs:zio_next_stage+68 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072ab0 zfs:zio_wait_for_children+5d ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072ad0 zfs:zio_wait_children_done+20 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072af0 zfs:zio_next_stage+68 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072b40 zfs:zio_vdev_io_assess+129 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072b60 zfs:zio_next_stage+68 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072bb0 zfs:vdev_mirror_io_done+2af ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072bd0 zfs:zio_vdev_io_done+26 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072c60 genunix:taskq_thread+1a7 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072c70 unix:thread_start+8 ()
Feb 5 07:14:11 uma unix: [ID 100000 kern.notice]

So without using II, what's the best method of bringing up the secondary storage? Is just dropping the primary into logging mode acceptable?

benr.
Ben Rockwood wrote:
> Ya, that bit me last night.
> 'zpool import' shows the pool fine, but when you force the import you panic.
>
> So without using II, what's the best method of bringing up the secondary
> storage? Is just dropping the primary into logging mode acceptable?

Yes, placing SNDR in logging mode stops the replication of writes. Also, performing a "zpool export" on the primary node and waiting (sndradm -w) until all writes are replicated means that on the SNDR secondary node a "zpool import" can be done without "-f", as a forced import is not needed, since the zpool export operation got replicated. Be sure to remember to "zpool export" on the remote node before resuming replication on the primary node, or another panic will likely occur.

Jim
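P.S. Putting those steps into commands, the sequence I have in mind looks roughly like the following. The pool name "tank" and the SNDR set name "tank-set" are placeholders, and option details vary by release, so treat this as a sketch rather than a recipe; see sndradm(1M).

  # on node1 (primary): quiesce the pool and drain replication
  zpool export tank
  sndradm -n -w tank-set     # wait until outstanding writes have reached node2
  sndradm -n -l tank-set     # drop the set into logging mode (replication stops)

  # on node2 (secondary): the export was replicated, so no -f is required
  zpool import tank
  #   ... do your thing ...
  zpool export tank          # must be exported again before replication resumes

  # back on node1: resume production and resynchronize, discarding node2's changes
  zpool import tank
  sndradm -n -u tank-set     # update resync from primary to secondary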
Ben Rockwood wrote:
> What I really want is a zpool on node1 open and writable (production
> storage) and replicated to node2, where it's open for read-only
> access (standby storage).

We intend to solve this problem by using zfs send/recv. You can script up a "poor man's" send/recv solution today, but we're working on making it better.

--matt
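P.S. In case it helps in the meantime, the rough shape of the scripted approach is below. The names are made up (pool/filesystem tank/data, standby host node2, root ssh between the nodes, and a pool already named tank on node2 with no tank/data yet), so this is a sketch of the idea, not a finished tool.

  # initial full copy to the standby (creates tank/data on node2)
  zfs snapshot tank/data@repl-0
  zfs send tank/data@repl-0 | ssh node2 zfs recv tank/data

  # repeated (e.g. from cron): send only what changed since the last snapshot
  zfs snapshot tank/data@repl-1
  zfs send -i tank/data@repl-0 tank/data@repl-1 | ssh node2 zfs recv tank/data

The received filesystem is mounted and readable on node2 between updates. The catch is that it must not be modified locally, or it has to be rolled back (zfs rollback) to the last received snapshot before the next incremental will apply, and old snapshots need to be pruned on both sides once newer ones have been received.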