I've been playing with replication of a ZFS zpool using the recently released AVS. I'm pleased with things, but just replicating the data is only part of the problem. The big question is: can I have a zpool open in 2 places?

What I really want is a zpool on node1 open and writable (production storage) and replicated to node2, where it's open for read-only access (standby storage).

This is an old problem, and I'm not sure it's remotely possible. It's bad enough with UFS, but ZFS maintains a hell of a lot more metadata. How is node2 supposed to know that a snapshot has been created, for instance? With UFS you can at least get around some of these problems using directio, but that's not an option with a zpool.

I know this is a fairly remedial issue to bring up... but if I think about what I want Thumper-to-Thumper replication to look like, I want 2 usable storage systems. As I see it now, the secondary storage (node2) is useless until you break replication and import the pool, do your thing, and then re-sync storage to re-enable replication.

Am I missing something? I'm hoping there is an option I'm not aware of.

benr.
Ben,

> I've been playing with replication of a ZFS zpool using the recently released AVS.
> I'm pleased with things, but just replicating the data is only part of the problem.
> The big question is: can I have a zpool open in 2 places?

No. The ability to have a zpool open in two places would require "shared ZFS". The semantics of remote replication can be viewed as similar to those of two Solaris hosts looking at the same SAN or dual-ported storage. Today, ZFS detects this with both SNDR and shared storage, as part of "zpool import", warning that the pool is active elsewhere.

> What I really want is a zpool on node1 open and writable (production storage) and
> replicated to node2, where it's open for read-only access (standby storage).

The best you can do for this is to use the II portion of Availability Suite to take a snapshot of the active SNDR replica on the remote node, getting a snapshot of the ZFS filesystem being replicated. In this case, ZFS on the remote node reads from a static, point-in-time consistent copy, instead of seeing replicated disk blocks changing underneath the zpool it is reading from.

> This is an old problem, and I'm not sure it's remotely possible. It's bad enough with
> UFS, but ZFS maintains a hell of a lot more metadata. How is node2 supposed to know
> that a snapshot has been created, for instance? With UFS you can at least get around
> some of these problems using directio, but that's not an option with a zpool.
>
> I know this is a fairly remedial issue to bring up... but if I think about what I want
> Thumper-to-Thumper replication to look like, I want 2 usable storage systems. As I see
> it now, the secondary storage (node2) is useless until you break replication and import
> the pool, do your thing, and then re-sync storage to re-enable replication.
>
> Am I missing something? I'm hoping there is an option I'm not aware of.

No. Also, just to be clear, after you "... do your thing, and then re-sync storage ...", the re-sync either keeps all of the data on the SNDR primary OR keeps all of the data on the SNDR secondary. There is no means to combine writes that occurred in two separate ZFS filesystems back into one filesystem. The remote ZFS filesystem is essentially a clone of the original filesystem, and once a write I/O occurs to either side, the two filesystems take on a life of their own. Of course this is not unique to the ZFS filesystem, as the same is true for all others, and this underlying storage behavior is not unique to SNDR, as it happens with other host-based and controller-based replication as well.

Jim
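P.S. For anyone who wants to try the II route, here is the rough shape of it on the remote (SNDR secondary) node. This is only a sketch: the device names, the pool name "tank", and the private import directory are placeholders of mine, not a tested recipe, so check iiadm(1M) on your release for exact syntax.

  # one-time setup on node2: enable an independent II set
  # (arguments: master = the SNDR secondary volume, then shadow volume, then bitmap volume)
  iiadm -e ind /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t3d0s0

  # whenever a fresh, consistent view is wanted on node2
  iiadm -u s /dev/rdsk/c1t2d0s0    # update the shadow from the master
  iiadm -w /dev/rdsk/c1t2d0s0      # wait for the copy to quiesce

  # point "zpool import" at a directory holding only the shadow device, so ZFS
  # does not also find the live SNDR secondary carrying the same pool GUID
  mkdir -p /shadowdev
  ln -s /dev/dsk/c1t2d0s0 /shadowdev/c1t2d0s0
  zpool import -d /shadowdev -f tank

Since the shadow is crash-consistent rather than cleanly exported, the import still needs "-f", and the pool should be exported again before the next iiadm update of the shadow.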
Hello Ben,

Monday, February 5, 2007, 9:17:01 AM, you wrote:

BR> I've been playing with replication of a ZFS zpool using the
BR> recently released AVS. I'm pleased with things, but just
BR> replicating the data is only part of the problem. The big
BR> question is: can I have a zpool open in 2 places?

BR> What I really want is a zpool on node1 open and writable
BR> (production storage) and replicated to node2, where it's open for
BR> read-only access (standby storage).

BR> This is an old problem, and I'm not sure it's remotely possible. It's
BR> bad enough with UFS, but ZFS maintains a hell of a lot more
BR> metadata. How is node2 supposed to know that a snapshot has been
BR> created, for instance? With UFS you can at least get around some of
BR> these problems using directio, but that's not an option with a zpool.

BR> I know this is a fairly remedial issue to bring up... but if I
BR> think about what I want Thumper-to-Thumper replication to look
BR> like, I want 2 usable storage systems. As I see it now, the
BR> secondary storage (node2) is useless until you break replication
BR> and import the pool, do your thing, and then re-sync storage to re-enable replication.

BR> Am I missing something? I'm hoping there is an option I'm not aware of.

You can't mount read-write on one node and read-only on another (not to mention that ZFS doesn't let you import a pool read-only right now). You can mount the same filesystem read-only on both nodes with something like UFS, but not with ZFS (no read-only import).

I believe what you really need is a "continuous zfs send" feature. We are developing something like this right now. I expect we can give more details really soon now.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Robert,

> You can't mount read-write on one node and read-only on another (not to mention
> that ZFS doesn't let you import a pool read-only right now). You can mount the
> same filesystem read-only on both nodes with something like UFS, but not with
> ZFS (no read-only import).

One cannot just mount a filesystem read-only if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, except of course shared-reader QFS, this will fail given time.

Even if one has the means to mount a filesystem with DIRECTIO (no caching) and READ-ONLY (no writes), that does not prevent the filesystem from looking at the contents of block "A" and then acting on block "B". The reason is that during replication, at time T1, both blocks "A" and "B" could be written and be consistent with each other. Next the filesystem reads block "A". Now replication at time T2 updates blocks "A" and "B", again consistent with each other. Next the filesystem reads block "B" and panics due to an inconsistency only it sees between old "A" and new "B". I know this for a fact, since a forced "zpool import -f <name>" is a common instance of this exact failure, most likely due to checksum failures between metadata blocks "A" and "B".

Of course, using an instantly accessible II snapshot of an SNDR secondary volume would work just fine, since the data being read is then point-in-time consistent and static.

- Jim
Jim Dunham wrote:
> One cannot just mount a filesystem read-only if SNDR or any other host-based or
> controller-based replication is underneath. For all filesystems that I know of,
> except of course shared-reader QFS, this will fail given time.
>
> Even if one has the means to mount a filesystem with DIRECTIO (no caching) and
> READ-ONLY (no writes), that does not prevent the filesystem from looking at the
> contents of block "A" and then acting on block "B". The reason is that during
> replication, at time T1, both blocks "A" and "B" could be written and be consistent
> with each other. Next the filesystem reads block "A". Now replication at time T2
> updates blocks "A" and "B", again consistent with each other. Next the filesystem
> reads block "B" and panics due to an inconsistency only it sees between old "A" and
> new "B". I know this for a fact, since a forced "zpool import -f <name>" is a common
> instance of this exact failure, most likely due to checksum failures between metadata
> blocks "A" and "B".

Ya, that bit me last night.
'zpool import' shows the pool fine, but when you force the import you panic:

Feb 5 07:14:10 uma ^Mpanic[cpu0]/thread=fffffe8001072c80:
Feb 5 07:14:10 uma genunix: [ID 809409 kern.notice] ZFS: I/O failure (write on <unknown> off 0: zio fffffe80c54ed380 [L0 unallocated] 400L/200P DVA[0]=<0:360000000:200> DVA[1]=<0:9c0003800:200> DVA[2]=<0:20004e00:200> fletcher4 lzjb LE contiguous birth=57416 fill=0 cksum=de2e56ffd:5591b77b74b:1101a91d58dfc:252efdf22532d0): error 5
Feb 5 07:14:11 uma unix: [ID 100000 kern.notice]
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072a40 zfs:zio_done+140 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072a60 zfs:zio_next_stage+68 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072ab0 zfs:zio_wait_for_children+5d ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072ad0 zfs:zio_wait_children_done+20 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072af0 zfs:zio_next_stage+68 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072b40 zfs:zio_vdev_io_assess+129 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072b60 zfs:zio_next_stage+68 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072bb0 zfs:vdev_mirror_io_done+2af ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072bd0 zfs:zio_vdev_io_done+26 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072c60 genunix:taskq_thread+1a7 ()
Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fffffe8001072c70 unix:thread_start+8 ()
Feb 5 07:14:11 uma unix: [ID 100000 kern.notice]

So without using II, what's the best method of bringing up the secondary storage? Is just dropping the primary into logging mode acceptable?

benr.
Ben Rockwood wrote:
> Ya, that bit me last night.
> 'zpool import' shows the pool fine, but when you force the import you panic.
>
> So without using II, what's the best method of bringing up the secondary
> storage? Is just dropping the primary into logging mode acceptable?

Yes, placing SNDR in logging mode stops the replication of writes. Also, performing a "zpool export" on the primary node and waiting (sndradm -w) until all writes are replicated means that on the SNDR secondary node a "zpool import" can be done without "-f", as a forced import is not needed, since the zpool export operation got replicated. Be sure to remember to "zpool export" on the remote node before resuming replication on the primary node, or another panic will likely occur.

Jim
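P.S. Putting those steps into commands, the sequence I have in mind looks roughly like the following. The pool name "tank" and the SNDR set name "tank-set" are placeholders, and option details vary by release, so treat this as a sketch rather than a recipe; see sndradm(1M).

  # on node1 (primary): quiesce the pool and drain replication
  zpool export tank
  sndradm -n -w tank-set     # wait until outstanding writes have reached node2
  sndradm -n -l tank-set     # drop the set into logging mode (replication stops)

  # on node2 (secondary): the export was replicated, so no -f is required
  zpool import tank
  #   ... do your thing ...
  zpool export tank          # must be exported again before replication resumes

  # back on node1: resume production and resynchronize, discarding node2's changes
  zpool import tank
  sndradm -n -u tank-set     # update resync from primary to secondary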
Ben Rockwood wrote:
> What I really want is a zpool on node1 open and writable (production
> storage) and replicated to node2, where it's open for read-only
> access (standby storage).

We intend to solve this problem by using zfs send/recv. You can script up a "poor man's" send/recv solution today, but we're working on making it better.

--matt
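P.S. In case it helps in the meantime, the rough shape of the scripted approach is below. The names are made up (pool/filesystem tank/data, standby host node2, root ssh between the nodes, and a pool already named tank on node2 with no tank/data yet), so this is a sketch of the idea, not a finished tool.

  # initial full copy to the standby (creates tank/data on node2)
  zfs snapshot tank/data@repl-0
  zfs send tank/data@repl-0 | ssh node2 zfs recv tank/data

  # repeated (e.g. from cron): send only what changed since the last snapshot
  zfs snapshot tank/data@repl-1
  zfs send -i tank/data@repl-0 tank/data@repl-1 | ssh node2 zfs recv tank/data

The received filesystem is mounted and readable on node2 between updates. The catch is that it must not be modified locally, or it has to be rolled back (zfs rollback) to the last received snapshot before the next incremental will apply, and old snapshots need to be pruned on both sides once newer ones have been received.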