Project Overview:

I propose the creation of a project on opensolaris.org to bring to the community two Solaris host-based data services: volume snapshot and volume replication. These two data services exist today as the Sun StorageTek Availability Suite, an unbundled product set for Solaris 8, 9 & 10, consisting of Instant Image (II) and Network Data Replicator (SNDR).

Project Description:

Although Availability Suite is typically known as just two data services (II & SNDR), there is an underlying Solaris I/O filter driver framework which supports them. This framework provides the means to stack one or more block-based pseudo device drivers onto any pre-provisioned cb_ops structure [ http://www.opensolaris.org/os/article/2005-03-31_inside_opensolaris__solaris_driver_programming/#datastructs ], thereby shunting all cb_ops I/O into the top of a developed filter driver (for driver-specific processing), then out the bottom of the filter driver, back into the original cb_ops entry points.

Availability Suite was developed to interpose itself on the I/O stack of a block device, providing a filter driver framework with the means to intercept any I/O originating from an upstream file system, database, or application layer. This framework provided the means for Availability Suite to support snapshot and remote replication data services for UFS, QFS, VxFS, and more recently the ZFS file system, plus various databases like Oracle, Sybase and PostgreSQL, and application I/Os as well. A filter driver at this point in the Solaris I/O stack allows any number of data services to be implemented, without regard to the underlying block storage they will be configured on. Today, as a snapshot and/or replication solution, the framework allows the source and destination block storage devices to differ not only in physical characteristics (DAS, Fibre Channel, iSCSI, etc.), but also in logical characteristics such as RAID type, volume-managed storage (e.g., SVM, VxVM), lofi, zvols, even ram disks.

Community Involvement:

By providing this filter-driver framework, two working filter drivers (II & SNDR), and an extensive collection of supporting software and utilities, it is envisioned that individuals and companies that adopt OpenSolaris as a viable storage platform will utilize and enhance the existing II & SNDR data services, and will also gain the means to develop their own block-based filter driver(s), further enhancing the use and adoption of OpenSolaris.

A timely example, very applicable to Availability Suite and the OpenSolaris community, is the recent announcement of the Project Proposal: lofi [ compression & encryption ] - http://www.opensolaris.org/jive/click.jspa&messageID=26841. By leveraging both the Availability Suite and the lofi OpenSolaris projects, it would be highly probable not only to offer compression & encryption to lofi devices (as already proposed), but, by leveraging the two projects together, to create the means to support file systems, databases and applications across all block-based storage devices.
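For a flavor of the administrative model, enabling an SNDR replica and an independent II shadow looks roughly like the following sketch (the host, device and bitmap names here are hypothetical; see the administration guides below for the authoritative syntax):

    # replicate a primary volume to a secondary host, synchronously
    # (primary host, volume, bitmap; secondary host, volume, bitmap)
    sndradm -e primhost /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 \
            sechost /dev/rdsk/c2t0d0s0 /dev/rdsk/c2t0d0s1 ip sync

    # take an independent point-in-time copy (master, shadow, bitmap)
    iiadm -e ind /dev/rdsk/c1t0d0s0 /dev/rdsk/c3t0d0s0 /dev/rdsk/c3t0d0s1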
Since Availability Suite has strong technical ties to storage, please look for email discussion for this project at: <storage-discuss at opensolaris dot org>

A complete set of Availability Suite administration guides can be found at: http://docs.sun.com/app/docs?p=coll%2FAVS4.0

Project Lead:

Jim Dunham http://www.opensolaris.org/viewProfile.jspa?username=jdunham

Availability Suite - New Solaris Storage Group
Jason J. W. Williams
2007-Jan-27 00:15 UTC
[zfs-discuss] Project Proposal: Availability Suite
Could the replication engine eventually be integrated more tightly with ZFS? That would be a slick alternative to send/recv.

Best Regards,
Jason

On 1/26/07, Jim Dunham <James.Dunham at sun.com> wrote:
> Project Overview:
>
> I propose the creation of a project on opensolaris.org, to bring to
> the community two Solaris host-based data services [...]
Jason J. W. Williams wrote:
> Could the replication engine eventually be integrated more tightly
> with ZFS?

Not in its present form. The architecture and implementation of Availability Suite is driven off block-based replication at the device level (/dev/rdsk/...), something that allows the product to replicate any Solaris file system, database, etc., without any knowledge of what it is actually replicating.

To pursue ZFS replication in the manner of Availability Suite, one needs to see what replication looks like from an abstract point of view. Simplistically, remote replication is like the letter 'h' (sketched below): the left side of the letter is the complete I/O path on the primary node, the horizontal part of the letter is the remote replication network link, and the right side of the letter is only the bottom half of the complete I/O path on the secondary node.

ZFS would first have to have its functional I/O path split into two halves, a top and a bottom piece. We would then configure replication, the letter 'h', between two given nodes, running both the top and bottom pieces of ZFS on the source node, and just the bottom half of ZFS on the secondary node.

Today, the SNDR component of Availability Suite works like the letter 'h', where we split the Solaris I/O stack into a top and bottom half. The top half is the software (file system, database or application I/O) that directs its I/Os to the bottom half (raw device, volume manager or block device). So all that needs to be done is to design and build a new variant of the letter 'h', and find the place to separate ZFS into two pieces.
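Pictorially (the labels here are illustrative only):

        primary node                        secondary node

    file system / database
             |
             v
       [ replicator ] ----- network -----> [ replicator ]
             |                                    |
             v                                    v
     local block device                  remote block device

- Jim Dunham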
Jason J. W. Williams
2007-Jan-29 19:50 UTC
[zfs-discuss] Project Proposal: Availability Suite
Thank you for the detailed explanation. It is very helpful for understanding the issue. Is anyone successfully using SNDR with ZFS yet?

Best Regards,
Jason

On 1/26/07, Jim Dunham <James.Dunham at sun.com> wrote:
> Jason J. W. Williams wrote:
> > Could the replication engine eventually be integrated more tightly
> > with ZFS?
> Not in its present form. [...]
Jason,

> Thank you for the detailed explanation. It is very helpful for
> understanding the issue. Is anyone successfully using SNDR with ZFS yet?

Of the opportunities I've been involved with, the answer is yes, but so far I've not seen SNDR with ZFS in a production environment. That does not mean such deployments don't exist. It was not until late June '06 that AVS 4.0, Solaris 10 and ZFS were all generally available, and to date AVS has not been made available for the Solaris Express, Community Release, but it will be real soon.

While I have your attention, there are two issues between ZFS and AVS that need mentioning.

1). When ZFS is given an entire LUN to place in a ZFS storage pool, ZFS detects this, enabling SCSI write caching on the LUN, and also opens the LUN with exclusive access, preventing other data services (like AVS) from accessing the device. The work-around is to manually format the LUN, typically placing all the blocks into a single partition, and then place just this partition into the ZFS storage pool. ZFS detects that it does not own the entire LUN and doesn't enable write caching, which means it also doesn't open the LUN with exclusive access, and therefore AVS and ZFS can share the same LUN.

I thought about submitting an RFE to have ZFS provide a means to override this restriction, but I am not 100% certain that a ZFS filesystem directly accessing a write-cache-enabled LUN is the same thing as a replicated ZFS filesystem accessing a write-cache-enabled LUN. Even though AVS is write-order consistent, there are disaster recovery scenarios, when enacted, where block-order, versus write-order, I/Os are issued.

2). One has to be very cautious in using "zpool import -f .... " (forced import), especially on a LUN or LUNs into which SNDR is actively replicating. If ZFS complains that the storage pool was not cleanly exported when issuing a "zpool import ...", and one attempts a "zpool import -f ...." without checking the active replication state, they are sure to panic Solaris. Of course this failure scenario is no different from accessing a LUN or LUNs on dual-ported or SAN-based storage when another Solaris host is still accessing the ZFS filesystem, or from controller-based replication; these are all just different operational scenarios of the same issue: data blocks changing out from underneath the ZFS filesystem and its checksumming mechanisms. (A sketch of both work-arounds appears at the end of this note.)
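A minimal sketch of both work-arounds, assuming hypothetical device and pool names:

    # 1). give ZFS a slice rather than the whole LUN, so that it neither
    #     enables write caching nor opens the device exclusively
    format                        # label the LUN, all blocks in slice 0
    zpool create tank c1t0d0s0

    # 2). before forcing an import on the secondary, make sure SNDR is
    #     no longer replicating into the LUN(s)
    sndradm -P                    # show the state of the SNDR set(s)
    sndradm -l                    # place the set(s) into logging mode
    zpool import -f tank

Jim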
Jason J. W. Williams
2007-Jan-30 06:10 UTC
[zfs-discuss] Project Proposal: Availability Suite
Hi Jim,

Thank you very much for the heads up. Unfortunately, we need the write cache enabled for the application I was thinking of combining this with. Sounds like SNDR and ZFS need some more soak time before you can use both to their full potential together?

Best Regards,
Jason

On 1/29/07, Jim Dunham <James.Dunham at sun.com> wrote:
> Jason,
> > Thank you for the detailed explanation. It is very helpful for
> > understanding the issue. Is anyone successfully using SNDR with ZFS yet?
> Of the opportunities I've been involved with, the answer is yes [...]
Jason J. W. Williams wrote:
> Thank you very much for the heads up. Unfortunately, we need the
> write cache enabled for the application I was thinking of combining
> this with. Sounds like SNDR and ZFS need some more soak time before
> you can use both to their full potential together?

Well... there is the fact that SNDR works with filesystems other than ZFS. (Yes, I know this is the ZFS list.) Working around architectural issues for ZFS, and ZFS alone, might cause issues for the others.

I think the best-of-both-worlds approach would be to let SNDR plug in to ZFS along the same lines the crypto stuff will be able to plug in, different compression types, etc. There once was a slide that showed how that worked... or I'm hallucinating again.
On Fri, Jan 26, 2007 at 05:15:28PM -0700, Jason J. W. Williams wrote:
> Could the replication engine eventually be integrated more tightly
> with ZFS? That would be a slick alternative to send/recv.

But a continuous zfs send/recv would be cool too. In fact, I think ZFS tightly integrated with SNDR wouldn't be that much different from a continuous zfs send/recv.
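As a crude approximation of what such a continuous send/recv might automate, imagine a loop along these lines (the dataset and host names are hypothetical):

    # ship an initial full stream, then a delta every minute
    PREV=snap-0
    zfs snapshot tank/fs@$PREV
    zfs send tank/fs@$PREV | ssh remotehost zfs recv backup/fs
    while true; do
        NOW=snap-`date +%s`
        zfs snapshot tank/fs@$NOW
        zfs send -i tank/fs@$PREV tank/fs@$NOW | \
            ssh remotehost zfs recv backup/fs
        PREV=$NOW
        sleep 60
    done

Nico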
Nicolas Williams wrote:
> But a continuous zfs send/recv would be cool too. In fact, I think ZFS
> tightly integrated with SNDR wouldn't be that much different from a
> continuous zfs send/recv.

Even better with snapshots, and scoreboarding, and synch vs asynch and and and and .....
On Fri, Feb 02, 2007 at 03:17:17PM -0500, Torrey McMahon wrote:
> Nicolas Williams wrote:
> > But a continuous zfs send/recv would be cool too. In fact, I think ZFS
> > tightly integrated with SNDR wouldn't be that much different from a
> > continuous zfs send/recv.
>
> Even better with snapshots, and scoreboarding, and synch vs asynch and
> and and and .....

Right. I hadn't thought of that. A replication system that is well integrated with ZFS should have very similar properties whether designed as a journalling scheme or as a scoreboarding scheme. A continuous zfs send/recv as I imagine it would be like journalling, while ZFS+SNDR would be more like scoreboarding.

Unlike traditional journalling replication, a continuous ZFS send/recv scheme could deal with resource constraints by taking a snapshot and throttling replication until resources become available again. Replication throttling would mean losing some transaction history, but since we don't expose that right now, nothing would be lost.

Scoreboarding (what SNDR does) should perform better in general, but in the case of COW filesystems and databases ISTM that it should be a wash unless it's properly integrated with the COW system, and that's what makes me think scoreboarding and journalling approach each other at the limit when integrated with ZFS. In general I would expect journalling to have better reliability semantics (since you always know exactly the last transaction that was successfully replicated).

Nico
Nicolas Williams wrote:
> Right. I hadn't thought of that. A replication system that is well
> integrated with ZFS should have very similar properties whether designed
> as a journalling scheme or as a scoreboarding scheme.

Here's another thing to think about: ZFS is a COW filesystem. Even if I'm changing one piece of data over and over, which in the past might be a set of blocks, I'm going to be writing out new blocks on disk. Many replication strategies take into account the fact that even though your data is changing quite a bit, the actual block-storage-level changes are much smaller.
On Feb 2, 2007, at 15:35, Nicolas Williams wrote:
> Unlike traditional journalling replication, a continuous ZFS send/recv
> scheme could deal with resource constraints by taking a snapshot and
> throttling replication until resources become available again.
> Replication throttling would mean losing some transaction history, but
> since we don't expose that right now, nothing would be lost.
>
> Scoreboarding (what SNDR does) should perform better in general, but in
> the case of COW filesystems and databases ISTM that it should be a wash
> unless it's properly integrated with the COW system, and that's what
> makes me think scoreboarding and journalling approach each other at the
> limit when integrated with ZFS.

hmm .. a COW scoreboard .. visions of Clustra with the notion of "each node is an atomic failure unit" spring to mind .. of course in this light, there's not much of a difference between just replication and global synchronization .. very interesting ..

---
.je
Jonathan Edwards wrote:
> hmm .. a COW scoreboard .. visions of Clustra with the notion of "each
> node is an atomic failure unit" spring to mind .. of course in this
> light, there's not much of a difference between just replication and
> global synchronization ..

But would you want a COW scoreboard or a transaction log? Or would there be a difference?

Is it Friday yet? I think we need to start drinking on this one. ;)
My two (everyman's) cents - could something like this be modeled after MySQL replication, or even something like DRBD (drbd.org)? Seems like possibly the same idea.

On 1/26/07, Jim Dunham <James.Dunham at sun.com> wrote:
> Project Overview:
> ...
Hello Nicolas,

Friday, February 2, 2007, 9:01:20 PM, you wrote:

NW> On Fri, Jan 26, 2007 at 05:15:28PM -0700, Jason J. W. Williams wrote:
>> Could the replication engine eventually be integrated more tightly
>> with ZFS? That would be a slick alternative to send/recv.

NW> But a continuous zfs send/recv would be cool too. In fact, I think ZFS
NW> tightly integrated with SNDR wouldn't be that much different from a
NW> continuous zfs send/recv.

It would. zfs send/recv is more flexible in what it expects on the remote side. You can have an uncompressed file system on the sending side and a compressed one on the remote, not to mention other properties, and a different raid type and pool/fs size.
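For example, the receiving side can look entirely different (the pool and device names here are hypothetical):

    # different RAID layout and properties on the receiving pool
    zpool create backup raidz c2t0d0 c2t1d0 c2t2d0
    zfs set compression=on backup
    ssh primhost zfs send tank/fs@snap | zfs recv backup/fs

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com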
On Fri, 2 Feb 2007, Torrey McMahon wrote:
> Well... there is the fact that SNDR works with filesystems other than
> ZFS. (Yes, I know this is the ZFS list.) Working around architectural
> issues for ZFS, and ZFS alone, might cause issues for the others.

SNDR has some issues with logging UFS as well. If you start a SNDR live copy on an active logging UFS (not _writelocked_), the UFS log state may not be copied consistently.

If you want a live remote replication facility, it _NEEDS_ to talk to the filesystem somehow. There must be a callback mechanism that the filesystem could use to tell the replicator "and from exactly now on you start replicating". The only entity which can truly give this signal is the filesystem itself.

And no, that's _not_ when the filesystem does a "flush write cache" ioctl. Or when the user has just issued a "sync" command or similar. For ZFS, it'd be when a ZIL transaction is closed (as I understand it); for UFS it'd be when the UFS log is fully rolled. There's no notification to external entities when these two events happen. SNDR tries its best to achieve this detection, but without actually _stopping_ all I/O (on UFS: writelocking), there's a window of vulnerability still open. And SNDR/II don't stop filesystem I/O - by basic principle. That's how they're sold/advertised/intended to be used.

I'm all willing to see SNDR/II go open - we could finally work these issues!

FrankH.
Frank,

> SNDR has some issues with logging UFS as well. If you start a SNDR
> live copy on an active logging UFS (not _writelocked_), the UFS log
> state may not be copied consistently.

Treading "very" carefully, UFS logging may have issues with being replicated, not the other way around. SNDR replication (after synchronizing) maintains a write-order consistent volume; thus if there is an issue with UFS logging being able to access an SNDR secondary, then UFS logging will also have issues with accessing a volume after Solaris crashes. The end result of Solaris crashing, or of SNDR replication stopping, is a write-ordered, crash-consistent volume.

Given that both UFS logging and SNDR are (near) perfect (or there would be a flood of escalations), this issue, in all cases I've seen to date, is that the SNDR primary volume being replicated is mounted with UFS logging enabled, but the SNDR secondary is not mounted with UFS logging enabled. Once this condition happens, the problem can be resolved by fixing /etc/vfstab to correct the inconsistent mount options, and then performing an SNDR update sync (see the sketch at the end of this note).

> If you want a live remote replication facility, it _NEEDS_ to talk to
> the filesystem somehow. There must be a callback mechanism that the
> filesystem could use to tell the replicator "and from exactly now on
> you start replicating". The only entity which can truly give this
> signal is the filesystem itself.

There is an RFE against SNDR for something called "in-line PIT". I hope that this work will get done soon.

> And no, that's _not_ when the filesystem does a "flush write cache"
> ioctl. Or when the user has just issued a "sync" command or similar.
> For ZFS, it'd be when a ZIL transaction is closed (as I understand
> it); for UFS it'd be when the UFS log is fully rolled. There's no
> notification to external entities when these two events happen.

Because ZFS is always on-disk consistent, this is not an issue. So far in ALL my testing of replicating ZFS with SNDR, I have not seen ZFS fail!

Of course, be careful not to confuse my stated position with another closely related scenario: that of accessing ZFS on the remote node via a forced import "zpool import -f <name>" with active SNDR replication, as ZFS is sure to panic the system. ZFS, unlike other filesystems, has 0% tolerance for corrupted metadata.
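Roughly, with a hypothetical device and mount point:

    # make the mount options match on both nodes, e.g. in /etc/vfstab:
    /dev/dsk/c1t0d0s0  /dev/rdsk/c1t0d0s0  /data  ufs  2  yes  logging

    # then refresh the secondary with an update sync
    sndradm -u

Jim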
On Mon, 5 Feb 2007, Jim Dunham wrote:
> Treading "very" carefully, UFS logging may have issues with being
> replicated, not the other way around. SNDR replication (after
> synchronizing) maintains a write-order consistent volume; thus if
> there is an issue with UFS logging being able to access an SNDR
> secondary, then UFS logging will also have issues with accessing a
> volume after Solaris crashes. The end result of Solaris crashing, or
> of SNDR replication stopping, is a write-ordered, crash-consistent
> volume.

Except that you're not getting user data consistency - because UFS logging only does the write-ordered crash consistency for metadata. In other words, it's possible with UFS logging to see metadata changes (file growth/shrink, filling of holes in sparse files) that do not match the file contents - AFTER crash recovery.

To get full consistency of data and metadata across crashes / replication termination, with a replicator underneath, the filesystem needs a way of telling the replicator "and now start/stop replicating please". For the filesystem to barrier.

I'm not saying SNDR isn't doing a good job. I'm just saying it could do a perfect job if it integrated in this way with the filesystem on top. If there were 'start/stop' hooks.

II is a different matter again. It had, for some time - I don't know if that's still true - a window where it would EIO writes when enabling the image. Neither UFS logging nor ZFS very much like being told "this critical write of yours errored out".

FrankH.