If we get two X4500s and look at AVS, would it be possible to:

1) Set up AVS to replicate the ZFS filesystems and the zvols (UFS) from 01 -> 02? Is this supported by Sol 10 5/08?

Assuming 1, we would set up a home-made IP fail-over so that, should 01 go down, all clients are redirected to 02.

2) Fail-back: are there methods in AVS to handle fail-back? Since 02 has been used, it will have newer/modified files and will need to replicate backwards until synchronised, before fail-back can occur.

We did ask our vendor, but we were just told that AVS does not support the X4500.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
lundman at gmo.jp said:
> We did ask our vendor, but we were just told that AVS does not support
> x4500.

You might have to use the open-source version of AVS, but it's not clear if that requires OpenSolaris or if it will run on Solaris 10. Here's a description of how to set it up between two X4500s:

http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

Regards,
Marion
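For readers unfamiliar with the AVS/SNDR command set, the blog above walks through an enable roughly like the following sketch. Host names, devices, bitmap slices and the group name are placeholders, not values from this thread:

  # define one SNDR set per replicated slice, on both hosts
  sndradm -e host01 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             host02 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             ip async g zfspool

  sndradm -n -g zfspool -m     # start the initial full synchronisation
  sndradm -g zfspool -P        # check replication status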
Jorgen Lundman wrote:
> We did ask our vendor, but we were just told that AVS does not support
> x4500.

The officially supported AVS has worked on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical, data - in such a case, using X41x0 servers with J4500 JBODs and an HAStoragePlus cluster instead of AVS may be a much better and more reliable option, for basically the same price.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
On Thu, Sep 4, 2008 at 12:19 AM, Ralf Ramge <ralf.ramge at webde.de> wrote:
> Jorgen Lundman wrote:
>
> > We did ask our vendor, but we were just told that AVS does not support
> > x4500.
>
> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.
>
> --
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA

I did some Googling, but I saw some limitations sharing your ZFS pool via NFS while using the HAStorage cluster product as well. Do similar limitations exist for sharing via the built-in CIFS in OpenSolaris as well?

Here: http://docs.sun.com/app/docs/doc/820-2565/z4000275997776?a=view

"Zettabyte File System (ZFS) Restrictions

If you are using the zettabyte file system (ZFS) as the exported file system, you must set the sharenfs property to off. To set the sharenfs property to off, run the following command.

$ zfs set sharenfs=off file_system/volume

To verify if the sharenfs property is set to off, run the following command.

$ zfs get sharenfs file_system/volume"

--
Brent Jones
brent at servuhome.net
Brent Jones wrote:
> I did some Googling, but I saw some limitations sharing your ZFS pool
> via NFS while using HAStorage Cluster product as well.
[...]
> If you are using the zettabyte file system (ZFS) as the exported file
> system, you must set the sharenfs property to off.

That's not a limitation, it just looks like one. The cluster's resource type called "SUNW.nfs" decides if a file system is shared or not, and it does this with the usual "share" and "unshare" commands in a separate dfstab file. The ZFS sharenfs flag is set to "off" to avoid conflicts.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
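For reference, with HA-NFS the share itself typically lives in a dfstab file under the resource group's Pathprefix directory rather than in the sharenfs property. A rough sketch, with made-up paths, dataset and resource names:

  # ZFS manages the pool, but the SUNW.nfs resource manages sharing
  zfs set sharenfs=off tank/export

  # <Pathprefix>/SUNW.nfs/dfstab.<nfs-resource>, e.g.
  # /tank/export/admin/SUNW.nfs/dfstab.nfs-res contains ordinary share(1M) lines:
  share -F nfs -o rw /tank/export/data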
zfs-discuss-bounces at opensolaris.org wrote on 09/04/2008 02:19:23 AM:

> Jorgen Lundman wrote:
>
> > We did ask our vendor, but we were just told that AVS does not support
> > x4500.
>
> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.

Ralf,

War wounds? Could you please expand on the why a bit more?

-Wade
On Thu, Sep 4, 2008 at 10:09 AM, <Wade.Stuart at fallon.com> wrote:
> zfs-discuss-bounces at opensolaris.org wrote on 09/04/2008 02:19:23 AM:
>
>> Jorgen Lundman wrote:
>>
>> > We did ask our vendor, but we were just told that AVS does not support
>> > x4500.
>>
>> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.
>
> Ralf,
>
> War wounds? Could you please expand on the why a bit more?

+1  I'd also be interested in more details.

Thanks,

--
Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
On Thu, Sep 4, 2008 at 7:38 PM, Al Hopper <al at logical-approach.com> wrote:
> On Thu, Sep 4, 2008 at 10:09 AM, <Wade.Stuart at fallon.com> wrote:
>> zfs-discuss-bounces at opensolaris.org wrote on 09/04/2008 02:19:23 AM:
>>
>>> Jorgen Lundman wrote:
>>>
>>> > We did ask our vendor, but we were just told that AVS does not support
>>> > x4500.
>>>
>>> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.
>>
>> Ralf,
>>
>> War wounds? Could you please expand on the why a bit more?
>
> +1  I'd also be interested in more details.
>
> Thanks,
>
> --
> Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com

Story time!

--
Brent Jones
brent at servuhome.net
Wade.Stuart at fallon.com wrote:
> War wounds? Could you please expand on the why a bit more?

- ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use). No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.

- AVS is not ZFS aware. For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact. That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, resulting in usually 10+ hours without real redundancy (customers who use Thumpers to store important data usually don't have the budget to connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

- ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication. The X4500 does not have a RAID controller which signals (and handles) drive faults. Drive failures on the secondary node may happen unnoticed until the primary node goes down and you want to import the zpool on the secondary node with the broken drive. Since ZFS doesn't offer a recovery mechanism like fsck, data loss of up to 20 TB may occur. If you use AVS with ZFS, make sure that you have storage which handles drive failures without OS interaction.

- 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives in total.

- An X4500 has no battery-buffered write cache. ZFS uses the server's RAM as a cache, 15 GB+. I don't want to find out how much time a resilver over the network after a power outage may take (a full reverse replication would take up to 2 weeks and is no valid option in a serious production environment). But the underlying question I asked myself is why should I want to replicate data in such an expensive way, when I think the 48 TB of data itself is not important enough to be protected by a battery?

- I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.

- AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.

If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
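For context, the takeover sequence the first point alludes to looks roughly like this on the secondary node; the group and pool names are placeholders, and -f is needed because the pool was never exported by the failed primary:

  # on host02, after confirming host01 is really down
  sndradm -n -g zfspool -l     # drop the SNDR sets into logging mode
  zpool import -f tank         # pool still looks "in use" by host01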
[jumping ahead and quoting myself] AVS is not a mirroring technology, it is a remote replication technology. So, yes, I agree 100% that people should not expect AVS to be a mirror.

Ralf Ramge wrote:
> Wade.Stuart at fallon.com wrote:
>> War wounds? Could you please expand on the why a bit more?
>
> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use). No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.

ZFS isn't special in this regard; AFAIK all file systems, databases and other data stores suffer from the same issue with remote replication.

> - AVS is not ZFS aware. For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact. That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, resulting in usually 10+ hours without real redundancy (customers who use Thumpers to store important data usually don't have the budget to connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

ZFS only resilvers data. Other LVMs, like SVM, will resilver the entire disk, though.

> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication. The X4500 does not have a RAID controller which signals (and handles) drive faults. Drive failures on the secondary node may happen unnoticed until the primary node goes down and you want to import the zpool on the secondary node with the broken drive. Since ZFS doesn't offer a recovery mechanism like fsck, data loss of up to 20 TB may occur. If you use AVS with ZFS, make sure that you have storage which handles drive failures without OS interaction.

If this is the case, then array-based replication would also be similarly affected by this architectural problem. In other words, if you say that a software RAID system cannot be replicated by a software replicator, then TrueCopy, SRDF, and other RAID array-based (also software) replicators also do not work. I think there is enough empirical evidence that they do work. I can see where there might be a best practice here, but I see no fundamental issue.

fsck does not recover data, it only recovers metadata.

> - 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives in total.

ZFS only scrubs data. But it is not unusual for a lot of data scrubbing to take a long time. ZFS only performs read scrubs, so there is no replication required during a ZFS scrub, unless data is repaired.

> - An X4500 has no battery-buffered write cache. ZFS uses the server's RAM as a cache, 15 GB+. I don't want to find out how much time a resilver over the network after a power outage may take (a full reverse replication would take up to 2 weeks and is no valid option in a serious production environment). But the underlying question I asked myself is why should I want to replicate data in such an expensive way, when I think the 48 TB of data itself is not important enough to be protected by a battery?

ZFS will not be storing 15 GBytes of unflushed data on any system I can imagine today.
While we can all agree that 48 TBytes will be painful to replicate, that is not caused by ZFS -- though it is enabled by ZFS, because some other file systems (UFS) cannot be as large as 48 TBytes.

> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.

I think there are opportunities for performance improvement, but I don't know who is currently actively working on this. Actually, the cases where ZFS on whole disks is a big win are small. And, of course, you can enable disk write caches by hand.

> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.
>
> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.

AVS is not a mirroring technology, it is a remote replication technology. So, yes, I agree 100% that people should not expect AVS to be a mirror.

An earlier discussion on this forum dealt with the details of when the write ordering must be preserved for ongoing operation. But when a full resync is required, the write ordering is not preserved. The theory is that this might affect ZFS more so than other file systems, or perhaps ZFS might notice it more than other file systems. But again, this affects other remote replication technologies, also.

-- richard
Jorgen,

> If we get two x4500s, and look at AVS, would it be possible to:
>
> 1) Setup AVS to replicate zfs, and zvol (ufs) from 01 -> 02? Supported by Sol 10 5/08?

For Solaris 10, one will need to purchase AVS. It was not until OpenSolaris that AVS became bundled. Also, the OpenSolaris version will not run on Solaris 10.

> Assuming 1, if we setup a home-made IP fail-over so that; should 01 go down, all clients are redirected to 02.
>
> 2) Fail-back, are there methods in AVS to handle fail-back?

Yes, it's called SNDR reverse synchronization, and it is a key feature of SNDR and its ability to create a DR site.

> Since 02 has been used, it will have newer/modified files, and will need to replicate backwards until synchronised, before fail-back can occur.

SNDR supports on-demand pull, which means that once reverse synchronization has been started, the SNDR primary volumes can be accessed. In addition to the background resilvering of differences, those blocks requested "on demand" will be included in the reverse synchronization.

> We did ask our vendor, but we were just told that AVS does not support x4500.

AVS works with any Solaris block storage device, independent of platform. Period.

> Lund
>
> --
> Jorgen Lundman       | <lundman at lundman.net>
> Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
> Japan                | +81 (0)3 -3375-1767 (home)

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
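To put the reverse synchronization Jim describes into concrete terms, a fail-back after running on the secondary might look roughly like the following sketch; the group name is a placeholder, and the exact procedure (where the pools are imported, when the application is quiesced) depends on the deployment:

  # quiesce the application on host02 first, then on host01 (original primary):
  sndradm -n -g zfspool -l       # make sure the sets are in logging mode
  sndradm -n -g zfspool -u -r    # reverse update sync: copy changes host02 -> host01
  sndradm -n -g zfspool -w       # wait for the resync to complete
  # then import the pool on host01 again and resume normal replication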
Ralf,

> Wade.Stuart at fallon.com wrote:
>> War wounds? Could you please expand on the why a bit more?
>
> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use).

This is not true. If the primary node invokes "zpool export" while replication is still active, then a forced "zpool import" is not required. This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

> No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.

This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

> - AVS is not ZFS aware.

AVS is not UFS, QFS, Oracle or Sybase aware either. This makes AVS, and other host-based and controller-based replication services, multi-functional. If you desire ZFS-aware functionality, use ZFS send and recv.

> For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact.

The complete disk IS NOT sent over the network to the secondary node, only those disk blocks that are re-written by ZFS. This has to be this way, since ZFS does not differentiate between writes caused by resilvering and writes caused by new ZFS filesystem operations. Furthermore, only those portions of the ZFS storage pool are replicated in this scenario, not every block in the entire storage pool.

> That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, resulting in usually 10+ hours without real redundancy (customers who use Thumpers to store important data usually don't have the budget to connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

If one creates a ZFS storage pool whose size is 1 TB, then enables AVS after the fact, AVS cannot differentiate between blocks that are in use by ZFS and those that are not, therefore AVS needs to replicate the entire TB of storage. If one enables AVS first, before the volumes are placed in a ZFS storage pool, then the "sndradm -E ..." option can be used. Then, when the ZFS storage pool is created, only those I/Os needed to initialize the pool need be replicated.

If one has a ZFS storage pool that is quite large, but in actuality little of the storage pool is in use, then by enabling SNDR first on a replacement volume and invoking "zpool replace ..." on multiple vdevs in the storage pool, an optimal replication of the ZFS storage pool can be done.

> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication.

This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

> The X4500 does not have a RAID controller which signals (and handles) drive faults. Drive failures on the secondary node may happen unnoticed until the primary node goes down and you want to import the zpool on the secondary node with the broken drive. Since ZFS doesn't offer a recovery mechanism like fsck, data loss of up to 20 TB may occur. If you use AVS with ZFS, make sure that you have storage which handles drive failures without OS interaction.
>
> - 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives in total.
>
> - An X4500 has no battery-buffered write cache. ZFS uses the server's RAM as a cache, 15 GB+.
> I don't want to find out how much time a resilver over the network after a power outage may take (a full reverse replication would take up to 2 weeks and is no valid option in a serious production environment). But the underlying question I asked myself is why should I want to replicate data in such an expensive way, when I think the 48 TB of data itself is not important enough to be protected by a battery?

I don't understand the relevance to AVS in the prior three paragraphs.

> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.

When you have the time, can you replace the "probably because of ..." with some real performance numbers?

> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.

When you have the time, can you replace the "AVS seems to ..." with some specific references to what you are referring to?

> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.

You are offering up these position statements based on what?

> --
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
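A rough sketch of the enable ordering Jim describes, with placeholder host, device and pool names; the point is that -E (equal enable) is only safe while both sides still hold "don't care" data:

  # 1. enable SNDR with -E before the volumes carry any ZFS data
  sndradm -E host01 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             host02 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             ip async g zfspool

  # 2. only then create the pool; just the pool-initialization I/O is replicated
  zpool create tank mirror c1t2d0s0 c1t3d0s0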
Jim Dunham wrote:
[...]

Jim, at first: I never said that AVS is a bad product. And I never will. I wonder why you act as if you were attacked personally. To be honest, if I were a customer with the original question, such a reaction wouldn't make me feel safer.

>> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use).
>
> This is not true. If the primary node invokes "zpool export" while replication is still active, then a forced "zpool import" is not required. This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

Jim. A graceful shutdown of the primary node may be a valid disaster scenario in the laboratory, but it never will be in real life.

>> No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.
>
> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

And what makes you think that I said that AVS is the problem here?

And by the way, the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

>> - AVS is not ZFS aware.
>
> AVS is not UFS, QFS, Oracle or Sybase aware either. This makes AVS, and other host-based and controller-based replication services, multi-functional. If you desire ZFS-aware functionality, use ZFS send and recv.

Yes, exactly. And that's the problem, since `zfs send` and `zfs receive` are no working solution in a fail-safe two-node environment. Again: the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

>> For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact.
>
> The complete disk IS NOT sent over the network to the secondary node, only those disk blocks that are re-written by ZFS.

Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS. Again: and what makes you think that I said that AVS is the problem here?

>> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication.
>
> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

Again: and what makes you think that I said that AVS is the problem here? We are not on avs-discuss, Jim.

> I don't understand the relevance to AVS in the prior three paragraphs?

We are not on avs-discuss, Jim. The customer wanted to know what drawbacks exist in his *scenario*. Not AVS.

>> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.
>
> When you have the time, can you replace the "probably because of ..." with some real performance numbers?

No problem.
If you please organize a Try&Buy of two X4500 servers being sent to my address, thank you.

>> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.
>
> When you have the time, can you replace the "AVS seems to ..." with some specific references to what you are referring to?

The installation and configuration process, and the location where AVS wants to store the shared database. I can tell you details about it the next time I give it a try. Until then, please read the last sentence you quoted once more, thank you.

>> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.
>
> You are offering up these position statements based on what?

My outline agreements, my support contracts, the partner web desk, and finally my experience with projects in high-availability scenarios with tens of thousands of servers.

Jim, it's okay. I know that you're a project leader at Sun Microsystems and that AVS is your main concern. But if there's one thing I cannot stand, it's getting stroppy replies from someone who should know better and should have realized that he's acting publicly and in front of the people who finance his income, instead of trying to start a flame war. From now on, I leave the rest to you, because I earn my living with products of Sun Microsystems, too, and I don't want to damage either Sun or this mailing list.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
Richard Elling wrote:
>> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.
>
> No, this is not true. ZFS only resilvers data.

Okay, I see we have a communication problem here. Probably my fault; I should have written "the entire data and metadata". I made the assumption that a 1 TB drive in an X4500 may have up to 1 TB of data on it, simply because nobody buys the 1 TB X4500 just to use 10% of the disk space - he would have bought the 250 GB, 500 GB or 750 GB model then. In any case, and with any disk size, that's something you don't want to have on your network if there's a chance to avoid it.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
Ralf Ramge wrote:
> Richard Elling wrote:
>>> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.
>>
>> No, this is not true. ZFS only resilvers data.
>
> Okay, I see we have a communication problem here. Probably my fault; I should have written "the entire data and metadata". I made the assumption that a 1 TB drive in an X4500 may have up to 1 TB of data on it, simply because nobody buys the 1 TB X4500 just to use 10% of the disk space - he would have bought the 250 GB, 500 GB or 750 GB model then.

Actually, they do :-) Some storage vendors insist on it, to keep performance up -- short-stroking.

I've done several large-scale surveys of this and the average usage is 50%. This is still a large difference in resilver times between ZFS and SVM.

> In any case, and with any disk size, that's something you don't want to have on your network if there's a chance to avoid it.

Agree 100%.
-- richard
On 09.09.08 19:32, Richard Elling wrote:
> Ralf Ramge wrote:
>> Richard Elling wrote:
>>>> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.
>>> No, this is not true. ZFS only resilvers data.
>> Okay, I see we have a communication problem here. Probably my fault; I should have written "the entire data and metadata". I made the assumption that a 1 TB drive in an X4500 may have up to 1 TB of data on it, simply because nobody buys the 1 TB X4500 just to use 10% of the disk space - he would have bought the 250 GB, 500 GB or 750 GB model then.
>
> Actually, they do :-) Some storage vendors insist on it, to keep performance up -- short-stroking.
>
> I've done several large-scale surveys of this and the average usage is 50%. This is still a large difference in resilver times between ZFS and SVM.

There is RFE 6722786 "resilver on mirror could reduce window of vulnerability" which is aimed at reducing this difference for mirrors. See here:

http://bugs.opensolaris.org/view_bug.do?bug_id=6722786

Wbr,
Victor
Just to clarify a few items... consider a setup where we desire to use AVS to replicate the ZFS pool on a 4-drive server to like hardware. The 4 drives are set up as RAID-Z.

If we lose a drive (say #2) in the primary server, RAID-Z will take over and our data will still be "available", but the array is in a degraded state.

But what happens to the secondary server? Specifically to its bit-for-bit copy of drive #2... presumably it is still good, but ZFS will offline that disk on the primary server, replicate the metadata, and when/if I "promote" the secondary server, it will also be running in a degraded state (ie: 3 out of 4 drives). Correct?

In this scenario, my replication hasn't really bought me any increased availability... or am I missing something?

Also, if I do choose to fail over to the secondary, can I just scrub the "broken" drive (which isn't really broken, but the zpool would be inconsistent at some level with the other "online" drives) and get back to "full speed" quickly? Or will I always have to wait until one of the servers resilvers itself (from scratch?) and re-replicates itself?

Thanks in advance.

-Matt
--
This message posted from opensolaris.org
Matt Beebe wrote:
> But what happens to the secondary server? Specifically to its bit-for-bit copy of drive #2... presumably it is still good, but ZFS will offline that disk on the primary server, replicate the metadata, and when/if I "promote" the secondary server, it will also be running in a degraded state (ie: 3 out of 4 drives). Correct?

Correct.

> In this scenario, my replication hasn't really bought me any increased availability... or am I missing something?

No. You have an increase of availability when the entire primary node goes down, but you're not particularly safer when it comes to degraded zpools.

> Also, if I do choose to fail over to the secondary, can I just scrub the "broken" drive (which isn't really broken, but the zpool would be inconsistent at some level with the other "online" drives) and get back to "full speed" quickly? Or will I always have to wait until one of the servers resilvers itself (from scratch?) and re-replicates itself?

I have not tested this scenario, so I can't say anything about this.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
Ralf,

> Jim, at first: I never said that AVS is a bad product. And I never will. I wonder why you act as if you were attacked personally. To be honest, if I were a customer with the original question, such a reaction wouldn't make me feel safer.

I am sorry that my response came across that way; it was not intentional.

>>> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use).
>> This is not true. If the primary node invokes "zpool export" while replication is still active, then a forced "zpool import" is not required. This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.
>
> Jim. A graceful shutdown of the primary node may be a valid disaster scenario in the laboratory, but it never will be in real life.

I agree with your assessment that in real life a 'zpool export' will never be done in a real disaster, but unconditionally doing a forced 'zpool import' is problematic. Prior to performing the forced import, one needs to assure that the primary node is actually down and is not in the process of booting up, or that replication is stopped and will not automatically resume. Failure to make these checks prior to a forced 'zpool import' could lead to scenarios where two or more instances of ZFS are accessing the same ZFS storage pool, each attempting to write their own metadata, and thus their own CRCs. In time this will result in CRC checksum failures on reads, followed by a ZFS-induced panic.

>>> No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.
>> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.
>
> And what makes you think that I said that AVS is the problem here?
>
> And by the way, the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

There is a mechanism to prevent data loss here: it's AVS! This is the reasoning behind questioning the association made above of replication being part of the problem, where in fact, as replication is implemented with AVS, it is actually part of the solution. If one does not follow the guidance suggested above before invoking a forced 'zpool import', the action will likely result in on-disk CRC checksum inconsistencies within the ZFS storage pool, resulting in secondary node data loss, the initial point above. Since AVS replication is unidirectional, there is no data loss on the primary node, and when replication is resumed, AVS will undo the faulty secondary node writes, correcting the actual data loss and in time restoring 100% synchronization of the ZFS storage pool between the primary and secondary nodes.

>>> - AVS is not ZFS aware.
>> AVS is not UFS, QFS, Oracle or Sybase aware either. This makes AVS, and other host-based and controller-based replication services, multi-functional. If you desire ZFS-aware functionality, use ZFS send and recv.
>
> Yes, exactly. And that's the problem, since `zfs send` and `zfs receive` are no working solution in a fail-safe two-node environment. Again: the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

My takeaway from this is that both AVS and ZFS are data path services, but collectively they are not on their own a complete disaster recovery solution.
Since AVS is not aware of ZFS, and vice versa, additional software in the form of Solaris Cluster, GeoCluster or other developed software needs to provide the awareness, so that viable disaster recovery solutions can be possible, and supportable.

>>> For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact.

The problem with this statement is that one cannot guarantee that the replicated data on the secondary is intact, specifically that the data is 100% identical to the non-failing side of the mirror on the primary node. Of course, if this guarantee could be assured, then an "sndradm -E ..." (equal enable) could be done, and the full disk copy could be avoided. But all is not lost...

A failure in writing to a mirrored volume almost assures that the data will be different, by at least one I/O: the one that triggered the initial failure of the mirror. The momentary upside is that AVS is interposed above the failing volume, so that the I/O will get replicated even if it failed to make it to the disk. The downside is that with ZFS (or any other mirroring software), once a failure is detected by the mirroring software, it will stop writing to the side of the mirror containing the failed disk (and thus the configured AVS replica), but will still continue to write to the non-failing side of the mirror. This assures that the good side of the mirror and the replica will be out of sync.

>> The complete disk IS NOT sent over the network to the secondary node, only those disk blocks that are re-written by ZFS.
>
> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.

I have to believe that the issue being referred to is an order-of-enabling issue. One needs to enable AVS before ZFS. Let me explain.

If I have a replacement volume that has yet to be given to ZFS, it contains unknown data. Likewise, its replacement volume on the secondary node also contains unknown data (even if this volume is the one above, as it is known not to be 100% intact). If one was to enable these two volumes with "sndradm -E ...", where 'E' means equal enable, this means to the replication software that unknown data = unknown data, therefore no replication is needed to bring the two volumes into synchronization.

Now when one gives the primary node volume to ZFS as a replacement, ZFS, and thus AVS, only needs to rewrite those metadata and data blocks that are in use by ZFS on the remaining good side of the mirror. This means a full copy is avoided, unless of course the volume is full.

Conversely, if one gives the replacement volume to ZFS prior to enabling the volume in AVS for replication, then "sndradm -E ..." cannot be used, as the volumes are not starting out equal and AVS was not running to scoreboard the differences. Therefore "sndradm -e ..." must be used, and in this case the entire disk will be replicated.

> Again: and what makes you think that I said that AVS is the problem here?
>
>>> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication.
>> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.
>
> Again: and what makes you think that I said that AVS is the problem here?
> We are not on avs-discuss, Jim.

Your association of "ZFS & AVS & X4500" - this is purely a ZFS issue. The problem at hand is that a ZFS storage pool cannot be concurrently accessed by two or more instances of ZFS. This is true for both shared storage and replicated storage. This remains true even if one instance of ZFS will be operating in a read-only mode.

>> I don't understand the relevance to AVS in the prior three paragraphs?
>
> We are not on avs-discuss, Jim. The customer wanted to know what drawbacks exist in his *scenario*. Not AVS.
>
>>> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.
>> When you have the time, can you replace the "probably because of ..." with some real performance numbers?
>
> No problem. If you please organize a Try&Buy of two X4500 servers being sent to my address, thank you.

Done:
http://blogs.sun.com/AVS/entry/sun_storagetek_availability_suite_4
http://www.sun.com/tryandbuy/specialoffers.jsp

>>> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.
>> When you have the time, can you replace the "AVS seems to ..." with some specific references to what you are referring to?
>
> The installation and configuration process, and the location where AVS wants to store the shared database. I can tell you details about it the next time I give it a try. Until then, please read the last sentence you quoted once more, thank you.

The design of AVS in a failover Sun Cluster requires shared access to AVS's cluster-wide configuration data. This data is fixed at ~16.5 MB, and must be contained on a single volume that can be concurrently accessed by all nodes in a Sun Cluster. At the time AVS was enhanced to support Sun Cluster, various options were taken under consideration, and this was the design selected, such as it may be.

FWIW: Across all of Solaris, there are various methods of maintaining persistent configuration data. Sun Cluster uses its CCR database, SVM uses its metadb database, Solaris is starting to use its SCF database (part of SMF), the list goes on and on. The AVS developers approached the Sun Cluster developers asking to use their CCR database mechanism, but at the time the answer was no. At this time it would be hard to reconsider this position.

>>> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.
>> You are offering up these position statements based on what?
>
> My outline agreements, my support contracts, the partner web desk, and finally my experience with projects in high-availability scenarios with tens of thousands of servers.
>
> Jim, it's okay.
> I know that you're a project leader at Sun Microsystems and that AVS is your main concern. But if there's one thing I cannot stand, it's getting stroppy replies from someone who should know better and should have realized that he's acting publicly and in front of the people who finance his income, instead of trying to start a flame war. From now on, I leave the rest to you, because I earn my living with products of Sun Microsystems, too, and I don't want to damage either Sun or this mailing list.

My reasoning for posting not only the original but also a subsequent reply is that AVS is constantly bombarded by "war wounds", where in fact many of these stories exist due in part to the fact that developing and deploying disaster recovery or high availability solutions is not easy. ZFS is the new "battlefront", allowing for opportunities to learn about ZFS, AVS and other replication technologies. In their day, similar "war wounds" and successful "battles" have been had regarding AVS in use with UFS, QFS, VxFS, SVM, VxVM, Oracle, Sybase and others.

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.

> --
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA
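A rough sketch of the replacement-disk procedure Jim outlines above, with placeholder host, pool and device names; the -E enable is done on the new, still-unused volume pair before ZFS gets the disk:

  # new disk present on both nodes, not yet part of the pool: enable as "already equal"
  sndradm -E host01 /dev/rdsk/c5t4d0s0 /dev/rdsk/c5t4d0s1 \
             host02 /dev/rdsk/c5t4d0s0 /dev/rdsk/c5t4d0s1 \
             ip async g zfspool

  # only now hand the volume to ZFS; just the resilvered blocks cross the wire
  zpool replace tank c5t3d0s0 c5t4d0s0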
Matt,

> Just to clarify a few items... consider a setup where we desire to use AVS to replicate the ZFS pool on a 4-drive server to like hardware. The 4 drives are set up as RAID-Z.
>
> If we lose a drive (say #2) in the primary server, RAID-Z will take over and our data will still be "available", but the array is in a degraded state.
>
> But what happens to the secondary server? Specifically to its bit-for-bit copy of drive #2... presumably it is still good, but ZFS will offline that disk on the primary server, replicate the metadata, and when/if I "promote" the secondary server, it will also be running in a degraded state (ie: 3 out of 4 drives). Correct?

The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.

> In this scenario, my replication hasn't really bought me any increased availability... or am I missing something?

In testing with ZFS in this scenario, first of all the secondary node's zpool is not in the imported state. So if one stops replication, or there is a primary node failure, a zpool import operation will need to be done on the secondary node. In all my testing to date, ZFS does the correct thing, realizing that one disk had failed out of the RAID set on the primary, and thus not to use it on the secondary. In short, ZFS knows that the RAID set is degraded, was being maintained in a degraded state, and this fact was replicated to the secondary node, correctly.

> Also, if I do choose to fail over to the secondary, can I just scrub the "broken" drive (which isn't really broken, but the zpool would be inconsistent at some level with the other "online" drives) and get back to "full speed" quickly? Or will I always have to wait until one of the servers resilvers itself (from scratch?) and re-replicates itself?
>
> Thanks in advance.
>
> -Matt

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.

Does raidz fall into that category? Since the parity is maintained only on written blocks rather than all disk blocks on all columns, it seems to be resistant to this issue.

--
Darren
On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>
> Does raidz fall into that category?

Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.

Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.

> Since the parity is maintained only on written blocks rather than all disk blocks on all columns, it seems to be resistant to this issue.
>
> --
> Darren

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:
> On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
>> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>>
>> Does raidz fall into that category?
>
> Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.
>
> Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.

Then I misunderstood what you were talking about. Why the restriction on RAID >1 for your statement? Even for a mirror, the data is stale and it's removed from the active set. I thought you were talking about block parity run across columns...

--
Darren
On Sep 11, 2008, at 5:16 PM, A Darren Dunham wrote:
> On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:
>> On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
>>> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>>>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>>>
>>> Does raidz fall into that category?
>>
>> Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.
>>
>> Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.
>
> Then I misunderstood what you were talking about. Why the restriction on RAID >1 for your statement?

No restriction. I meant to say, RAID 1 or greater.

> Even for a mirror, the data is stale and it's removed from the active set. I thought you were talking about block parity run across columns...
>
> --
> Darren

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
work: 781-442-4042
cell: 603.724.2972
Sorry, I popped up to Hokkaido for a holiday. I want to thank you all for the replies.

I mentioned AVS as I thought it to be the only product close to enabling us to do a (makeshift) fail-over setup. We have 5-6 ZFS filesystems, and 5-6 zvols with UFS (for quotas). To do "zfs send" snapshots every minute might perhaps be possible (just not very attractive), but if the script dies at any time, you need to resend the full volumes; this currently takes 5 days (even using "nc").

Since we are forced by the vendor to run Sol 10, it sounds like AVS is not an option for us.

If we were interested in finding a method to replicate data to a 2nd X4500, what other options are there for us? We do not need instant updates, just someplace to fail over to when the X4500 panics, or an HDD dies (which equals a panic). It currently takes 2 hours to fsck the UFS volumes after a panic (and yes, they are logging; it is actually just the one UFS volume that always needs fsck).

The vendor has mentioned "Veritas Volume Replicator", but I was under the impression that Veritas is a whole different set to zfs/zpool.

Lund

Jim Dunham wrote:
> On Sep 11, 2008, at 5:16 PM, A Darren Dunham wrote:
>> On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:
>>> On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
>>>> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>>>>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>>>> Does raidz fall into that category?
>>> Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.
>>>
>>> Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.
>> Then I misunderstood what you were talking about. Why the restriction on RAID >1 for your statement?
>
> No restriction. I meant to say, RAID 1 or greater.
>
>> Even for a mirror, the data is stale and it's removed from the active set. I thought you were talking about block parity run across columns...
>
> Jim Dunham
> Engineering Manager
> Storage Platform Software Group
> Sun Microsystems, Inc.
> work: 781-442-4042
> cell: 603.724.2972

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
Jorgen Lundman wrote:> If we were interested in finding a method to replicate data to a 2nd > x4500, what other options are there for us?If you already have an X4500, I think the best option for you is a cron job with incremental 'zfs send'. Or rsync. -- Ralf Ramge Senior Solaris Administrator, SCNA, SCSA Tel. +49-721-91374-3963 ralf.ramge at webde.de - http://web.de/ 1&1 Internet AG Brauerstraße 48 76135 Karlsruhe Amtsgericht Montabaur HRB 6484 Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss Aufsichtsratsvorsitzender: Michael Scheeren
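A minimal sketch of such a cron job, assuming a filesystem "tank/home", a standby host "x4500-02" reachable with key-based ssh, and that an initial full send/receive has already seeded the standby (all names are placeholders; error handling and snapshot cleanup are omitted):

#!/bin/sh
# incremental replication: snapshot, send only the delta, remember the snapshot
FS=tank/home
HOST=x4500-02
STATE=/var/run/last-repl-snap
NOW=repl-`date +%Y%m%d%H%M`
PREV=`cat $STATE`
zfs snapshot $FS@$NOW
# -F rolls the receiving side back to the last common snapshot before applying the delta
zfs send -i $FS@$PREV $FS@$NOW | ssh $HOST zfs receive -F $FS && echo $NOW > $STATE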
On Tue, Sep 16, 2008 at 11:51 PM, Ralf Ramge <ralf.ramge at webde.de> wrote:> Jorgen Lundman wrote: > >> If we were interested in finding a method to replicate data to a 2nd >> x4500, what other options are there for us? > > If you already have an X4500, I think the best option for you is a cron > job with incremental 'zfs send'. Or rsync. > > -- > > Ralf Ramge > Senior Solaris Administrator, SCNA, SCSA >We had some Sun reps come out the other day to talk to us about storage options, and part of the discussion was AVS replication with ZFS. I brought up the question of replicating the resilvering process, and the reps said it does not replicate. They may be mistaken, but I'm hopeful they are correct. Could this behavior have been changed recently on AVS to make replication 'smarter' with ZFS as the underlying filesystem? -- Brent Jones brent at servuhome.net
Brent,> On Tue, Sep 16, 2008 at 11:51 PM, Ralf Ramge <ralf.ramge at webde.de> > wrote: >> Jorgen Lundman wrote: >> >>> If we were interested in finding a method to replicate data to a 2nd >>> x4500, what other options are there for us? >> >> If you already have an X4500, I think the best option for you is a >> cron >> job with incremental 'zfs send'. Or rsync. >> >> -- >> >> Ralf Ramge >> Senior Solaris Administrator, SCNA, SCSA >> > > We had some Sun reps come out the other day to talk to us about > storage options, and part of the discussion was AVS replication with > ZFS. > I brought up the question of replicating the resilvering process, and > the reps said it does not replicate. They may be mistaken, but I'm > hopeful they are correct.The resilvering process is replicated, as AVS cannot differentiate between ZFS resilvering writes and ZFS filesystem writes.> Could this behavior have been changed recently on AVS to make > replication 'smarter' with ZFS as the underlying filesystem?No 'smarter' changes have been made to AVS. The issue at hand is that as soon as ZFS makes the decision to not write to one of its configured vdevs, that vdev and its replica will now contain stale data. When ZFS is told to use the vdev again (via a zpool replace), ZFS starts resilvering all in-use data, plus any new ZFS filesystem writes on the local vdev, both of which will be replicated by AVS. It is the mixture of both resilvering writes and new ZFS filesystem writes that makes it impossible for AVS to make replication 'smarter'.> -- > Brent Jones > brent at servuhome.net > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussJim Dunham Engineering Manager Storage Platform Software Group Sun Microsystems, Inc.
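For example (pool and device names made up), the replace that triggers such a resilver is simply:

# replace the failed disk; the resilver writes that follow look like ordinary
# writes to AVS and are therefore replicated to the remote node as well
zpool replace tank c1t4d0 c1t5d0
zpool status tank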
Jim Dunham wrote:> It is the mixture of both resilvering writes and new ZFS filesystem > writes that makes it impossible for AVS to make replication 'smarter'.Jim is right here. I just want to add that I don't see an obvious way to make AVS as "smart" as Brent may wish it to be. Sometimes I describe AVS as a low-level service with some proxy functionalities. That's not really correct, but good enough for a single PowerPoint sheet. AVS receives the writes from the file system, and replicates them. It does not care about the contents of the transactions, just as IP doesn't take care of the responsibilities of higher-layer protocols like TCP or even layer-7 data (bad comparison, I know, but it may help to understand what I mean). What AVS does is copy the contents of devices. A file system writes some data to a sector on a hard disk -> AVS is aware of this transaction -> AVS replicates the sector to the second host -> on the secondary host, AVS makes sure that *exactly* the same data is written to *exactly* the same position on the secondary host's storage device. Your secondary storage is a 100% copy. And if you write a bazillion 0-byte sectors to the disk with `dd`, AVS will make sure that the secondary does it, too. And it does this in near real time (if you ignore the network bottlenecks). The downside of it: it's easy to do something wrong, and you may run into network bottlenecks due to a higher amount of traffic. What AVS can't offer: file-based replication. In many cases, you don't have to care about having an exact copy of a device. For example, if you want a standby solution for your NFS file server, you want to keep the contents of the files and directories in sync. You don't care if a newly written file uses the same inode number. You only care that the file is copied to your backup host while the file system of the backup host is *mounted*. The best-known service for this functionality is `rsync`. And if you know rsync, you know the downside of these services, too: don't even think about replicating your data in real time and/or to multiple servers. The challenge is to find out which kind of replication suits your concept better. For instance, if you want to replicate html pages, graphics or other documents, perhaps even with a "copy button" on an intranet page, file-based replication is your friend. If you need real-time copying or device replication, for instance on a database server with its own file system, or for keeping configuration files in sync across a cluster, then AVS is your best bet. But let's face it: everybody wants the best of both worlds, and so people ask if AVS could not just get smarter. The answer: no, not really. It can't check if the file system's write operations "make sense" or if the data "really needs to be replicated". AVS is a truck which guarantees fast and accurate delivery of whatever you throw into it. Taking care of the content itself is the job of the person who prepares the freight. And, in our case, this person is called UFS. Or ZFS. And ZFS could do a much better job here. Sun's marketing sells ZFS as offering data integrity at *all times* (http://www.sun.com/2004-0914/feature/). Well, that's true, at least as long as there is no problem on lower layers. And I often wondered if ZFS doesn't offer something fsck-like for faulted pools because it's technically impossible, or because the marketing guys forbade it.
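A file-based counterpart in its simplest form, with made-up host and path names, would be something like:

#!/bin/sh
# copy the live file tree to the standby host; the standby's file system stays
# mounted and usable, but this is neither real-time nor block-exact
rsync -aH --delete /tank/home/ x4500-02:/tank/home/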
I also wondered why people are enthusiastic about gimmicks like ditto blocks, but don't want data protection in case an X4540 suffers a power outage and lots of gigabytes of ZFS cache go down the drain. Proposal: ZFS should offer some kind of "IsReplicated" flag in the zpool metadata. During a `zpool import`, this flag should be checked and, if it is set, a corresponding error message should be printed on stdout. Or the ability to set dummy zpool parameters, something like a "zpool set storage:cluster:avs=true tank". This would only be some kind of first aid, but that's better than nothing. This is not specific to AVS; it also applies to other replication services. It would allow us to write simple wrapper scripts to switch the replication mechanism into logging mode, thus allowing us to safely force the import of the zpool in case of a disaster. Of course, it would be even better to integrate AVS into ZFS itself. "zfs set replication=<hostname1>[,<hostname2>...<hostnameN>]" would be the coolest thing on earth, because it would combine the benefits of AVS and rsync-like replication into a perfect product. And it would allow the marketing people to use the "high availability" and "full data redundancy" buzzwords in their flyers. But until then, I'll have to continue using cron jobs on the secondary node which try to log in to the primary with ssh and do a "zfs get storage:cluster:avs <filesystem>" on all mounted file systems and save it locally for my "zpool import wrapper" script. This is a cheap workaround, but honestly: you can use something like this for your own datacenter, but I bet nobody wants to sell it to a customer as a supported solution ;-) -- Ralf Ramge Senior Solaris Administrator, SCNA, SCSA Tel. +49-721-91374-3963 ralf.ramge at webde.de - http://web.de/ 1&1 Internet AG Brauerstraße 48 76135 Karlsruhe Amtsgericht Montabaur HRB 6484 Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss Aufsichtsratsvorsitzender: Michael Scheeren
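A rough sketch of that workaround (host names and the user property are only examples; ZFS user properties just need to contain a colon):

#!/bin/sh
# cron job on the secondary node: cache the primary's replication flags locally
# so a "zpool import" wrapper can check them before forcing an import
# (on the primary, file systems were tagged once with something like:
#   zfs set storage:cluster:avs=true tank/home)
PRIMARY=x4500-01
ssh $PRIMARY zfs get -H -o name,value storage:cluster:avs > /var/tmp/avs-flags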