Hi,

When I read the ZFS manual, it usually recommends configuring redundancy at the ZFS layer, mainly because there are features that only work with a redundant configuration (such as correction of corrupted data), and it also implies that overall robustness will improve.

My question is simple: what is the recommended configuration on a SAN (on high-end EMC, like the Symmetrix DMX series, for example) where redundancy is usually configured at the array level, so most likely we would use a simple ZFS layout, without redundancy?

Is it worth moving the redundancy from the SAN array layer to the ZFS layer? (Configuring redundancy on both layers sounds like a waste to me.) There are certain advantages to having redundancy configured on the array (beyond protection against simple disk failure). Can we compare the advantages of having (for example) RAID5 configured on a high-end SAN with no redundancy at the ZFS layer versus no redundant RAID configuration on the high-end SAN but raidz or raidz2 at the ZFS layer?

Any tests, experience or best practices regarding this topic? How does ZFS perform (from a performance and robustness (or availability, if you like) point of view) on high-end SANs, compared to VxFS for example?

If you could share your experience with me, I would really appreciate it.

Regards,
sendai
--
This message posted from opensolaris.org
Damon,

Yes, we can provide simple concat inside the array (even though today we provide RAID5 or RAID1 as our standard, and use Veritas with concat); the question is more whether it's worth switching the redundancy from the array to the ZFS layer.

The RAID5/1 features of the high-end EMC arrays also provide performance improvements, which is why I wonder what the pros and cons of such a switch would be (I mean the switch of the redundancy from the array to the ZFS layer).

So, you are telling me that even if the SAN provides redundancy (HW RAID5 or RAID1), people still configure ZFS with either raidz or mirror?

Regards,
sendai

On Sat, Feb 14, 2009 at 6:06 AM, Damon Atkins <Damon.Atkins at _no_spam_yahoo.com.au> wrote:
> Andras,
> If you can get concat disks or RAID 0 disks inside the array, then use raidz
> (if the I/O is not a large amount or is mostly sequential); if the I/O is
> very high, then use a ZFS mirror. You cannot spread a zpool over multiple
> EMC arrays using SRDF if you are not using EMC PowerPath.
>
> HDS, for example, does not support anything other than a Mirror or RAID5
> configuration, so raidz or a ZFS mirror results in a lot of wasted disk
> space. However, people still use raidz on HDS RAID5, as the top-of-the-line
> HDS arrays are very fast and they want the features offered by ZFS.
>
> Cheers
>
> --
> This message posted from opensolaris.org
On 14-Feb-09, at 2:40 AM, Andras Spitzer wrote:
> Damon,
>
> Yes, we can provide simple concat inside the array (even though today we
> provide RAID5 or RAID1 as our standard, and use Veritas with concat); the
> question is more whether it's worth switching the redundancy from the
> array to the ZFS layer.
>
> The RAID5/1 features of the high-end EMC arrays also provide performance
> improvements, which is why I wonder what the pros and cons of such a
> switch would be (I mean the switch of the redundancy from the array to
> the ZFS layer).
>
> So, you are telling me that even if the SAN provides redundancy (HW RAID5
> or RAID1), people still configure ZFS with either raidz or mirror?

Without doing so, you don't get the benefit of checksummed self-healing.

--Toby
Andras Spitzer wrote:
> Is it worth moving the redundancy from the SAN array layer to the ZFS layer?
> (Configuring redundancy on both layers sounds like a waste to me.) There are
> certain advantages to having redundancy configured on the array (beyond
> protection against simple disk failure). Can we compare the advantages of
> having (for example) RAID5 configured on a high-end SAN with no redundancy
> at the ZFS layer versus no redundant RAID configuration on the high-end SAN
> but raidz or raidz2 at the ZFS layer?
>
> Any tests, experience or best practices regarding this topic?

I would also like to hear about experiences with ZFS on EMC's Symmetrix. Currently we are using VxFS with PowerPath for multipathing, and synchronous SRDF for replication to our other datacenter. At some point we will move to ZFS, but there are so many options for how to implement this. From a sysadmin point of view (simplicity), I would like to use MPxIO and host-based mirroring. ZFS self-healing would be available in this configuration.

Asking the EMC guys for their opinion is not an option; they will push you to buy SRDF and PowerPath licenses... :-)
On Fri, 13 Feb 2009, Andras Spitzer wrote:
> So, you are telling me that even if the SAN provides redundancy (HW RAID5
> or RAID1), people still configure ZFS with either raidz or mirror?

When ZFS's redundancy features are used, there is a decreased risk of total pool failure. With redundancy at the ZFS level, errors may be corrected. With care in the pool design, more overall performance may be obtained, since a number of independent arrays may be pooled together to obtain more bandwidth and storage space. With this in mind, if the SAN hardware is known to work very well, placing the pool on a single SAN device is still an option.

If you do use ZFS's redundancy features, it is important to consider resilver time. Try to keep volume size small enough that it may be resilvered in a reasonable amount of time.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
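To make that layout concrete, here is a minimal sketch; the pool name and LUN device names are hypothetical placeholders for whatever the arrays actually present:

    # Each mirror pairs a LUN from array A with a LUN from array B, so ZFS
    # can detect and repair corruption, and a whole-array outage leaves the
    # pool degraded rather than faulted.
    zpool create tank \
        mirror c4t0d0 c5t0d0 \
        mirror c4t1d0 c5t1d0

    # Check the layout and the error counters afterwards.
    zpool status tank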
>>>>> "as" == Andras Spitzer <wsendai at gmail.com> writes:

    as> So, you are telling me that even if the SAN provides redundancy
    as> (HW RAID5 or RAID1), people still configure ZFS with either
    as> raidz or mirror?

There's some experience that, in the case where the storage device or the FC mesh glitches or reboots while the ZFS host stays up across the reboot, you are less likely to lose the whole pool to ``ZFS-8000-72: The pool metadata is corrupted and cannot be opened. Destroy the pool and restore from backup.'' if you have ZFS-level redundancy than if you don't.

Note that this ``corrupt and cannot be opened'' failure is a different problem from ``not being able to self-heal.'' When you need self-healing and don't have it, you usually shouldn't lose the whole pool. You should get a message in 'zpool status' telling you the name of a file that has unrecoverable errors. Any attempt to read the file returns an I/O error (not the marginal data). Then you have to go delete that file to clear the error, but otherwise the pool keeps working. In this self-heal case, if you'd had the ZFS-layer redundancy you'd get a count in the checksum column of one device and wouldn't have to delete the file; in fact, you wouldn't even know the name of the file that got healed.

Some people have been trying to blame the ``corrupt and cannot be opened'' failures on bit flips supposedly happening inside the storage or the FC cloud (the same kind of bit flip that causes the other, self-healable problem), but I don't buy it. I think it's probably cache sync / write barrier problems that are killing the unredundant pools on SANs.
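As a hedged illustration of the non-redundant, self-heal-missing case described above (the pool name is made up), the usual sequence is roughly:

    # List the files that have unrecoverable checksum errors.
    zpool status -v tank

    # Restore or delete each affected file, then clear the error
    # counters and re-verify the rest of the pool.
    zpool clear tank
    zpool scrub tank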
Hello Bob,

Saturday, February 14, 2009, 6:16:54 PM, you wrote:

BF> If you do use ZFS's redundancy features, it is important to consider
BF> resilver time. Try to keep volume size small enough that it may be
BF> resilvered in a reasonable amount of time.

Well, in most cases a resilver in ZFS should be quicker than a resilver in a disk array, because ZFS will resilver only the blocks which are actually in use, while most disk arrays will blindly resilver full disk drives. So assuming you still have plenty of unused disk space in the pool, a ZFS resilver should take less time.

--
Best regards,
Robert Milkowski
http://milek.blogspot.com
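A hedged sketch of what that looks like in practice (pool and device names are hypothetical); the progress reported by zpool status covers only allocated blocks, which is why a mostly-empty pool recovers quickly:

    # Replace a failed LUN with a spare one and watch the resilver.
    zpool replace tank c4t2d0 c4t6d0
    zpool status tank    # reports resilver progress until it completes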
On Sun, 15 Feb 2009, Robert Milkowski wrote:
> Well, in most cases a resilver in ZFS should be quicker than a resilver in
> a disk array, because ZFS will resilver only the blocks which are actually
> in use, while most disk arrays will blindly resilver full disk drives.
> So assuming you still have plenty of unused disk space in the pool, a ZFS
> resilver should take less time.

It is reasonable to assume that storage will eventually become close to full. Then the user becomes entrapped by their design. Adding to the issues is that as the ZFS pool ages and becomes full, it becomes slower as well due to increased fragmentation, and this fragmentation slows down resilver performance.

We have heard here from people who based their pool on a mirror of large multi-terabyte LUNs. This seemed to work fine initially, but later on (to their dismay) they discovered that it would take several days or a week to resilver one of the LUNs. The most severe cases were when the huge LUN was actually a ZFS volume exported via iSCSI from a server (e.g. a whole Thumper). When one of the LUNs gets rebooted, it takes quite a long time for ZFS to catch it up, and possibly it (or its peer) will be rebooted again in the meantime.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sun, Feb 15, 2009 at 5:00 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Sun, 15 Feb 2009, Robert Milkowski wrote:
>> Well, in most cases a resilver in ZFS should be quicker than a resilver in
>> a disk array, because ZFS will resilver only the blocks which are actually
>> in use, while most disk arrays will blindly resilver full disk drives.
>> So assuming you still have plenty of unused disk space in the pool, a ZFS
>> resilver should take less time.
>
> It is reasonable to assume that storage will eventually become close to
> full. Then the user becomes entrapped by their design. Adding to the
> issues is that as the ZFS pool ages and becomes full, it becomes slower as
> well due to increased fragmentation, and this fragmentation slows down
> resilver performance.

Pardon me for jumping into this discussion. I invariably lurk and keep my mouth firmly shut. In this case, however, curiosity and a degree of alarm bade me to jump in... could you elaborate on 'fragmentation', since the only context in which I know this is Windows? Now surely ZFS doesn't suffer from the same sickness?

As a follow-up: is there any ongoing sensible way to defend against the dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?

Forgive the "silly questions" from the sidelines... ignorance knows no bounds, apparently :)

Warm Regards,
-Colin
On Sun, 15 Feb 2009, Colin Raven wrote:
> Pardon me for jumping into this discussion. I invariably lurk and keep my
> mouth firmly shut. In this case, however, curiosity and a degree of alarm
> bade me to jump in... could you elaborate on 'fragmentation', since the
> only context in which I know this is Windows? Now surely ZFS doesn't
> suffer from the same sickness?

ZFS is "fragmented by design". Regardless, it takes steps to minimize fragmentation, and the costs of fragmentation. Files written sequentially at a reasonable rate of speed are usually contiguous on disk as well. A "slab" allocator is used in order to allocate space in larger units, and then dice this space up into ZFS 128K blocks so that related blocks will be close together on disk. The use of larger block sizes (default 128K vs. 4K or 8K) dramatically reduces the amount of disk seeking required for sequential I/O when fragmentation is present. Written data is buffered in RAM for up to 5 seconds before being written, so that opportunities for contiguous storage are improved. When the pool has multiple vdevs, then ZFS's "load share" can also intelligently allocate file blocks across multiple disks such that there is minimal head movement, and multiple seeks can take place at once.

> As a follow-up: is there any ongoing sensible way to defend against the
> dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?
> Forgive the "silly questions" from the sidelines... ignorance knows no
> bounds, apparently :)

The most important thing is to never operate your pool close to 100% full. Always leave a reserve so that ZFS can use reasonable block allocation policies and is not forced to allocate blocks in a way which causes an additional performance penalty. Installing more RAM in the system is likely to decrease fragmentation, since then ZFS can defer writes longer and make better choices about where to put the data.

Updating already-written portions of files "in place" will convert a completely contiguous file into a fragmented file due to ZFS's copy-on-write design.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
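Related to the 128K default block size mentioned above: for workloads that rewrite files in place (databases, for example), the per-dataset recordsize property can be matched to the application's I/O size to limit read-modify-write and fragmentation. A hedged example, with made-up dataset names:

    # Inspect the current value (128K by default).
    zfs get recordsize tank/db

    # Match the dataset's block size to an 8K database page size before
    # loading data; the property only affects newly written blocks.
    zfs set recordsize=8k tank/db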
On Sun, Feb 15, 2009 at 8:02 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Sun, 15 Feb 2009, Colin Raven wrote:
>> Pardon me for jumping into this discussion. [...]
>
> ZFS is "fragmented by design". Regardless, it takes steps to minimize
> fragmentation, and the costs of fragmentation. [...]
>
> The most important thing is to never operate your pool close to 100% full.
> Always leave a reserve so that ZFS can use reasonable block allocation
> policies and is not forced to allocate blocks in a way which causes an
> additional performance penalty. Installing more RAM in the system is
> likely to decrease fragmentation, since then ZFS can defer writes longer
> and make better choices about where to put the data.
>
> Updating already-written portions of files "in place" will convert a
> completely contiguous file into a fragmented file due to ZFS's
> copy-on-write design.

Thank you for a most lucid and readily understandable explanation. I shall now return to the sidelines... hoping to have a ZFS box up and running sometime in the near future, when budget and time permit. Keeping up with this list is helpful in anticipation of that time arriving.
On Sun, 15 Feb 2009, Colin Raven wrote:
>> As a follow-up: is there any ongoing sensible way to defend against the
>> dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?
>> Forgive the "silly questions" from the sidelines... ignorance knows no
>> bounds, apparently :)

There is no "defrag" facility, since the ZFS pool design naturally controls the degree of fragmentation. ZFS filesystems don't tend to "fall apart" like Windows FAT or NTFS.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sendai,

On Fri, Feb 13, 2009 at 03:21:25PM -0800, Andras Spitzer wrote:
> Hi,
>
> When I read the ZFS manual, it usually recommends configuring redundancy
> at the ZFS layer, mainly because there are features that only work with a
> redundant configuration (such as correction of corrupted data), and it
> also implies that overall robustness will improve.
>
> My question is simple: what is the recommended configuration on a SAN (on
> high-end EMC, like the Symmetrix DMX series, for example) where redundancy
> is usually configured at the array level, so most likely we would use a
> simple ZFS layout, without redundancy?

From my experience, this is a bad idea. I have seen a couple of cases with such a config (no redundancy at the ZFS level) where the connection between the HBA and the storage was flaky, and there was no way for ZFS to recover. I agree that MPxIO or any other multipathing handles failure of links, but that in itself is not sufficient.

Thanks and regards,
Sanjeev

--
----------------
Sanjeev Bagewadi
Solaris RPE
Bangalore, India
On Mon, Feb 16, 2009 at 9:11 AM, Sanjeev <sanjeev.bagewadi at sun.com> wrote:
> Sendai,
>
> On Fri, Feb 13, 2009 at 03:21:25PM -0800, Andras Spitzer wrote:
>> When I read the ZFS manual, it usually recommends configuring redundancy
>> at the ZFS layer, mainly because there are features that only work with a
>> redundant configuration (such as correction of corrupted data), and it
>> also implies that overall robustness will improve.
>>
>> My question is simple: what is the recommended configuration on a SAN (on
>> high-end EMC, like the Symmetrix DMX series, for example) where redundancy
>> is usually configured at the array level, so most likely we would use a
>> simple ZFS layout, without redundancy?
>
> From my experience, this is a bad idea. I have seen a couple of cases with
> such a config (no redundancy at the ZFS level) where the connection between
> the HBA and the storage was flaky, and there was no way for ZFS to recover.
> I agree that MPxIO or any other multipathing handles failure of links, but
> that in itself is not sufficient.

So what would you recommend then, Sanjeev?
- multiple ZFS pools running on a SAN?
- an S10 box or boxes that provide ZFS-backed iSCSI?

-- Sriram
Sriram,

On Mon, Feb 16, 2009 at 11:12:42AM +0530, Sriram Narayanan wrote:
> On Mon, Feb 16, 2009 at 9:11 AM, Sanjeev <sanjeev.bagewadi at sun.com> wrote:
>> From my experience, this is a bad idea. I have seen a couple of cases with
>> such a config (no redundancy at the ZFS level) where the connection between
>> the HBA and the storage was flaky, and there was no way for ZFS to recover.
>> I agree that MPxIO or any other multipathing handles failure of links, but
>> that in itself is not sufficient.
>
> So what would you recommend then, Sanjeev?
> - multiple ZFS pools running on a SAN?

That's fine. What I meant was that you need to have redundancy at the ZFS level.

> - an S10 box or boxes that provide ZFS-backed iSCSI?

That should be fine as well.

The point of discussion was whether we should have redundancy at the ZFS level, and the answer is yes.

Thanks and regards,
Sanjeev

--
----------------
Sanjeev Bagewadi
Solaris RPE
Bangalore, India
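A minimal sketch of adding that ZFS-level redundancy to an existing single-LUN pool, assuming a second LUN of at least the same size can be presented from another array (pool and device names hypothetical):

    # Attach a second LUN to the existing top-level device, converting the
    # plain vdev into a two-way mirror; ZFS resilvers the new side.
    zpool attach tank c4t0d0 c5t0d0
    zpool status tank    # wait for the resilver to complete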
Hello Bob,

Sunday, February 15, 2009, 9:42:25 PM, you wrote:

BF> On Sun, 15 Feb 2009, Colin Raven wrote:
>>> As a follow-up: is there any ongoing sensible way to defend against the
>>> dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?
>>> Forgive the "silly questions" from the sidelines... ignorance knows no
>>> bounds, apparently :)

BF> There is no "defrag" facility, since the ZFS pool design naturally
BF> controls the degree of fragmentation. ZFS filesystems don't tend to
BF> "fall apart" like Windows FAT or NTFS.

Well...

--
Best regards,
Robert Milkowski
http://milek.blogspot.com
On Mon, Feb 16, 2009 at 12:26 AM, Sanjeev <Sanjeev.Bagewadi at sun.com> wrote:
> Sriram,
>
> On Mon, Feb 16, 2009 at 11:12:42AM +0530, Sriram Narayanan wrote:
>> So what would you recommend then, Sanjeev?
>> - multiple ZFS pools running on a SAN?
>
> That's fine. What I meant was that you need to have redundancy at the ZFS
> level.
>
>> - an S10 box or boxes that provide ZFS-backed iSCSI?
>
> That should be fine as well.
>
> The point of discussion was whether we should have redundancy at the ZFS
> level, and the answer is yes.

Uhhh, an S10 box that provides ZFS-backed iSCSI is NOT fine. Cite the plethora of examples on this list of how the fault management stack takes so long to respond that it's basically unusable as it stands today.

--Tim
>>>>> "t" == Tim <tim at tcsac.net> writes:

    t> Uhhh, an S10 box that provides ZFS-backed iSCSI is NOT fine. Cite
    t> the plethora of examples on this list of how the fault
    t> management stack takes so long to respond that it's basically
    t> unusable as it stands today.

Well... if we are talking about reliability (whether or not you lose the whole pool when some network element or disk target reboots), that's separate from availability (do your final applications experience glitches and outages, or are they insulated from failures that happen far enough underneath the layering?). The insulating right now seems pretty poor compared to other SAN products, but that's not the same problem as the single-LUN reliability issue we've been discussing recently.

If you make a zpool mirror out of two S10 boxes providing iSCSI instead of one iSCSI target, that is better. If you make a zpool from one S10 box providing iSCSI, it does not matter whether the iSCSI target software is serving the one LUN from a zvol or a disk or an SVM slice; it is not fine for reliability to have a single-LUN iSCSI vdev. You must have ZFS-layer redundancy on the overall pool, above the iSCSI.

If you change from:

     client NFS
         |
   +-----------------+             +-------------+
   |                 |             |  SAN        |
   |   NFS    ZFS    |----lun0-----|  iSCSI/FC   |
   |                 |             |             |
   +-----------------+             +-------------+
                                      |      |      |
                             +--------++--------++--------+
                             |  disk  ||  disk  ||  disk  |
                             |  shelf ||  shelf ||  shelf |
                             |        ||        ||        |
                             +--------++--------++--------+

to using a non-ZFS filesystem above the iSCSI:

   [client | notZFS] -------+---------------+---------------+
                            |               |               |
                   +--------------+ +--------------+ +--------------+
                   | iSCSI target | | iSCSI target | | iSCSI target |
                   |     ZFS      | |     ZFS      | |     ZFS      |
                   |  local disk  | |  local disk  | |  local disk  |
                   +--------------+ +--------------+ +--------------+

then that is probably more okay. But for me the appeal of ZFS is to aggregate spread-out storage into really huge pools. Maybe it's hard to find a good notZFS to use in that diagram.

My understanding so far is, this is also better:

     client NFS
         |
   +-----------------+---lun0----+-------------+
   |                 |           |  SAN        |
   |   NFS    ZFS    |---lun1----|  iSCSI/FC   |
   |                 |           |             |
   +-----------------+           +-------------+
                                    |      |      |
                           +--------++--------++--------+
                           |  disk  ||  disk  ||  disk  |
                           |  shelf ||  shelf ||  shelf |
                           |        ||        ||        |
                           +--------++--------++--------+

where lun0 and lun1 make up a mirrored vdev in ZFS. It does not matter if lun0 and lun1 are carried on the same physical cable or connected to the same SAN controller, but obviously they DO need to be backed by separate storage, so you're wasting a lot of disk to do this. Even if some maintenance event reboots lun0 and lun1 at the same time, my understanding so far is that people have found the configuration above less likely to lose whole pools than running on a single LUN. For example, see the thread including message-id <B5EB902A18810E43800F26A02DB4279A2E6F13FE at CNJEXCHANGE.composers.caxton.com> around 2008-08-13.

Another alternative is to go ahead and use single-LUN vdevs, but have the data mirrored across multiple zpools using zfs send | zfs recv, or rsync, in case you lose a whole pool. That's appealing to me because the backup pool can be built from cheaper, slower pieces than the main pool, instead of burning up double the amount of expensive main-pool storage; it protects from problems/bugs/mistakes that a vdev mirror does not; and it allows changing pool geometry and removing slogs and stuff.
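For that last alternative (replicating to a second, cheaper pool rather than mirroring vdevs), a hedged sketch with hypothetical pool, dataset, and snapshot names looks something like:

    # Initial full copy to the backup pool.
    zfs snapshot tank/data@mon
    zfs send tank/data@mon | zfs recv backup/data

    # Later, ship only the changes since the previous snapshot.
    zfs snapshot tank/data@tue
    zfs send -i tank/data@mon tank/data@tue | zfs recv backup/data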
The downside to planning on restoring from backup is that, if you lose your single-LUN pool relatively often, these big, heavily-aggregated pools can take days to restore. It's a type of offline maintenance, which is always against the original ZFS kool-aid philosophy, because offline maintenance puts a de-facto cap on maximum pool size.
Hi all,

Ok, this might stir some things up again, but I would like to make this more clear. I have been reading this and other threads regarding ZFS on SAN and how well ZFS can recover from a serious error, such as a cached disk array going down or the connection to the SAN being lost. What I am hearing (Miles, ZFS-8000-72) is that sometimes you can end up in an unrecoverable state that forces you to restore the whole pool.

I have been operating quite large deployments of SVM/UFS and VxFS/VxVM for some years, and while you are sometimes forced to do a filesystem check and some files might end up in lost+found, I have never lost a whole filesystem. This is despite whole arrays crashing, split-brain scenarios, etc. In the previous discussion a lot of fingers were pointed at hardware and USB connections, but then some people mentioned losing pools located on a SAN in this thread.

We are currently evaluating whether we should begin to implement ZFS in our SAN. I can see great opportunities with ZFS, but if we have a higher risk of losing entire pools, that is a serious issue. I am aware that the other filesystems might not be in a correct state after a serious failure, but as stated before, that can be much better than restoring a multi-terabyte filesystem from yesterday's backup.

So, what is the opinion: is this an existing problem even when using enterprise arrays? If I understand this correctly, there should be no risk of losing an entire pool if DKIOCFLUSHWRITECACHE is honored by the array? If it is a problem, will the worst-case scenario be at least on par with UFS/VxFS once 6667683 is fixed?

Grateful for any additional information.

Regards
Henrik Johansson
http://sparcv9.blogspot.com
On Tue, 17 Feb 2009, Henrik Johansson wrote:
> We are currently evaluating whether we should begin to implement ZFS in
> our SAN. I can see great opportunities with ZFS, but if we have a higher
> risk of losing entire pools, that is a serious issue. I am aware that the
> other filesystems might not be in a correct state after a serious failure,
> but as stated before, that can be much better than restoring a
> multi-terabyte filesystem from yesterday's backup.

It is not clear that the risk of losing the entire pool is higher than with other filesystem types. This is a point of considerable conjecture, with no failure data to base statistics on. What is clear is that ZFS allows you to easily build much, much larger pools than other filesystem types do.

A 12-disk pool that I built a year ago is still working fine with absolutely no problems at all. Another two-disk pool built using cheap large USB drives has been running for maybe eight months, with no problems.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
bfriesen at simple.dallas.tx.us said:
> A 12-disk pool that I built a year ago is still working fine with
> absolutely no problems at all. Another two-disk pool built using cheap
> large USB drives has been running for maybe eight months, with no problems.

We have non-redundant ZFS pools on an HDS 9520V array, and also on a Sun 6120 array, some of them running for two years now (S10U3, S10U4, S10U5, both SPARC and x86), up to 4 TB in size. We have experienced SAN zoning mistakes, complete power loss to arrays, servers, and/or SAN switches, etc., with no pool corruption or data loss. We have not seen even one block checksum error detected by ZFS on these arrays (we have seen one such error on our X4500 in the past 6 months).

Note that the only available pool failure mode in the presence of a SAN I/O error for these OSes has been to panic/reboot, but so far when the systems have come back, the data has been fine. We also do tape backups of these pools, of course.

Regards,

--
Marion Hakanson <hakansom at ohsu.edu>
OHSU Advanced Computing Center
On Tue, February 17, 2009 01:50, Marion Hakanson wrote:
> Note that the only available pool failure mode in the presence of a SAN
> I/O error for these OSes has been to panic/reboot, but so far when the
> systems have come back, the data has been fine. We also do tape backups
> of these pools, of course.

Starting with Solaris 10u6 (?), the following property is available in zpool(1M):

     failmode=wait | continue | panic

         Controls the system behavior in the event of catastrophic
         pool failure. This condition is typically a result of a
         loss of connectivity to the underlying storage device(s)
         or a failure of all devices within the pool. The behavior
         of such an event is determined as follows:

         wait        Blocks all I/O access until the device
                     connectivity is recovered and the errors are
                     cleared. This is the default behavior.

         continue    Returns EIO to any new write I/O requests but
                     allows reads to any of the remaining healthy
                     devices. Any write requests that have yet to
                     be committed to disk would be blocked.

         panic       Prints out a message to the console and
                     generates a system crash dump.
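The property is set and inspected like any other pool property; a hedged example with a hypothetical pool name:

    # Let reads keep working and fail new writes with EIO, instead of
    # blocking all I/O, when the SAN path goes away.
    zpool set failmode=continue tank

    # Confirm the current setting.
    zpool get failmode tank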
Hi All,

I have been watching this thread for a while and thought it was time I chipped in my 2 cents' worth. I have been an aggressive adopter of ZFS here across all of our Solaris systems and have found the benefits have far outweighed any small issues that have arisen.

Currently I have many systems with LUNs provided from SAN-based storage for zpools. All our systems are configured with mirrored vdevs, and the reliability factor has been as good as, if not greater than, UFS and LVM. My rules of thumb tend to stem from getting the storage infrastructure right, as that generally leads to the best availability. To this end, every single SAN-attached system has dual paths to separate switches, and every array has dual controllers dual-pathed to different switches. ZFS may be more or less susceptible to any physical infrastructure problem, but in my experience it is on a par with UFS (and I gave up shelling out for VxFS long ago).

The reason for the above configuration is that our storage is evenly split between two sites, with dark fibre between them across redundant routes. This forms a ring configuration which is around 5 km around. We have so much storage that we need this in case of a data center catastrophe. The business recognizes that the time-to-recovery risk would be so great that if we didn't have it, we would be out of business in the event of one of our data centres burning down or some other natural disaster.

I have seen other people discussing power availability on other threads recently. If you want it, you can have it. You just need the business case for it. I don't buy the comments on UPS unreliability.

Quite frequently I have rebooted arrays and removed them from mirrored vdevs and have not had any issues with the LUNs they provided reattaching and resilvering. Scrubs on the pools have always been successful. The largest single mirrored pool is around 11 TB, which is formed from two 6140 RAID5s. We also use Loki boxes for very large storage pools which are routinely filled (I was a beta tester for Loki). I have two J4500s, one with 48 x 250 GB and one with 48 x 1 TB drives. No issues there either. The 48 x 1 TB one is used in a disk-to-disk-to-tape config with an SL500 to back up our entire site. It is routinely filled to the brim and it performs admirably attached to a T5220 which is 10-gig attached.

The systems I have mentioned range from Samba servers to compliance archives, Oracle DB servers, Blackboard content stores, Squid web caches, LDAP directory servers, mail stores, mail spools, and calendar server DBs. The list covers 60-plus systems. I have 0% Solaris older than Solaris 10. Why would you?

In short, I hope people don't hold back from adoption of ZFS because they are unsure about it. Judge for yourself, as I have done, and dip your toes in at whatever rate you are happy with. That's what I did.

/Scott.

I also use it at home, with an old D1000 attached to a V120 with 8 x 320 GB SCSI disks in a raidz2, for all our home data and home business (which is a printing outfit that creates a lot of very big files on our Macs).
--
_______________________________________________________________________

Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006, Manukau City, Auckland, New Zealand

Phone  : +64 09 968 7611
Fax    : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:scott at manukau.ac.nz
http://www.manukau.ac.nz
________________________________________________________________________

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
________________________________________________________________________
>>>>> "hj" == Henrik Johansson <henrikj at henkis.net> writes:

    hj> I have been operating quite large deployments of SVM/UFS and
    hj> VxFS/VxVM for some years, and while you are sometimes forced to
    hj> do a filesystem check and some files might end up in
    hj> lost+found, I have never lost a whole filesystem.

I think in the world we want, even with the other filesystems, the SAN fabric or array controller or disk shelf should be able to reboot without causing any files to show up in lost+found, or requiring anything other than the normal log roll-forward. I bet there are rampant misimplementations.

Maybe the whole SAN situation is ubiquitously misthought, because filesystem designers build things assuming that whenever anything ``crashes,'' the kernel and their own code will go down too. They invent a clever way to handle a non-SAN cord-yanking, test it, and yup, you can yank the cord and it works fine. But this isn't the way things actually fail. In the diagram below the disk loses power, but the host, SAN, and controller don't. I doubt this is too common. Probably I should redo diagrams like this after better understanding the disk command set and iSCSI tagged commands and stuff, for other parts of the stack rebooting, like the SAN or the controller.

  filesystem   initiator    SAN      controller   disk buffer   platter

  [...earlier writes not shown...]

  SYNC ------..
  |              ---------..
  time                      -----------..
  |                                      -------------
  v
  write(A)
  write(B)
  write(C)
                                        ..-------------
                            ..-----------
                 ..---------
  success ------   good.  A-C are on the platter.
                   commit ueberblock(D).
  write(D) -----..
                 ---------..
  write(E) -----..
                            -----------..
                 --------..
                                         ------------   [D]
  write(F) -----..
                            -----------..
                 -------..
                                         -------------  [E]
  write(G) -----..
                            -----------..
                          =======POWER FAILURE=======
                 -------..
                                         --------------  poof... [F] gone
                            -----------..
                                          XXXX no
                                      ..XXXX disk
                           ..-----------
                ..-------
  ERROR(G) <----  oh no! couldn't write G.  increment error counter
                          =======POWER RESTORED=======
  retry write(G) -----..
                 -------..
  SYNC -----..
                            -----------..
                 -------..
                                         --------------  [G]
                            -----------..
                                         --------------
  write(G)                              ..--------------
                           ..------------
                ..-------
  success -----   good.  that means D-G are on the platter.
                  commit ueberblock(H)
  write(H)   <-- DANGER, Will Robinson.

Writes D - F were lost in this ``event,'' and the filesystem has no idea.

If ===POWER FAILURE=== applied to the filesystem and the disk at the same time, then this problem would not exist---the way we are using SYNC here would be enough to stop H from being written---so power failures for non-SAN setups are safe from this.

Also, if we treat the disk as bad the moment it says ``write failure,'' and the array controller decides ``this disk is bad, forever'': if, the instant the disk loses power and times out write F, the controller considers the disk's entire contents lost and does not bother reading ANYthing from it until it's been resilvered by the other disks in the RAID set, then we also do not have this problem. So power failures on an SVM mirror with no understanding of the overlying filesystem are okay.

Using naked UFS or ext3 or whatever over a SAN still has this problem, I think. Those filesystems are just better at losing some data but not the whole filesystem, compared to ZFS. I think ZFS attempts to be smarter than SVM, and is also more broadly ambitious than one power supply all in one box, but is probably not smart enough to finish the job. Rather than just more UFS/VxFS-style robustness, I'd like to see the job finished and this SAN write hole closed up.
It's important to accept that nothing is broken in this event. It's just a yanked power cord. I won't accept ``a device failed, and you didn't have enough redundancy, so all bets are off. You must feed ZFS more redundancy. You expect the impossible.'' No, that argument is bullshit. Losing power unexpectedly is not the same as a device failure---unexpected power loss is part of the overall state diagram of a normal, working storage system.

    hj> We are currently evaluating whether we should begin to implement
    hj> ZFS in our SAN. I can see great opportunities with ZFS but if
    hj> we have a higher risk of losing entire pools

Optimistically, the ueberblock rollback will make ZFS like the other filesystems, though maybe faster to recover. If you are tied to stable Solaris it'll probably take something like a year before you get your hands on it, but so far I think everyone agrees it's promising.

I think it's not enough, though. If the problem is that a batch of writes was lost, then a trick to recover the pool still won't recover those lost writes, and you promised applications those writes were on the disk. Databases and filesystems inside zvols could still become corrupt. What this really means is that using SANs makes corruption in general more likely.

I think we sysadmins should start using some tiny 10-line programs to test the SANs and figure out what's wrong with them. I think in the end we will need about two things to fix it:

 * Some kind of commit/replay feature in iSCSI and FC initiators, or else the same feature implemented in the filesystems right above them but cooperating with the initiators pretty intimately. Gigabytes of write data could be ``in flight''---we are talking about however much data is between the return of a first SYNCHRONIZE CACHE command and the next one---so it'd be good to arrange that it not be buffered two or three or four times, which may require layer-violating cooperation. I'm all but certain nobody's doing this now.

   - Is it in the initiator? Commit/replay in the initiator would mean the initiator issues SYNCHRONIZE CACHE commands for itself, ones not demanded by the filesystem above it, whenever its replay write cache gets too large. I've never heard of that, and I don't think anyone would put up with an iSCSI/FC initiator burning up gigabytes of RAM without an explanation, which would mean that I'd hear about it and be worried about tuning it.

   - Is it in the filesystem? Any filesystem designed before SANs will expect to eventually get a successful return from any SYNCHRONIZE CACHE command it passes to storage. A failed SYNC will happen in the form of someone yanking the cord, so the filesystem code will never see the failure, because it won't be executing any longer. UFS and ext3 don't even bother to issue SYNCHRONIZE CACHE at all, much less pay attention to its return value and buffer writes so they can be replayed if it fails, so I doubt they have an exception path for a failed SYNC command.

     Putting replay in the filesystem also means that if the iSCSI initiator notices the target bounce, then it MUST warn the layers above that writes were lost, for example by waiting for the next SYNCHRONIZE CACHE command to come along and deliberately returning it failed without consulting the target, even though the LUN would say it succeeded if it were issued. I've never heard of anything like this.

 * Pay some attention to what happens to ZFS when a SAN controller reboots, separately with each 'failmode' setting.
To maintain correctness with NFS clients the zpool is serving, or with replicated/tiered database applications where the DBMS app is keeping several nodes in sync, ZFS may need a failmode=umount that kills any app with outstanding writes on a failed pool and un-NFS-exports all the pool's filesystems. The existing failmode=panic could probably be verified (and would likely have to be fixed) to provide the same level of correctness, but that would not be as good as the umount-and-kill, because it'd make HA and zones more antagonistic to each other by putting many zones at the mercy of the weakest pool on the system, which could even be a USB stick or something. That's the wrong direction to move.

I am not sure what failmode=continue and failmode=wait mean now, or what they should mean to fix this problem. It'd be nice if they meant what they claim to be:

    ``wait: use commit/replay schemes so that no writes are lost even if
      the SAN controller reboots. Apps should be frozen until they can
      be allowed to continue as if nothing went wrong.

      continue: fsync() returns -1 immediately for the first data that
      never made it to disk, and continues returning -1 until all writes
      issued up to now are on the platter, including writes that had to
      be replayed because of the reboot. Once fsync() has been called
      and has returned -1, all write() calls to that file must also fail
      because of the barrier. And once your app calls fsync() a second,
      third, fourth time and finally gets a 0 return from fsync(), it
      can be sure no data was lost.''

Of course all that seems optimistic beyond ridiculous, even for UFS and VxFS. But if implemented like that, panic and wait should both be safe for SAN outages; and continue we already understand to be unsafe, but implemented like this it becomes possible to write a cooperating app (like a database, or a user-mode iSCSI target app, for example) which is correct.

    hj> So, what is the opinion, is this an existing problem even when
    hj> using enterprise arrays? If I understand this correctly, there
    hj> should be no risk of losing an entire pool if
    hj> DKIOCFLUSHWRITECACHE is honored by the array?

No, the timing diagram I showed explains how I think data might still be lost during a SAN reboot, even for a SAN which respects cache flushes. But all this is pretty speculative for now.
On 17-Feb-09, at 3:01 PM, Scott Lawson wrote:
> Hi All,
> ...
> I have seen other people discussing power availability on other threads
> recently. If you want it, you can have it. You just need the business
> case for it. I don't buy the comments on UPS unreliability.

Hi,

I remarked on it. FWIW, my experience is that commercial data centres do not avoid 'unscheduled outages', no matter how many steely-eyed assurances they give. It seems rather imprudent to assume that power is never going to fail.

No matter how many diesel generators, rooftop tanks, or pebble-bed reactors you have, somebody is inevitably going to kick out a plug... at least in most of the real world.

--Toby
On Feb 17, 2009, at 21:35, Scott Lawson wrote:
> Everything we have has dual power supplies, fed from dual power rails,
> fed from separate switchboards, through separate very large UPSes,
> backed by generators, fed by two substations, and then cloned to
> another data center 3 km away. HA

http://www.geonet.org.nz/earthquake/quakes/recent_quakes.html

;)

> I am far, far more worried about someone with root access typing
> 'zpool destroy' than I am worried about the lights going out in the
> data centers I designed that house hundreds and hundreds of servers. ;)

Yeah, this is probably more likely.
Toby Thain wrote:
> On 17-Feb-09, at 3:01 PM, Scott Lawson wrote:
>> Hi All,
>> ...
>> I have seen other people discussing power availability on other threads
>> recently. If you want it, you can have it. You just need the business
>> case for it. I don't buy the comments on UPS unreliability.
>
> Hi,
>
> I remarked on it. FWIW, my experience is that commercial data centres
> do not avoid 'unscheduled outages', no matter how many steely-eyed
> assurances they give. It seems rather imprudent to assume that power
> is never going to fail.
>
> No matter how many diesel generators, rooftop tanks, or pebble-bed
> reactors you have, somebody is inevitably going to kick out a plug...
> at least in most of the real world.
>
> --Toby

That's why you have two plugs, if not more. I still don't buy your argument. It comes down to procedural issues on the site when it comes to people kicking plugs out. Everything we have has dual power supplies, fed from dual power rails, fed from separate switchboards, through separate very large UPSes, backed by generators, fed by two substations, and then cloned to another data center 3 km away. HA is all about design. (I won't even comment on further up the stack than electricity.)

We have secure data centers with strict practices of work and qualified staff following best practice for maintenance and risk management around maintenance.

I am far, far more worried about someone with root access typing 'zpool destroy' than I am worried about the lights going out in the data centers I designed that house hundreds and hundreds of servers. ;) And no, we don't have unplanned outages. Not in a long time. Not all people that design data centers know how to design power systems for them. Sometimes the IT people don't convey their requirements exactly enough to the electrical engineers. (I am an electrical engineer who got sidetracked by SunOS around '91 and never went back.)

Anyway, we diverge, I think. Maybe we can agree to disagree? Back to discussions about disk caddies and overpriced hardware... slightly closer to the topic at hand... ;)

/Scott
On 17-Feb-09, at 9:35 PM, Scott Lawson wrote:
> Toby Thain wrote:
>> I remarked on it. FWIW, my experience is that commercial data centres
>> do not avoid 'unscheduled outages', no matter how many steely-eyed
>> assurances they give. It seems rather imprudent to assume that power
>> is never going to fail.
>>
>> No matter how many diesel generators, rooftop tanks, or pebble-bed
>> reactors you have, somebody is inevitably going to kick out a plug...
>> at least in most of the real world.
>
> That's why you have two plugs, if not more. I still don't buy your
> argument. It comes down to procedural issues on the site when it comes
> to people kicking plugs out. Everything we have has dual power supplies,
> fed from dual power rails, fed from separate switchboards, through
> separate very large UPSes, backed by generators, fed by two substations,
> and then cloned to another data center 3 km away. HA is all about design.
> (I won't even comment on further up the stack than electricity.)
>
> We have secure data centers with strict practices of work and qualified
> staff following best practice for maintenance and risk management around
> maintenance.
>
> I am far, far more worried about someone with root access typing
> 'zpool destroy' than I am worried about the lights going out in the
> data centers I designed that house hundreds and hundreds of servers. ;)
> And no, we don't have unplanned outages. Not in a long time. Not all
> people that design data centers know how to design power systems for
> them. Sometimes the IT people don't convey their requirements exactly
> enough to the electrical engineers. (I am an electrical engineer who got
> sidetracked by SunOS around '91 and never went back.)
>
> Anyway, we diverge, I think. Maybe we can agree to disagree?

Not at all. You've convinced me. Your servers will never, ever lose power unexpectedly.

--Toby

> Back to discussions about disk caddies and overpriced hardware...
> slightly closer to the topic at hand... ;)
David Magda wrote:
> On Feb 17, 2009, at 21:35, Scott Lawson wrote:
>> Everything we have has dual power supplies, fed from dual power rails,
>> fed from separate switchboards, through separate very large UPSs,
>> backed by generators, fed by two substations, and then cloned to
>> another data center 3 km away. HA
>
> http://www.geonet.org.nz/earthquake/quakes/recent_quakes.html

Ha. Yeah, that's why we were once known to the British as "The Shaky
Isles". We do have lots of earthquakes around the Pacific Rim. We are in
Auckland, however, which is north of all those little stars on the pic
marking where the edge of the Pacific plate meets the Australian plate,
so there are not too many earthquakes to worry about in Auckland compared
to the rest of NZ. Although one of the data centers I built recently was
on the second floor of a building and had to be earthquake restrained,
because we were potentially going to be creating up to one-ton point
loads on the floor.

The rest of NZ gets little and biggish quakes fairly often, so much so
that the Aussies next door on the west island see fit to warn their
citizens about the potential for earthquakes in NZ when visiting... Our
capital city, Wellington, on the other hand, is built on fault lines...
think San Francisco...

Now a volcano in Auckland might be a different story... We have over 50
dormant cones and a whopping big one in the main harbor called
"Rangitoto". Translated from the Maori it means "Blood Sky". My UPSs
won't protect against that one...

> ;)
>
>> I am far, far more worried about someone with root access typing
>> 'zpool destroy' than I am about the lights going out in the data
>> centers I designed that house hundreds and hundreds of servers. ;)
>
> Yeah, this is probably more likely.

--
Scott Lawson
Systems Architect
Manukau Institute of Technology
Toby Thain wrote:
> Not at all. You've convinced me. Your servers will never, ever lose
> power unexpectedly.

Methinks living in Auckland has something to do with that :-)
http://en.wikipedia.org/wiki/1998_Auckland_power_crisis

When services are reliable, complacency becomes the risk. My favorite
recent example is the levees in New Orleans: Katrina didn't top the
levees, they were undermined.
 -- richard
Hi Andras,

No problem writing directly. Answers inline below. (If there are any
typos, it's because it's late and I have had a very long day ;))

andras spitzer wrote:
> Scott,
>
> Sorry for writing you directly, but most likely you missed my questions
> regarding your SW design; whenever you have time, would you reply to
> them? I really value your comments and appreciate them, as it seems you
> have great experience with ZFS in a professional environment, and that
> is not so common today.
>
> That was my e-mail, in response to yours (it's in the thread):
>
> "Scott,
>
> That is an awesome reference you wrote. I totally understand and agree
> with your idea of having everything redundant (dual path, redundant
> switches, dual controllers) in the SAN infrastructure. I have some
> questions about the SW design you use, if you don't mind.
>
> - are you using MPxIO as DMP?

Yes, configured via 'stmsboot' (there's a rough command sketch further
down). I have used Sun MPxIO for quite a few years now and have found it
works well (it was the SAN Foundation Kit for many years).

> - as I understood from your e-mail, all of your ZFS pools are ZFS
> mirrored? (you don't have a non-redundant ZFS configuration)

Certainly the ones built from SAN-based disk. No, there are no
non-redundant ZFS configurations; all storage is doubled up. Expensive,
but we tend to stick to modular storage for this and spread the cost over
many years. The storage budget is at least 50% of the systems group
infrastructure budget. There are many other ZFS file systems which aren't
SAN attached and are in mirrors, RAIDZs, etc. I mentioned the Lokis, aka
J4500s, which are in RAIDZs. Very nice, and they have worked very
reliably so far. I would strongly advocate these units for ZFS if you
want a lot of disk reasonably cheaply that performs well...

> - why did you decide to use ZFS mirror instead of ZFS raidz or raidz2?

Because we already have hardware-based RAID5 from our arrays (Sun 3510,
3511, 6140s). The ZFS file systems are used mostly for mirroring
purposes, but also to take advantage of the other nice things ZFS brings,
like snapshots, cloning, clone promotion, etc.

> - you have RAID5-protected LUNs from the SAN, and you put a ZFS mirror
> on top of them?

Yes. Covered above, I think.

> Could you please share some details about your configuration regarding
> SAN redundancy vs ZFS redundancy (I guess you use both here), and also
> some background on why you decided to go with that?

We have been doing it for many years, not just with ZFS but with UFS and
VxFS as well, and also on quite a large number of NTFS machines. We have
two geographically separate data centers a few kilometers apart, with
redundant dark fibre links over different routes. All core switches are
in a full mesh with two cores per site, each with a redundant connection
to the two cores at the other site, one via each route.

We believe strongly that storage is the key to our business. Servers are
just processing to work the data and are far easier to replace. We tend
to standardize on particular models and then buy a bunch of them, and not
necessarily maintenance for them.

There are a lot of key things to building a reliable data center. I have
been having a lively discussion on this with Toby and Richard, which has
raised some interesting points. I do firmly believe in getting things
right from the ground up: I start with power and environment. Storage
comes next in my book.
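To pull the MPxIO and mirroring answers above together, here is a rough
sketch of what that looks like from the command line. It is illustrative
only; the pool name 'sanpool' and the cXtYd0 device names are made-up
placeholders, not our actual LUNs.

  # Enable MPxIO multipathing; stmsboot prompts for a reboot so the new
  # scsi_vhci device paths can take effect.
  stmsboot -e

  # After the reboot, list the mapping from the old device names to the
  # new multipathed ones.
  stmsboot -L

  # Create the pool as a ZFS mirror of two RAID5 LUNs, one presented
  # from each array, so ZFS always has a second copy to self-heal from.
  zpool create sanpool mirror c2t6000AAAAd0 c3t6000BBBBd0

  # Confirm both halves of the mirror are online.
  zpool status sanpool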
> Regards,
> sendai"
>
> One point I'm really interested in is that it seems you deploy ZFS with
> ZFS mirror even when you have RAID redundancy at the HW/SAN level,
> which obviously means extra cost for you. I'm looking for a fairly
> decisive opinion on whether it is safe to use a ZFS configuration
> without redundancy when you have RAID redundancy in your high-end SAN,
> or whether you would still go with ZFS redundancy (ZFS mirror in your
> case, not even raidz or raidz2) because of the extra self-healing
> feature and the lowered risk of total pool failure?

I think this has also been covered in recent list posts. The important
thing is really to have two copies of every block if you wish to be able
to self-heal. The cost, I guess, depends on what value you place on the
availability and reliability of your data. ZFS mirrors are faster for
resilvering as well; much, much faster in my experience. We recently used
this during a data center move and rebuild: our SAN fabric was extended
to three sites, and we moved blocks of storage one piece at a time and
resynced them at the new location once they were in place, with zero
disruption to the business.

I do think the Fishworks gear is going to prove to be a game changer in
the near future for many people, as it will offer many of the features we
want in our storage. Once COMSTAR has been integrated into that line I
might buy some. (I have a large investment in fibre channel, and I don't
trust networking people as far as I can kick them when it comes to
understanding the potential problems that can arise from disconnecting
block targets that are coming in over Ethernet.)

> Also, if you could reply in the thread, so that everyone can read your
> experiences, that would be great!
>
> Regards,
> sendai
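P.S. For anyone following along, the mirror shuffle we used for the data
center move is just the standard ZFS attach/resilver/detach cycle. A
rough sketch follows; again, the pool and device names are made up for
illustration.

  # Attach a LUN presented from the new site to the existing device,
  # temporarily turning that vdev into a three-way mirror.
  zpool attach sanpool c2t6000AAAAd0 c4t6000CCCCd0

  # Watch the resilver; ZFS copies only allocated blocks rather than the
  # whole device, which is part of why this goes so quickly.
  zpool status -v sanpool

  # Once the resilver completes, drop the old side of the mirror.
  zpool detach sanpool c2t6000AAAAd0

  # Regular scrubs verify every checksum and repair silent errors from
  # the remaining good copy.
  zpool scrub sanpool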