ZFS fans,

I'm preparing some analyses on RAS for large JBOD systems such as the Sun Fire X4500 (aka Thumper). Since there are zillions of possible permutations, I need to limit the analyses to some common or desirable scenarios. Naturally, I'd like your opinions. I've already got a few scenarios in analysis, and I don't want to spoil the brain storming, so feel free to think outside of the box.

If you had 46 disks to deploy, what combinations would you use? Why?

Examples:
  46-way RAID-0 (I'll do this just to show why you shouldn't do this)
  22x2-way RAID-1+0 + 2 hot spares
  15x3-way RAID-Z2+0 + 1 hot spare
  ...

Because some people get all wrapped up with the controllers, assume 5 8-disk SATA controllers plus 1 6-disk controller. Note: the reliability of the controllers is much greater than the reliability of the disks, so the data availability and MTTDL analysis will be dominated by the disks themselves. In part, this is due to using SATA/SAS (point-to-point disk connections) rather than a parallel bus or FC-AL, where we would also have to worry about bus or loop common-cause failures.

I will be concentrating on data availability and MTTDL as two views of RAS. The intention is that the interesting combinations will also be analyzed for performance and we can complete a full performability analysis on them.

Thanks
 -- richard
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns. Using that configuration, a full-width stripe write should be a single operation for each controller.

In production, the application needs would probably dictate the resulting disk layout. If the application doesn't need tons of i/o, you could bind more disks together for larger luns...

On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:
> If you had 46 disks to deploy, what combinations would you use? Why?

-----
Gregory Shaw, IT Architect
ITCTO Group, Sun Microsystems Inc.
I agree with Greg - for ZFS, I'd recommend a larger number of raidz luns, with a smaller number of disks per LUN, up to 6 disks per raidz lun.

This will more closely align with performance best practices, so it would be cool to find common ground in terms of a sweet spot for performance and RAS.

/jim

Gregory Shaw wrote:
> To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns.
> Using that configuration, a full-width stripe write should be a
> single operation for each controller.
>
> In production, the application needs would probably dictate the
> resulting disk layout. If the application doesn't need tons of i/o,
> you could bind more disks together for larger luns...
[stirring the pot a little...]

Jim Mauro wrote:
> I agree with Greg - for ZFS, I'd recommend a larger number of raidz
> luns, with a smaller number of disks per LUN, up to 6 disks per raidz lun.

For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z or RAID-Z2. For 3-5 disks, RAID-Z2 offers better resiliency, even over split-disk RAID-1+0.

> This will more closely align with performance best practices, so it
> would be cool to find common ground in terms of a sweet spot for
> performance and RAS.

It is clear that a single 46-way RAID-Z or RAID-Z2 zpool won't be popular :-)
 -- richard
> For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
> or RAID-Z2.

Maybe I'm missing something, but it ought to be the other way around. With 6 disks, RAID-Z2 can tolerate any two disk failures, whereas for 3x2-way mirroring, of the (6 choose 2) = 6*5/2 = 15 possible two-disk failure scenarios, three of them are fatal.

Jeff
Jeff Bonwick wrote:
>> For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
>> or RAID-Z2.
>
> Maybe I'm missing something, but it ought to be the other way around.
> With 6 disks, RAID-Z2 can tolerate any two disk failures, whereas
> for 3x2-way mirroring, of the (6 choose 2) = 6*5/2 = 15 possible
> two-disk failure scenarios, three of them are fatal.

For the 6-disk case, with RAID-1+0 you get 27/64 surviving states versus 22/64 for RAID-Z2. This accounts for the cases where you could lose 3 disks and survive with RAID-1+0.
 -- richard
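Both counts (and Jeff's three fatal pairs out of fifteen) are easy to verify by brute force. A minimal Python sketch follows; the specific mirror pairing is an assumption, but any fixed pairing gives the same totals.

from itertools import product, combinations

DISKS = 6
PAIRS = [(0, 1), (2, 3), (4, 5)]   # assumed pairing for 3x2-way RAID-1+0

def mirror_survives(failed):
    # RAID-1+0 loses data only when both halves of some mirror pair fail
    return all(not (a in failed and b in failed) for a, b in PAIRS)

def raidz2_survives(failed):
    # a 6-disk RAID-Z2 group tolerates any two failures, no more
    return len(failed) <= 2

states = [frozenset(i for i in range(DISKS) if bits[i])
          for bits in product((0, 1), repeat=DISKS)]

print(sum(mirror_survives(s) for s in states), "of 64 states survive RAID-1+0")  # 27
print(sum(raidz2_survives(s) for s in states), "of 64 states survive RAID-Z2")   # 22

two_disk = [set(c) for c in combinations(range(DISKS), 2)]
fatal = sum(not mirror_survives(s) for s in two_disk)
print(fatal, "of", len(two_disk), "two-disk failures are fatal to RAID-1+0")     # 3 of 15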
On Jul 18, 2006, at 8:58, Richard Elling wrote:
> Jeff Bonwick wrote:
>>> For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
>>> or RAID-Z2.
>> Maybe I'm missing something, but it ought to be the other way around.
>> With 6 disks, RAID-Z2 can tolerate any two disk failures, whereas
>> for 3x2-way mirroring, of the (6 choose 2) = 6*5/2 = 15 possible
>> two-disk failure scenarios, three of them are fatal.
>
> For the 6-disk case, with RAID-1+0 you get 27/64 surviving states
> versus 22/64 for RAID-Z2. This accounts for the cases where you could
> lose 3 disks and survive with RAID-1+0.

It seems to me that a useful resiliency calculation must include the probability of the failures. Just because there are more potential failure states for RAID-Z doesn't mean, in practical terms at least, that it is less resilient. Yes, there are 3-disk failures that the 3x2 arrangement will survive and RAID-Z2 won't, but there are (as Jeff pointed out) three 2-disk failures that are fatal to 3x2. Three different 2-failure scenarios add up to a much more likely occurrence than the net five scenarios (all requiring three or more failures) that would be fatal to RAID-Z2 but not 3x2.

--Ed
more below...

Ed Gould wrote:
> It seems to me that a useful resiliency calculation must include the
> probability of the failures. Just because there are more potential
> failure states for RAID-Z doesn't mean, in practical terms at least,
> that it is less resilient.

A combinatorial resiliency analysis has no concept of time. To consider reliability, you need to consider time. Ergo, the combinatorial analysis is only suitable when the reliability of the components is the same, such as the JBOD disk case. As usual, we also do a large number of models for data availability and MTTDL which are based on reliability, recovery, spares, etc.

Nevertheless, there are some valid cases where the combinatorial analysis is particularly useful: those where you cannot service, or cannot service for long periods of time. As you would expect, those cases also tend towards triple-redundant (3-way RAID-1) designs.

It is worth noting that RAID-Z2 is more resilient than 2-way RAID-1 when the number of disks is <= 5, but not once the number of disks grows to 6 or beyond. This is in line with Roch's performance optimization observations, where we may recommend something like 2x6-way RAID-Z2 over 12-way RAID-Z2 for performance and resiliency, sacrificing space.
 -- richard
Richard Elling schrieb:
> For the 6-disk case, with RAID-1+0 you get 27/64 surviving states
> versus 22/64 for RAID-Z2. This accounts for the cases where you could
> lose 3 disks and survive with RAID-1+0.

I think this type of calculation is flawed. Disk failures are rare, and multiple disk failures at the same time are even more rare.

Let's do some other calculation:

1. Assume each disk's reliability is independent of the others.

For ease of calculation:

2. One week between disk failure and its replacement (including resilvering).
3. Failure rate of 1% per week for each disk.

Compare:
  a. 6-disk RAID-1+0
  b. 6-disk RAID-Z2

i. 1-disk failures have a probability of ~5.7% per week.

But more interesting:

ii. 2-disk failures: 0.14% probability per week
      a. fatal probability: 20%
      b. fatal probability: 0%

iii. 3-disk failures: 0.002% probability per week
      a. fatal probability: 60%
      b. fatal probability: 100%

The remaining probabilities become more and more unlikely.

In summary, the probability of a fatal loss:
  a. 0.14% * 20% + 0.002% * 60% = 0.03% per week
  b. 0.14% * 0% + 0.002% * 100% = 0.002% per week

Daniel
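Daniel's figures are straightforward to reproduce. Here is a short Python sketch under his stated assumptions (independent disks, 1% per-disk failure rate per week, one week to replace and resilver), which, like his summary, ignores the vanishingly rare 4-or-more-disk failure terms.

from math import comb

p = 0.01   # per-disk failure probability per week (Daniel's assumption)
n = 6      # disks in the group

def p_exactly(k):
    # binomial probability of exactly k failed disks in a week
    return comb(n, k) * p**k * (1 - p)**(n - k)

# fraction of k-disk failure patterns that are fatal to each 6-disk layout
fatal_mirror = {2: 3/15, 3: 12/20}   # 20% and 60%
fatal_raidz2 = {2: 0.0,  3: 1.0}     # survives any 2 failures, never 3

loss_mirror = sum(p_exactly(k) * f for k, f in fatal_mirror.items())
loss_raidz2 = sum(p_exactly(k) * f for k, f in fatal_raidz2.items())

print(f"1-disk failures: {p_exactly(1):.4%} per week")        # ~5.7%
print(f"2-disk failures: {p_exactly(2):.4%} per week")        # ~0.14%
print(f"3-disk failures: {p_exactly(3):.4%} per week")        # ~0.002%
print(f"fatal loss, 3x2 mirror: {loss_mirror:.4%} per week")  # ~0.03%
print(f"fatal loss, RAID-Z2:    {loss_raidz2:.4%} per week")  # ~0.002%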
On Jul 18, 2006, at 10:35 AM, Ed Gould wrote:
> It seems to me that a useful resiliency calculation must include
> the probability of the failures.

To add to Ed's comments:

It would be good to add serviceability to the picture as well. If we can detect a failed disk and fix it without impact, the probability of a multi-disk failure decreases.

When it comes to cascade failures (lots of disks going bad, whether that be real (bad disks) or perceived (i/o subsystem failures)), the only real solution is to use discrete disk solutions with separate power, controllers, etc.

It goes back to the ILM model. If the value of your data justifies multiple disk subsystems, so be it. If it doesn't, and you have a cascade failure, I hope your backups are intact.

-----
Gregory Shaw, IT Architect
ITCTO Group, Sun Microsystems Inc.
Daniel,

When we take into account time, the models we use are Markov models which consider the amount of space used, disk size, block and whole-disk failures, RAID scheme, recovery-from-tape time, spares, etc. All of these views of the system are being analyzed. Needless to say, with the zillions of permutations of RAID schemes possible with a system such as the Sun Fire X4500, we'll never model all of them. Hence my request for the popular configs, which I will model in detail.
 -- richard

Daniel Rock wrote:
> In summary:
>
> Probability for a fatal loss
> a. 0.14% * 20% + 0.002% * 60% = 0.03% per week
> b. 0.14% * 0% + 0.002% * 100% = 0.002% per week
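To give a flavor of the technique Richard mentions: below is a toy sketch, not the models he is describing (his include space used, block failures, spares, and so on), just the smallest possible example. It treats a 2-way mirror as a three-state absorbing Markov chain (both up -> one failed -> data lost), with made-up MTBF and MTTR figures, and reads MTTDL off as the expected time to absorption.

import numpy as np

mtbf_hours = 500_000.0   # assumed per-disk MTBF
mttr_hours = 24.0        # assumed replace-and-resilver time
lam = 1.0 / mtbf_hours
mu = 1.0 / mttr_hours

# Generator restricted to the transient states: 0 = both up, 1 = one failed.
# Off-diagonal entries are transition rates; each diagonal entry is minus the
# total exit rate, including the rate into the absorbing data-loss state.
Q = np.array([[-2 * lam, 2 * lam],
              [mu,       -(mu + lam)]])

# Expected time to absorption t satisfies Q @ t = -1 (standard CTMC result).
t = np.linalg.solve(Q, -np.ones(2))
print(f"MTTDL of a 2-way mirror: {t[0]:.3e} hours (~{t[0] / 8760:,.0f} years)")

# Closed form for comparison: MTTDL = (3*lam + mu) / (2*lam**2)
print(f"closed form:             {(3 * lam + mu) / (2 * lam**2):.3e} hours")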
On Tue, 18 Jul 2006, Daniel Rock wrote:
> I think this type of calculation is flawed. Disk failures are rare, and
> multiple disk failures at the same time are even more rare.

Stop right here! :) If you have a large number of identical disks which operate in the same environment[1], and possibly the same enclosure, it's quite likely that you'll see 2 or more disks die within the same, relatively short, timeframe.

Also, with today's higher-density disk enclosures, a fan failure which goes un-noticed for a period of time is likely to affect more than one drive - again leading to multiple disks failing in the same general timeframe.

This is also why I advocate having cold spares available - so that the probability of the spare failing within the same timeframe is greatly diminished.

[1] Same ambient temp, same power quality, same IOPS (load), same vibration, etc.

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
On Tue, Jul 18, 2006 at 10:37:35AM -0700, Richard Elling wrote:
> When we take into account time, the models we use are Markov models
> which consider the amount of space used, disk size, block and whole-disk
> failures, RAID scheme, recovery-from-tape time, spares, etc.
> Hence my request for the popular configs, which I will model in detail.

My two cents -

One thing I would pay attention to is the future world of native ZFS root. On a thumper, you only have two drives which are bootable from the BIOS. For any application in which reliability is important, you would have these two drives mirrored as your root filesystem. There can be no hot spares for this pool, because any device you hot-spare in will not be bootable from the BIOS.

So you should assume that for any RAID configuration, you're going to have a mirror of c3t0 and c3t4 (disks 0 and 1) as your root pool, with the remaining 46 disks available for user data. The loss of both root disks would imply that the system becomes unavailable, but wouldn't necessarily result in loss of user data. If the model supports distinguishing these two outcomes, it could potentially cover such things as motherboard failure or controller failure, which would bring the system down but would not result in loss of data.

A truly complete model would also take into account the loss of fans (thumper has 5x2 fans covering 12 rows of disks), though I doubt that anyone has any reliable data on the effect of running with only one fan in a redundant group.

For all the Thumper raidz2 models, I would assume only having 46 disks. This gives a nice bias towards one of the following configurations:

 - 5x(7+2), 1 hot spare, 21.0TB
 - 4x(9+2), 2 hot spares, 18.0TB
 - 6x(5+2), 4 hot spares, 15.0TB

The performance characteristics of these configurations would be equally interesting.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
On Tue, Jul 18, 2006 at 10:59:59AM -0700, Eric Schrock wrote:
> One thing I would pay attention to is the future world of native ZFS
> root. On a thumper, you only have two drives which are bootable from
> the BIOS. For any application in which reliability is important, you
> would have these two drives mirrored as your root filesystem. There can
> be no hot spares for this pool, because any device you hot-spare in will
> not be bootable from the BIOS.

Of course, now I went back and checked the original message and noted that you are only dealing with 46 disks, not 48. So you're one step ahead of me ;-)

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
On Tue, Jul 18, 2006 at 10:59:59AM -0700, Eric Schrock wrote:
> - 5x(7+2), 1 hot spare, 21.0TB
                          ^^^^^^
Sigh, that should also be '17.5TB', not 21TB.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
Al Hopper schrieb:
> Stop right here! :) If you have a large number of identical disks which
> operate in the same environment[1], and possibly the same enclosure, it's
> quite likely that you'll see 2 or more disks die within the same,
> relatively short, timeframe.

Not my experience. I work and have worked with several disk arrays (EMC, IBM, Sun, etc.) and the failure rates of individual disks were fairly random.

> Also, with today's higher-density disk enclosures, a fan failure which
> goes un-noticed for a period of time is likely to affect more than one
> drive - again leading to multiple disks failing in the same general
> timeframe.

Then make sure not more than 2 disks of the same raidz2 pack are in the same airflow path (or the equivalent for RAID-1).

Daniel
On Tue, 18 Jul 2006, Richard Elling wrote:
> When we take into account time, the models we use are Markov models
> which consider the amount of space used, disk size, block and whole-disk
> failures, RAID scheme, recovery-from-tape time, spares, etc.
> Hence my request for the popular configs, which I will model in detail.

This, for me, is a fascinating thread. What I'd like to see, please, is some detailed notes on the methodology employed, for those of us who have no idea what a Markov model is. That, and the thought process behind it, and some of the terminology explained.

For example, when one refers to an N-way RAID-Z pool, is that N disks + 1 for parity, or does N include the parity disk (or disks, in the case of RAID-Z2)?

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member
President, Rite Online Inc.
URL: http://www.rite-group.com/rich
On Tue, 2006-07-18 at 15:32, Daniel Rock wrote:
>> Stop right here! :) If you have a large number of identical disks which
>> operate in the same environment[1], and possibly the same enclosure, it's
>> quite likely that you'll see 2 or more disks die within the same,
>> relatively short, timeframe.
>
> Not my experience. I work and have worked with several disk arrays (EMC,
> IBM, Sun, etc.) and the failure rates of individual disks were fairly random.

My observation is that occasionally -- very occasionally -- you will get a bad batch of disks which, due to a subtle design or manufacturing defect, will all pass their tests, etc., run fine for some small number of months or years (and, of course, long enough for you to believe they're ready for production use..), and then start dying in droves.

The paranoid in me wonders whether it would be worthwhile to buy pairs of disks of the same size from each of two different manufacturers, and mirror between unlike pairs, to control against this risk...

- Bill
On 7/18/06, Bill Sommerfeld <sommerfeld at sun.com> wrote:
> My observation is that occasionally -- very occasionally -- you will get
> a bad batch of disks which, due to a subtle design or manufacturing
> defect, will all pass their tests, etc., run fine for some small number
> of months or years, and then start dying in droves.
>
> The paranoid in me wonders whether it would be worthwhile to buy pairs
> of disks of the same size from each of two different manufacturers, and
> mirror between unlike pairs, to control against this risk...

Well, feel free to contact the system builders and sell them on your idea of how they should procure and install drives from multiple vendors, preferably from different lots, into the x4500. They will love the extra hassle ;-p

James Dickens
uadmin.blogspot.com
Bill Sommerfeld wrote:
> My observation is that occasionally -- very occasionally -- you will get
> a bad batch of disks which, due to a subtle design or manufacturing
> defect, will all pass their tests, etc., run fine for some small number
> of months or years (and, of course, long enough for you to believe
> they're ready for production use..), and then start dying in droves.

Yes, something like the stiction problem that plagued old Quantum ProDrives, or the phosphorus contamination which plagued some control electronics, or the defect growth rates caused by crystal growth in the mechanics, and so on. In my experience, there are also cases where bad power supplies cause a whole bunch of unhappiness. I've also heard horror stories about failed air conditioning and extreme vibration problems (eg. a stamping plant). From a modelling perspective, these are difficult because we don't know how to assign reasonable failure rates to them. So beware: most availability models assume perfect manufacturing and operating environments as well as bug-free software and firmware. YMMV.

> The paranoid in me wonders whether it would be worthwhile to buy pairs
> of disks of the same size from each of two different manufacturers, and
> mirror between unlike pairs, to control against this risk...

First, let's convince everyone to mirror and not RAID-Z[2] -- boil one ocean at a time, there are only 5 you know... :-)
 -- richard
On Tue, 18 Jul 2006, Al Hopper wrote:
> Also, with today's higher-density disk enclosures, a fan failure which
> goes un-noticed for a period of time is likely to affect more than one
> drive - again leading to multiple disks failing in the same general
> timeframe.
>
> This is also why I advocate having cold spares available - so that the
> probability of the spare failing within the same timeframe is greatly
> diminished.

A good SMART implementation combined with a decent sensor framework can also be useful for dealing with these conditions. Smartmontools is currently able to send E-mail when the ambient temperature of a disk drive goes beyond the recommended thresholds. I am hopeful the Solaris SMART implementation will take temperature into account, since modern disk drives run hot, and fan failures aren't all that uncommon.

- Ryan

--
UNIX Administrator
http://prefetch.net
On Wed, Jul 19, 2006 at 12:43:21AM -0400, Matty wrote:
> A good SMART implementation combined with a decent sensor framework can
> also be useful for dealing with these conditions. Smartmontools is
> currently able to send E-mail when the ambient temperature of a disk
> drive goes beyond the recommended thresholds.

Hopefully, but I believe the only supported (public) SMART interface is the 'predictive failure bit'. Each drive vendor has a slew of other internal variables, but they don't publish the specs because they generally don't want folks second-guessing their internal algorithms.

However, I believe there is also a SCSI environmental sensor protocol that can do things like temperature monitoring that we'll also want to incorporate in future diagnosis engines. The current thumper-specific diagnosis engine does this, but we're working on generalizing the framework and more tightly integrating with ZFS.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
Richard Elling schrieb:
> First, let's convince everyone to mirror and not RAID-Z[2] -- boil one
> ocean at a time, there are only 5 you know... :-)

For maximum protection, a 4-disk RAID-Z2 is *always* better than 4-disk RAID-1+0. With more disks, use multiple 4-disk RAID-Z2 packs.

Daniel
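In the same combinatorial terms used earlier in the thread, the 4-disk case does favor RAID-Z2: it survives more of the 16 possible disk states and, unlike the striped mirror, survives every possible 2-disk failure. A quick brute-force check (the mirror pairing is an assumption; any fixed pairing gives the same count):

from itertools import product

PAIRS = [(0, 1), (2, 3)]  # assumed pairing for 2x2-way RAID-1+0

def mirror_ok(failed):
    # striped mirror dies only when both disks of a pair are gone
    return all(not (a in failed and b in failed) for a, b in PAIRS)

def raidz2_ok(failed):
    # 4-disk RAID-Z2 tolerates any two failures
    return len(failed) <= 2

states = [{i for i in range(4) if bits[i]} for bits in product((0, 1), repeat=4)]
print("2x2-way RAID-1+0:", sum(mirror_ok(s) for s in states), "of 16 states survive")  # 9
print("4-disk RAID-Z2:  ", sum(raidz2_ok(s) for s in states), "of 16 states survive")  # 11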
Eric Schrock wrote:
> One thing I would pay attention to is the future world of native ZFS
> root. On a thumper, you only have two drives which are bootable from
> the BIOS. For any application in which reliability is important, you
> would have these two drives mirrored as your root filesystem. There can
> be no hot spares for this pool, because any device you hot-spare in will
> not be bootable from the BIOS.
>
> For all the Thumper raidz2 models, I would assume only having 46 disks.
> This gives a nice bias towards one of the following configurations:
>
>  - 5x(7+2), 1 hot spare, 21.0TB
>  - 4x(9+2), 2 hot spares, 18.0TB
>  - 6x(5+2), 4 hot spares, 15.0TB

And in order to mitigate the impact of the lack of root spares in the scenario above, I'd go for plenty of hot spares, and do a manual swap of one hot spare with the failing root mirror.

Henk
> This gives a nice bias towards one of the following configurations:
>
>  - 5x(7+2), 1 hot spare, 17.5TB [corrected]
>  - 4x(9+2), 2 hot spares, 18.0TB
>  - 6x(5+2), 4 hot spares, 15.0TB

In addition to Eric's suggestions, I would be interested in these configs for 46 disks:

  5 x (8+1)   1 hot spare    20.0 TB
  4 x (10+1)  2 hot spares   20.0 TB
  6 x (6+1)   4 hot spares   18.0 TB

In a few cases, we might want more space rather than 2-disk parity. Thanks.
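For reference, the usable space of both the double-parity layouts Eric listed and the single-parity alternatives above is easy to tabulate. A small sketch, assuming the 500 GB drive option (the size these TB figures imply) and 46 data-eligible disks after the two boot drives are set aside:

DRIVE_TB = 0.5   # assumed drive size
TOTAL = 46       # data-eligible disks

layouts = {                       # groups x (data + parity)
    "5x(7+2)  raidz2": (5, 7, 2),
    "4x(9+2)  raidz2": (4, 9, 2),
    "6x(5+2)  raidz2": (6, 5, 2),
    "5x(8+1)  raidz ": (5, 8, 1),
    "4x(10+1) raidz ": (4, 10, 1),
    "6x(6+1)  raidz ": (6, 6, 1),
}

for name, (g, d, p) in layouts.items():
    spares = TOTAL - g * (d + p)
    print(f"{name}: {spares} spare(s), {g * d * DRIVE_TB:.1f} TB usable")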
Perhaps these are good picks:

  5 x (7+2)   1 hot spare    35 data disks  <- best safety
  5 x (8+1)   1 hot spare    40 data disks  <- best space
  9 x (4+1)   1 hot spare    36 data disks  <- best speed
  1 x (45+1)  0 hot spares   45 data disks  <- max space
  23 x (1+1)  0 hot spares   23 data disks  <- max speed

It would be nice to see some kinda metadata test, `du` or `find`:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054
Thanks Rob, one comment below.

Rob Logan wrote:
> perhaps these are good picks:
>
> 5 x (7+2)   1 hot spare    35 data disks  <- best safety
> 5 x (8+1)   1 hot spare    40 data disks  <- best space
> 9 x (4+1)   1 hot spare    36 data disks  <- best speed
> 1 x (45+1)  0 hot spares   45 data disks  <- max space

This one stretches the models a bit. In one model, the MTTDL is ~1200 years, and in a more detailed model it is 6 years. Most people will be very unhappy with an MTTDL of 6 years. To put this in perspective, a 46-disk RAID-0 has an MTTDL of less than 2 years in all models.

I'd like to hear from the ZFS team how such a wide stripe would be expected to perform :-)
 -- richard
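For a rough feel for why the very wide stripe fares so badly, the usual textbook closed-form MTTDL approximations are enough to show the scaling. These are not the models Richard is quoting (so the absolute numbers will not match his ~1200-year or 6-year figures), and the per-disk MTBF and MTTR below are pure assumptions; only the relative comparison is the point.

mtbf = 1_000_000.0   # assumed per-disk MTBF, hours
mttr = 24.0          # assumed detect + replace + resilver time, hours

def mttdl_raidz(n, groups=1):
    # single parity: MTTDL ~= MTBF^2 / (G * N * (N-1) * MTTR)
    return mtbf**2 / (groups * n * (n - 1) * mttr)

def mttdl_raidz2(n, groups=1):
    # double parity: MTTDL ~= MTBF^3 / (G * N * (N-1) * (N-2) * MTTR^2)
    return mtbf**3 / (groups * n * (n - 1) * (n - 2) * mttr**2)

for label, hours in [
    ("1 x (45+1) raidz ", mttdl_raidz(46)),
    ("5 x (8+1)  raidz ", mttdl_raidz(9, groups=5)),
    ("5 x (7+2)  raidz2", mttdl_raidz2(9, groups=5)),
]:
    print(f"{label}: ~{hours / 8760:,.0f} years MTTDL")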
On Sat, 22 Jul 2006, Richard Elling wrote:
> This one stretches the models a bit. In one model, the MTTDL is

For us storage newbies, what is MTTDL? I would guess Mean Time To Data Loss, which presumably is some multiple of the drives' MTBF (Mean Time Between Failures)?

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member
President, Rite Online Inc.
URL: http://www.rite-group.com/rich
Rich Teer wrote:
> For us storage newbies, what is MTTDL? I would guess Mean Time
> To Data Loss, which presumably is some multiple of the drives'
> MTBF (Mean Time Between Failures)?

Correct.

  MTTDL = Mean Time To Data Loss
  MTBF  = Mean Time Between Failures
  MTTR  = Mean Time To Recover
  MTBS  = Mean Time Between Services (eg. repair action)
  MTBSI = Mean Time Between Service Interruptions

When we talk about retention, we worry about MTTDL. When we talk about data availability, we worry about MTBSI. When we talk about spares stocking or service intervals, MTBS. Systems architecture, component selection, and configuration all interact with each other. It would be nice to have some really good dependability benchmarks to apply, but that discipline is still in its early stages.
 -- richard