ZFS fans,

I'm preparing some analyses on RAS for large JBOD systems such as the Sun Fire X4500 (aka Thumper). Since there are zillions of possible permutations, I need to limit the analyses to some common or desirable scenarios. Naturally, I'd like your opinions. I've already got a few scenarios in analysis, and I don't want to spoil the brain storming, so feel free to think outside of the box.

If you had 46 disks to deploy, what combinations would you use? Why?

Examples:
  46-way RAID-0 (I'll do this just to show why you shouldn't do this)
  22x2-way RAID-1+0 + 2 hot spares
  15x3-way RAID-Z2+0 + 1 hot spare
  ...

Because some people get all wrapped up with the controllers, assume 5 8-disk SATA controllers plus 1 6-disk controller. Note: the reliability of the controllers is much greater than the reliability of the disks, so the data availability and MTTDL analysis will be dominated by the disks themselves. In part, this is due to using SATA/SAS (point-to-point disk connections) rather than a parallel bus or FC-AL, where we would also have to worry about bus or loop common-cause failures.

I will be concentrating on data availability and MTTDL as two views of RAS. The intention is that the interesting combinations will also be analyzed for performance and we can complete a full performability analysis on them.

Thanks
 -- richard
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns. Using that configuration, a full-width stripe write should be a single operation for each controller.

In production, the application needs would probably dictate the resulting disk layout. If the application doesn't need tons of i/o, you could bind more disks together for larger luns...

On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:
> If you had 46 disks to deploy, what combinations would you use? Why?

-----
Gregory Shaw, IT Architect
ITCTO Group, Sun Microsystems Inc.
I agree with Greg - for ZFS, I'd recommend a larger number of raidz luns, with a smaller number of disks per LUN, up to 6 disks per raidz lun.

This will more closely align with performance best practices, so it would be cool to find common ground in terms of a sweet spot for performance and RAS.

/jim

Gregory Shaw wrote:
> To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns.
> Using that configuration, a full-width stripe write should be a
> single operation for each controller.
>
> In production, the application needs would probably dictate the
> resulting disk layout. If the application doesn't need tons of i/o,
> you could bind more disks together for larger luns...
[stirring the pot a little...]

Jim Mauro wrote:
> I agree with Greg - for ZFS, I'd recommend a larger number of raidz
> luns, with a smaller number of disks per LUN, up to 6 disks per raidz lun.

For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z or RAID-Z2. For 3-5 disks, RAID-Z2 offers better resiliency, even over split-disk RAID-1+0.

> This will more closely align with performance best practices, so it
> would be cool to find common ground in terms of a sweet spot for
> performance and RAS.

It is clear that a single 46-way RAID-Z or RAID-Z2 zpool won't be popular :-)
 -- richard
> For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
> or RAID-Z2.

Maybe I'm missing something, but it ought to be the other way around. With 6 disks, RAID-Z2 can tolerate any two disk failures, whereas for 3x2-way mirroring, of the (6 choose 2) = 6*5/2 = 15 possible two-disk failure scenarios, three of them are fatal.

Jeff
Jeff Bonwick wrote:
>> For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
>> or RAID-Z2.
>
> Maybe I'm missing something, but it ought to be the other way around.
> With 6 disks, RAID-Z2 can tolerate any two disk failures, whereas
> for 3x2-way mirroring, of the (6 choose 2) = 6*5/2 = 15 possible
> two-disk failure scenarios, three of them are fatal.

For the 6-disk case, with RAID-1+0 you get 27/64 surviving states versus 22/64 for RAID-Z2. This accounts for the cases where you could lose 3 disks and survive with RAID-1+0.
 -- richard
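Both counts (and Jeff's three fatal pairs out of fifteen) are easy to verify by brute force. A minimal Python sketch follows; the specific mirror pairing is an assumption, but any fixed pairing gives the same totals.

from itertools import product, combinations

DISKS = 6
PAIRS = [(0, 1), (2, 3), (4, 5)]   # assumed pairing for 3x2-way RAID-1+0

def mirror_survives(failed):
    # RAID-1+0 loses data only when both halves of some mirror pair fail
    return all(not (a in failed and b in failed) for a, b in PAIRS)

def raidz2_survives(failed):
    # a 6-disk RAID-Z2 group tolerates any two failures, no more
    return len(failed) <= 2

states = [frozenset(i for i in range(DISKS) if bits[i])
          for bits in product((0, 1), repeat=DISKS)]

print(sum(mirror_survives(s) for s in states), "of 64 states survive RAID-1+0")  # 27
print(sum(raidz2_survives(s) for s in states), "of 64 states survive RAID-Z2")   # 22

two_disk = [set(c) for c in combinations(range(DISKS), 2)]
fatal = sum(not mirror_survives(s) for s in two_disk)
print(fatal, "of", len(two_disk), "two-disk failures are fatal to RAID-1+0")     # 3 of 15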
On Jul 18, 2006, at 8:58, Richard Elling wrote:
> Jeff Bonwick wrote:
>>> For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
>>> or RAID-Z2.
>> Maybe I'm missing something, but it ought to be the other way around.
>> With 6 disks, RAID-Z2 can tolerate any two disk failures, whereas
>> for 3x2-way mirroring, of the (6 choose 2) = 6*5/2 = 15 possible
>> two-disk failure scenarios, three of them are fatal.
>
> For the 6-disk case, with RAID-1+0 you get 27/64 surviving states
> versus 22/64 for RAID-Z2. This accounts for the cases where you could
> lose 3 disks and survive with RAID-1+0.

It seems to me that a useful resiliency calculation must include the probability of the failures. Just because there are more potential failure states for RAID-Z doesn't mean, in practical terms at least, that it is less resilient. Yes, there are 3-disk failures that the 3x2 arrangement will survive and RAID-Z2 won't, but there are (as Jeff pointed out) three 2-disk failures that are fatal to 3x2. Three different 2-failure scenarios add up to a much more likely occurrence than the net five scenarios (all requiring three or more failures) that would be fatal to RAID-Z2 but not 3x2.

--Ed
more below...

Ed Gould wrote:
> It seems to me that a useful resiliency calculation must include the
> probability of the failures. Just because there are more potential
> failure states for RAID-Z doesn't mean, in practical terms at least,
> that it is less resilient.

A combinatorial resiliency analysis has no concept of time. To consider reliability, you need to consider time. Ergo, the combinatorial analysis is only suitable when the reliability of the components is the same, such as the JBOD disk case. As usual, we also do a large number of models for data availability and MTTDL which are based on reliability, recovery, spares, etc.

Nevertheless, there are some valid cases where the combinatorial analysis is particularly useful: those where you cannot service, or cannot service for long periods of time. As you would expect, those cases also tend towards triple-redundant (3-way RAID-1) designs.

It is worth noting that RAID-Z2 is more resilient than 2-way RAID-1 when the number of disks is <= 5, but not once the number of disks grows to 6 or beyond. This is in line with Roch's performance optimization observations, where we may recommend something like 2x6-way RAID-Z2 over 12-way RAID-Z2 for performance and resiliency, sacrificing space.
 -- richard
Richard Elling schrieb:
> For the 6-disk case, with RAID-1+0 you get 27/64 surviving states
> versus 22/64 for RAID-Z2. This accounts for the cases where you could
> lose 3 disks and survive with RAID-1+0.

I think this type of calculation is flawed. Disk failures are rare, and multiple disk failures at the same time are even more rare.

Let's do some other calculation:

1. Assume each disk's reliability is independent of the others.

For ease of calculation:

2. One week between disk failure and its replacement (including resilvering).
3. Failure rate of 1% per week for each disk.

Compare:
  a. 6-disk RAID-1+0
  b. 6-disk RAID-Z2

i. 1-disk failures have a probability of ~5.7% per week.

But more interesting:

ii. 2-disk failures: 0.14% probability per week
      a. fatal probability: 20%
      b. fatal probability: 0%

iii. 3-disk failures: 0.002% probability per week
      a. fatal probability: 60%
      b. fatal probability: 100%

The remaining probabilities become more and more unlikely.

In summary, the probability of a fatal loss:
  a. 0.14% * 20% + 0.002% * 60% = 0.03% per week
  b. 0.14% * 0% + 0.002% * 100% = 0.002% per week

Daniel
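Daniel's figures are straightforward to reproduce. Here is a short Python sketch under his stated assumptions (independent disks, 1% per-disk failure rate per week, one week to replace and resilver), which, like his summary, ignores the vanishingly rare 4-or-more-disk failure terms.

from math import comb

p = 0.01   # per-disk failure probability per week (Daniel's assumption)
n = 6      # disks in the group

def p_exactly(k):
    # binomial probability of exactly k failed disks in a week
    return comb(n, k) * p**k * (1 - p)**(n - k)

# fraction of k-disk failure patterns that are fatal to each 6-disk layout
fatal_mirror = {2: 3/15, 3: 12/20}   # 20% and 60%
fatal_raidz2 = {2: 0.0,  3: 1.0}     # survives any 2 failures, never 3

loss_mirror = sum(p_exactly(k) * f for k, f in fatal_mirror.items())
loss_raidz2 = sum(p_exactly(k) * f for k, f in fatal_raidz2.items())

print(f"1-disk failures: {p_exactly(1):.4%} per week")        # ~5.7%
print(f"2-disk failures: {p_exactly(2):.4%} per week")        # ~0.14%
print(f"3-disk failures: {p_exactly(3):.4%} per week")        # ~0.002%
print(f"fatal loss, 3x2 mirror: {loss_mirror:.4%} per week")  # ~0.03%
print(f"fatal loss, RAID-Z2:    {loss_raidz2:.4%} per week")  # ~0.002%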
On Jul 18, 2006, at 10:35 AM, Ed Gould wrote:
> It seems to me that a useful resiliency calculation must include
> the probability of the failures.

To add to Ed's comments:

It would be good to add serviceability to the picture as well. If we can detect a failed disk and fix it without impact, the probability of a multi-disk failure decreases.

When it comes to cascade failures (lots of disks going bad, whether that be real (bad disks) or perceived (i/o subsystem failures)), the only real solution is to use discrete disk solutions with separate power, controllers, etc.

It goes back to the ILM model. If the value of your data justifies multiple disk subsystems, so be it. If it doesn't, and you have a cascade failure, I hope your backups are intact.

-----
Gregory Shaw, IT Architect
ITCTO Group, Sun Microsystems Inc.
Daniel,

When we take into account time, the models we use are Markov models which consider the amount of space used, disk size, block and whole-disk failures, RAID scheme, recovery-from-tape time, spares, etc. All of these views of the system are being analyzed. Needless to say, with the zillions of permutations of RAID schemes possible with a system such as the Sun Fire X4500, we'll never model all of them. Hence my request for the popular configs, which I will model in detail.
 -- richard

Daniel Rock wrote:
> In summary:
>
> Probability for a fatal loss
> a. 0.14% * 20% + 0.002% * 60% = 0.03% per week
> b. 0.14% * 0% + 0.002% * 100% = 0.002% per week
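To give a flavor of the technique Richard mentions: below is a toy sketch, not the models he is describing (his include space used, block failures, spares, and so on), just the smallest possible example. It treats a 2-way mirror as a three-state absorbing Markov chain (both up -> one failed -> data lost), with made-up MTBF and MTTR figures, and reads MTTDL off as the expected time to absorption.

import numpy as np

mtbf_hours = 500_000.0   # assumed per-disk MTBF
mttr_hours = 24.0        # assumed replace-and-resilver time
lam = 1.0 / mtbf_hours
mu = 1.0 / mttr_hours

# Generator restricted to the transient states: 0 = both up, 1 = one failed.
# Off-diagonal entries are transition rates; each diagonal entry is minus the
# total exit rate, including the rate into the absorbing data-loss state.
Q = np.array([[-2 * lam, 2 * lam],
              [mu,       -(mu + lam)]])

# Expected time to absorption t satisfies Q @ t = -1 (standard CTMC result).
t = np.linalg.solve(Q, -np.ones(2))
print(f"MTTDL of a 2-way mirror: {t[0]:.3e} hours (~{t[0] / 8760:,.0f} years)")

# Closed form for comparison: MTTDL = (3*lam + mu) / (2*lam**2)
print(f"closed form:             {(3 * lam + mu) / (2 * lam**2):.3e} hours")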
On Tue, 18 Jul 2006, Daniel Rock wrote:
> I think this type of calculation is flawed. Disk failures are rare, and
> multiple disk failures at the same time are even more rare.

Stop right here! :) If you have a large number of identical disks which operate in the same environment[1], and possibly the same enclosure, it's quite likely that you'll see 2 or more disks die within the same, relatively short, timeframe.

Also, with today's higher-density disk enclosures, a fan failure which goes un-noticed for a period of time is likely to affect more than one drive - again leading to multiple disks failing in the same general timeframe.

This is also why I advocate having cold spares available - so that the probability of the spare failing within the same timeframe is greatly diminished.

[1] Same ambient temp, same power quality, same IOPS (load), same vibration, etc.

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
On Tue, Jul 18, 2006 at 10:37:35AM -0700, Richard Elling wrote:
> When we take into account time, the models we use are Markov models
> which consider the amount of space used, disk size, block and whole-disk
> failures, RAID scheme, recovery-from-tape time, spares, etc.
> Hence my request for the popular configs, which I will model in detail.

My two cents -

One thing I would pay attention to is the future world of native ZFS root. On a thumper, you only have two drives which are bootable from the BIOS. For any application in which reliability is important, you would have these two drives mirrored as your root filesystem. There can be no hot spares for this pool, because any device you hot-spare in will not be bootable from the BIOS.

So you should assume that for any RAID configuration, you're going to have a mirror of c3t0 and c3t4 (disks 0 and 1) as your root pool, with the remaining 46 disks available for user data. The loss of both root disks would imply that the system becomes unavailable, but wouldn't necessarily result in loss of user data. If the model supports distinguishing these two outcomes, it could potentially cover such things as motherboard failure or controller failure, which would bring the system down but would not result in loss of data.

A truly complete model would also take into account the loss of fans (thumper has 5x2 fans covering 12 rows of disks), though I doubt that anyone has any reliable data on the effect of running with only one fan in a redundant group.

For all the Thumper raidz2 models, I would assume only having 46 disks. This gives a nice bias towards one of the following configurations:

 - 5x(7+2), 1 hot spare, 21.0TB
 - 4x(9+2), 2 hot spares, 18.0TB
 - 6x(5+2), 4 hot spares, 15.0TB

The performance characteristics of these configurations would be equally interesting.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
On Tue, Jul 18, 2006 at 10:59:59AM -0700, Eric Schrock wrote:
> One thing I would pay attention to is the future world of native ZFS
> root. On a thumper, you only have two drives which are bootable from
> the BIOS. For any application in which reliability is important, you
> would have these two drives mirrored as your root filesystem. There can
> be no hot spares for this pool, because any device you hot-spare in will
> not be bootable from the BIOS.

Of course, now I went back and checked the original message and noted that you are only dealing with 46 disks, not 48. So you're one step ahead of me ;-)

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
On Tue, Jul 18, 2006 at 10:59:59AM -0700, Eric Schrock wrote:
> - 5x(7+2), 1 hot spare, 21.0TB
                          ^^^^^^
Sigh, that should also be '17.5TB', not 21TB.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
Al Hopper schrieb:
> Stop right here! :) If you have a large number of identical disks which
> operate in the same environment[1], and possibly the same enclosure, it's
> quite likely that you'll see 2 or more disks die within the same,
> relatively short, timeframe.

Not my experience. I work and have worked with several disk arrays (EMC, IBM, Sun, etc.) and the failure rates of individual disks were fairly random.

> Also, with today's higher-density disk enclosures, a fan failure which
> goes un-noticed for a period of time is likely to affect more than one
> drive - again leading to multiple disks failing in the same general
> timeframe.

Then make sure not more than 2 disks of the same raidz2 pack are in the same airflow path (or the equivalent for RAID-1).

Daniel
On Tue, 18 Jul 2006, Richard Elling wrote:
> When we take into account time, the models we use are Markov models
> which consider the amount of space used, disk size, block and whole-disk
> failures, RAID scheme, recovery-from-tape time, spares, etc.
> Hence my request for the popular configs, which I will model in detail.

This, for me, is a fascinating thread. What I'd like to see, please, is some detailed notes on the methodology employed, for those of us who have no idea what a Markov model is. That, and the thought process behind it, and some of the terminology explained.

For example, when one refers to an N-way RAID-Z pool, is that N disks + 1 for parity, or does N include the parity disk (or disks, in the case of RAID-Z2)?

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member
President, Rite Online Inc.
URL: http://www.rite-group.com/rich
On Tue, 2006-07-18 at 15:32, Daniel Rock wrote:
>> Stop right here! :) If you have a large number of identical disks which
>> operate in the same environment[1], and possibly the same enclosure, it's
>> quite likely that you'll see 2 or more disks die within the same,
>> relatively short, timeframe.
>
> Not my experience. I work and have worked with several disk arrays (EMC,
> IBM, Sun, etc.) and the failure rates of individual disks were fairly random.

My observation is that occasionally -- very occasionally -- you will get a bad batch of disks which, due to a subtle design or manufacturing defect, will all pass their tests, etc., run fine for some small number of months or years (and, of course, long enough for you to believe they're ready for production use..), and then start dying in droves.

The paranoid in me wonders whether it would be worthwhile to buy pairs of disks of the same size from each of two different manufacturers, and mirror between unlike pairs, to control against this risk...

- Bill
On 7/18/06, Bill Sommerfeld <sommerfeld at sun.com> wrote:
> My observation is that occasionally -- very occasionally -- you will get
> a bad batch of disks which, due to a subtle design or manufacturing
> defect, will all pass their tests, etc., run fine for some small number
> of months or years, and then start dying in droves.
>
> The paranoid in me wonders whether it would be worthwhile to buy pairs
> of disks of the same size from each of two different manufacturers, and
> mirror between unlike pairs, to control against this risk...

Well, feel free to contact the system builders and sell them on your idea of how they should procure and install drives from multiple vendors, preferably from different lots, into the x4500. They will love the extra hassle ;-p

James Dickens
uadmin.blogspot.com
Bill Sommerfeld wrote:
> My observation is that occasionally -- very occasionally -- you will get
> a bad batch of disks which, due to a subtle design or manufacturing
> defect, will all pass their tests, etc., run fine for some small number
> of months or years (and, of course, long enough for you to believe
> they're ready for production use..), and then start dying in droves.

Yes, something like the stiction problem that plagued old Quantum ProDrives, or the phosphorus contamination which plagued some control electronics, or the defect growth rates caused by crystal growth in the mechanics, and so on. In my experience, there are also cases where bad power supplies cause a whole bunch of unhappiness. I've also heard horror stories about failed air conditioning and extreme vibration problems (eg. a stamping plant). From a modelling perspective, these are difficult because we don't know how to assign reasonable failure rates to them. So beware: most availability models assume perfect manufacturing and operating environments as well as bug-free software and firmware. YMMV.

> The paranoid in me wonders whether it would be worthwhile to buy pairs
> of disks of the same size from each of two different manufacturers, and
> mirror between unlike pairs, to control against this risk...

First, let's convince everyone to mirror and not RAID-Z[2] -- boil one ocean at a time, there are only 5 you know... :-)
 -- richard
On Tue, 18 Jul 2006, Al Hopper wrote:
> Also, with today's higher-density disk enclosures, a fan failure which
> goes un-noticed for a period of time is likely to affect more than one
> drive - again leading to multiple disks failing in the same general
> timeframe.
>
> This is also why I advocate having cold spares available - so that the
> probability of the spare failing within the same timeframe is greatly
> diminished.

A good SMART implementation combined with a decent sensor framework can also be useful for dealing with these conditions. Smartmontools is currently able to send E-mail when the ambient temperature of a disk drive goes beyond the recommended thresholds. I am hopeful the Solaris SMART implementation will take temperature into account, since modern disk drives run hot, and fan failures aren't all that uncommon.

- Ryan

--
UNIX Administrator
http://prefetch.net
On Wed, Jul 19, 2006 at 12:43:21AM -0400, Matty wrote:
> A good SMART implementation combined with a decent sensor framework can
> also be useful for dealing with these conditions. Smartmontools is
> currently able to send E-mail when the ambient temperature of a disk
> drive goes beyond the recommended thresholds.

Hopefully, but I believe the only supported (public) SMART interface is the 'predictive failure bit'. Each drive vendor has a slew of other internal variables, but they don't publish the specs because they generally don't want folks second-guessing their internal algorithms.

However, I believe there is also a SCSI environmental sensor protocol that can do things like temperature monitoring that we'll also want to incorporate in future diagnosis engines. The current thumper-specific diagnosis engine does this, but we're working on generalizing the framework and more tightly integrating with ZFS.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
Richard Elling schrieb:
> First, let's convince everyone to mirror and not RAID-Z[2] -- boil one
> ocean at a time, there are only 5 you know... :-)

For maximum protection, a 4-disk RAID-Z2 is *always* better than 4-disk RAID-1+0. With more disks, use multiple 4-disk RAID-Z2 packs.

Daniel
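In the same combinatorial terms used earlier in the thread, the 4-disk case does favor RAID-Z2: it survives more of the 16 possible disk states and, unlike the striped mirror, survives every possible 2-disk failure. A quick brute-force check (the mirror pairing is an assumption; any fixed pairing gives the same count):

from itertools import product

PAIRS = [(0, 1), (2, 3)]  # assumed pairing for 2x2-way RAID-1+0

def mirror_ok(failed):
    # striped mirror dies only when both disks of a pair are gone
    return all(not (a in failed and b in failed) for a, b in PAIRS)

def raidz2_ok(failed):
    # 4-disk RAID-Z2 tolerates any two failures
    return len(failed) <= 2

states = [{i for i in range(4) if bits[i]} for bits in product((0, 1), repeat=4)]
print("2x2-way RAID-1+0:", sum(mirror_ok(s) for s in states), "of 16 states survive")  # 9
print("4-disk RAID-Z2:  ", sum(raidz2_ok(s) for s in states), "of 16 states survive")  # 11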
Eric Schrock wrote:
> One thing I would pay attention to is the future world of native ZFS
> root. On a thumper, you only have two drives which are bootable from
> the BIOS. For any application in which reliability is important, you
> would have these two drives mirrored as your root filesystem. There can
> be no hot spares for this pool, because any device you hot-spare in will
> not be bootable from the BIOS.
>
> For all the Thumper raidz2 models, I would assume only having 46 disks.
> This gives a nice bias towards one of the following configurations:
>
>  - 5x(7+2), 1 hot spare, 21.0TB
>  - 4x(9+2), 2 hot spares, 18.0TB
>  - 6x(5+2), 4 hot spares, 15.0TB

And in order to mitigate the impact of the lack of root spares in the scenario above, I'd go for plenty of hot spares, and do a manual swap of one hot spare with the failing root mirror.

Henk
> This gives a nice bias towards one of the following configurations:
>
>  - 5x(7+2), 1 hot spare, 17.5TB [corrected]
>  - 4x(9+2), 2 hot spares, 18.0TB
>  - 6x(5+2), 4 hot spares, 15.0TB

In addition to Eric's suggestions, I would be interested in these configs for 46 disks:

  5 x (8+1)   1 hot spare    20.0 TB
  4 x (10+1)  2 hot spares   20.0 TB
  6 x (6+1)   4 hot spares   18.0 TB

In a few cases, we might want more space rather than 2-disk parity. Thanks.
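For reference, the usable space of both the double-parity layouts Eric listed and the single-parity alternatives above is easy to tabulate. A small sketch, assuming the 500 GB drive option (the size these TB figures imply) and 46 data-eligible disks after the two boot drives are set aside:

DRIVE_TB = 0.5   # assumed drive size
TOTAL = 46       # data-eligible disks

layouts = {                       # groups x (data + parity)
    "5x(7+2)  raidz2": (5, 7, 2),
    "4x(9+2)  raidz2": (4, 9, 2),
    "6x(5+2)  raidz2": (6, 5, 2),
    "5x(8+1)  raidz ": (5, 8, 1),
    "4x(10+1) raidz ": (4, 10, 1),
    "6x(6+1)  raidz ": (6, 6, 1),
}

for name, (g, d, p) in layouts.items():
    spares = TOTAL - g * (d + p)
    print(f"{name}: {spares} spare(s), {g * d * DRIVE_TB:.1f} TB usable")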
Perhaps these are good picks:

  5 x (7+2)   1 hot spare    35 data disks  <- best safety
  5 x (8+1)   1 hot spare    40 data disks  <- best space
  9 x (4+1)   1 hot spare    36 data disks  <- best speed
  1 x (45+1)  0 hot spares   45 data disks  <- max space
  23 x (1+1)  0 hot spares   23 data disks  <- max speed

It would be nice to see some kinda metadata test, `du` or `find`:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6437054
Thanks Rob, one comment below.

Rob Logan wrote:
> perhaps these are good picks:
>
> 5 x (7+2)   1 hot spare    35 data disks  <- best safety
> 5 x (8+1)   1 hot spare    40 data disks  <- best space
> 9 x (4+1)   1 hot spare    36 data disks  <- best speed
> 1 x (45+1)  0 hot spares   45 data disks  <- max space

This one stretches the models a bit. In one model, the MTTDL is ~1200 years, and in a more detailed model it is 6 years. Most people will be very unhappy with an MTTDL of 6 years. To put this in perspective, a 46-disk RAID-0 has an MTTDL of less than 2 years in all models.

I'd like to hear from the ZFS team how such a wide stripe would be expected to perform :-)
 -- richard
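For a rough feel for why the very wide stripe fares so badly, the usual textbook closed-form MTTDL approximations are enough to show the scaling. These are not the models Richard is quoting (so the absolute numbers will not match his ~1200-year or 6-year figures), and the per-disk MTBF and MTTR below are pure assumptions; only the relative comparison is the point.

mtbf = 1_000_000.0   # assumed per-disk MTBF, hours
mttr = 24.0          # assumed detect + replace + resilver time, hours

def mttdl_raidz(n, groups=1):
    # single parity: MTTDL ~= MTBF^2 / (G * N * (N-1) * MTTR)
    return mtbf**2 / (groups * n * (n - 1) * mttr)

def mttdl_raidz2(n, groups=1):
    # double parity: MTTDL ~= MTBF^3 / (G * N * (N-1) * (N-2) * MTTR^2)
    return mtbf**3 / (groups * n * (n - 1) * (n - 2) * mttr**2)

for label, hours in [
    ("1 x (45+1) raidz ", mttdl_raidz(46)),
    ("5 x (8+1)  raidz ", mttdl_raidz(9, groups=5)),
    ("5 x (7+2)  raidz2", mttdl_raidz2(9, groups=5)),
]:
    print(f"{label}: ~{hours / 8760:,.0f} years MTTDL")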
On Sat, 22 Jul 2006, Richard Elling wrote:
> This one stretches the models a bit. In one model, the MTTDL is

For us storage newbies, what is MTTDL? I would guess Mean Time To Data Loss, which presumably is some multiple of the drives' MTBF (Mean Time Between Failures)?

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member
President, Rite Online Inc.
URL: http://www.rite-group.com/rich
Rich Teer wrote:
> For us storage newbies, what is MTTDL? I would guess Mean Time
> To Data Loss, which presumably is some multiple of the drives'
> MTBF (Mean Time Between Failures)?

Correct.

  MTTDL = Mean Time To Data Loss
  MTBF  = Mean Time Between Failures
  MTTR  = Mean Time To Recover
  MTBS  = Mean Time Between Services (eg. repair action)
  MTBSI = Mean Time Between Service Interruptions

When we talk about retention, we worry about MTTDL. When we talk about data availability, we worry about MTBSI. When we talk about spares stocking or service intervals, MTBS. Systems architecture, component selection, and configuration all interact with each other. It would be nice to have some really good dependability benchmarks to apply, but that discipline is still in its early stages.
 -- richard