Matthew Angelo
2011-Feb-07 02:45 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
I require a new high-capacity 8-disk zpool. The disks I will be purchasing (Samsung or Hitachi) are 2TB with a non-recoverable read error rate of 1 in 10^14 bits. I'm staying clear of WD because their new drives use 4096-byte (Advanced Format) sectors, which don't play nicely with ZFS at the moment.

My question is, how do I determine which of the following zpool and vdev configurations I should run to maximize space whilst mitigating rebuild failure risk?

1. 2x RAIDZ (3+1) vdevs
2. 1x RAIDZ (7+1) vdev
3. 1x RAIDZ2 (7+1) vdev

I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x 2TB disks.

Cheers
Ian Collins
2011-Feb-07 04:18 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On 02/07/11 03:45 PM, Matthew Angelo wrote:
> I require a new high-capacity 8-disk zpool. The disks I will be
> purchasing (Samsung or Hitachi) are 2TB with a non-recoverable read
> error rate of 1 in 10^14 bits.
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?
>
> 1. 2x RAIDZ (3+1) vdevs
> 2. 1x RAIDZ (7+1) vdev
> 3. 1x RAIDZ2 (7+1) vdev

I assume 3 was meant to be 6+2.

A bigger issue than drive error rates is how long a new 2TB drive will take to resilver if one dies. How long are you willing to run without redundancy in your pool?

--
Ian.
Edward Ned Harvey
2011-Feb-07 04:48 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Matthew Angelo
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?
>
> 1. 2x RAIDZ (3+1) vdevs
> 2. 1x RAIDZ (7+1) vdev
> 3. 1x RAIDZ2 (6+2) vdev
>
> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
> 2TB disks.

(Corrected the typo to 6+2 for you.)

Sounds like you have made up your mind already. Nothing wrong with that. You are apparently uncomfortable running with only one disk's worth of redundancy. There is nothing fundamentally wrong with the raidz1 configuration, but the probability of failure is obviously higher.

The question is how you calculate that probability. Because if we're talking about 5e-21 versus 3e-19, you probably don't care about the difference; they're both essentially zero. Well, there's no good answer to that.

With the cited bit error rate, you are only representing the probability of a bit error. You are not representing the probability of a failed drive, and you are not representing the probability of a drive failure within a specified time window. What you really care about is the probability of two drives (or three drives) failing concurrently, in which case you need to model the probability of any one drive failing within a specified time window. And even if you want to model that probability, in reality it isn't linear. The probability of a drive failing between 1yr and 1yr+3hrs is smaller than the probability of the drive failing between 3yr and 3yr+3hrs, because after three years the failure rate will be higher. So after three years, the probability of multiple simultaneous failures is higher.

I recently saw some Seagate data sheets which specified the annual disk failure rate to be 0.3%. Again, this is a linear model representing a nonlinear reality.

Suppose one disk fails: how many weeks does it take to get a replacement onsite under the 3yr limited mail-in warranty?

But then again, after three years you're probably considering this your antique hardware, and all the stuff you care about is on a newer server. Etc.

There's no good answer to your question. You are obviously uncomfortable with a single disk's worth of redundancy. Go with your gut. Sleep well at night. It only costs you $100. You probably have a cell phone with no backups worth more than that in your pocket right now.
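A minimal Python sketch of the time-window calculation described above, assuming a constant (exponential) failure rate, which is exactly the linear simplification cautioned against; the 0.3% annual failure rate and the 48-hour resilver window are placeholder assumptions:

```python
# Sketch: probability of additional drive failures during a resilver window,
# assuming a constant failure rate and independent drives. Real drives fail
# at a rate that rises with age, so treat the numbers as rough estimates.
import math

def p_fail_within(hours, afr=0.003):
    """P(one drive fails within 'hours') for an annual failure rate 'afr'
    (0.003 = 0.3%/year), assuming exponentially distributed lifetimes."""
    rate_per_hour = -math.log(1.0 - afr) / 8760.0   # 8760 hours per year
    return 1.0 - math.exp(-rate_per_hour * hours)

def p_at_least_k_of_n(n, k, p):
    """P(at least k of n independent drives fail), each with probability p."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i)
               for i in range(k, n + 1))

if __name__ == "__main__":
    p1 = p_fail_within(hours=48)   # assumed 2-day exposure window
    # raidz1 (7+1): after one failure, any of the 7 survivors failing loses data.
    print("raidz1: P(>=1 more of 7 fails) = %.2e" % p_at_least_k_of_n(7, 1, p1))
    # raidz2 (6+2): after one failure, two of the 7 survivors must also fail.
    print("raidz2: P(>=2 more of 7 fail) = %.2e" % p_at_least_k_of_n(7, 2, p1))
```

The point of the exercise is the ratio between the two results rather than their absolute values: the double-parity case comes out several orders of magnitude lower even when both numbers look tiny.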
Matthew Angelo
2011-Feb-07 06:22 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
Yes, I did mean 6+2; thank you for fixing the typo.

I'm actually leaning more towards running a simple 7+1 RAIDZ1. Running this with 1TB disks is not a problem; I just wanted to investigate at what disk size the scales would tip. I understand RAIDZ2 protects against failures during the rebuild process. Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks, assuming a worst case of 2 days, that is my 'exposure' time.

For example, I would hazard a confident guess that 7+1 RAIDZ1 with 6TB drives wouldn't be a smart idea. I'm just trying to extrapolate down.

I will be running a hot (or maybe cold) spare, so I don't need to factor in the time it takes for a manufacturer to replace the drive.
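To make the "where do the scales tip" question concrete, here is a minimal sketch of the probability of hitting at least one unrecoverable read error while reading the surviving disks during a 7+1 RAIDZ1 rebuild, using the 1-in-10^14-bit error rate quoted earlier and assuming, as a worst case, a full pool whose surviving disks are read end to end:

```python
# Sketch: chance of at least one unrecoverable read error (URE) while
# resilvering a 7+1 raidz1 vdev, assuming a full pool and independent bit
# errors at the quoted rate of 1 error per 1e14 bits read.
import math

def p_ure_during_rebuild(disk_tb, surviving_disks=7, ber=1e-14):
    bits_read = surviving_disks * disk_tb * 1e12 * 8   # decimal TB -> bits
    return 1.0 - math.exp(-ber * bits_read)            # P(at least one URE)

if __name__ == "__main__":
    for tb in (1, 2, 3, 4, 6):
        print("%d TB disks: P(>=1 URE during rebuild) ~ %4.0f%%"
              % (tb, 100 * p_ure_during_rebuild(tb)))
```

Because ZFS resilvers only allocated data, and a single URE costs one block rather than the whole vdev, these figures overstate the practical risk; they are mostly useful for comparing disk sizes against each other.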
Richard Elling
2011-Feb-07 07:01 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 6, 2011, at 6:45 PM, Matthew Angelo wrote:
> I require a new high-capacity 8-disk zpool. The disks I will be
> purchasing (Samsung or Hitachi) are 2TB with a non-recoverable read
> error rate of 1 in 10^14 bits.
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?

The MTTDL[2] model will work.
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

As described, this model doesn't scale well for N > 3 or 4, but it will get you in the ballpark. You will also need to know the MTBF from the data sheet, but if you don't have that info, that is OK, because you are asking the right question: given a single drive type, what is the best configuration for preventing data loss?

Finally, to calculate the raidz2 result, you need to know the mean time to recovery (MTTR), which includes the logistical replacement time and the resilver time. Basically, the model calculates the probability of a data loss event during reconstruction. This is different for ZFS than for most other LVMs, because ZFS resilvers only data, and the total data <= disk size.

> 1. 2x RAIDZ (3+1) vdevs
> 2. 1x RAIDZ (7+1) vdev
> 3. 1x RAIDZ2 (7+1) vdev
>
> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
> 2TB disks.

Double parity will win over single parity. Intuitively, when you add parity you multiply by the MTBF; when you add disks to a set, you change the denominator by a few digits. Obviously multiplication is a good thing, division not so much. In short, raidz2 is the better choice.
 -- richard
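For readers who do not want to work through the linked post, the sketch below shows the general shape of MTTDL-style estimates for single and double parity; the exact formulas in the linked model may differ, and the MTBF, MTTR and error-rate values here are placeholders to be replaced with data-sheet numbers:

```python
# Sketch of textbook MTTDL approximations for an N-disk raidz1/raidz2 vdev,
# assuming constant failure rates and independent failures. Substitute real
# data-sheet MTBF and a realistic MTTR (logistics + resilver time).
HOURS_PER_YEAR = 8760.0

def mttdl_raidz1(n, mtbf_h, mttr_h):
    # Single parity: data is lost if a second disk fails during the MTTR
    # window that follows the first failure.
    return mtbf_h ** 2 / (n * (n - 1) * mttr_h)

def mttdl_raidz2(n, mtbf_h, mttr_h):
    # Double parity: a third failure must land inside two recovery windows.
    return mtbf_h ** 3 / (n * (n - 1) * (n - 2) * mttr_h ** 2)

def expected_ure_per_disk_read(data_tb, ber=1e-14):
    # Expected unrecoverable read errors per disk's worth of data read back,
    # the ingredient that MTTDL[2]-style models add on top of MTBF and MTTR.
    return data_tb * 1e12 * 8 * ber

if __name__ == "__main__":
    n, mtbf, mttr = 8, 1_000_000.0, 48.0   # assumed: 1M-hour MTBF, 2-day MTTR
    print("raidz1 MTTDL ~ %.0f years" % (mttdl_raidz1(n, mtbf, mttr) / HOURS_PER_YEAR))
    print("raidz2 MTTDL ~ %.0f years" % (mttdl_raidz2(n, mtbf, mttr) / HOURS_PER_YEAR))
    print("expected UREs per 2 TB read ~ %.2f" % expected_ure_per_disk_read(2.0))
```

Whatever the exact constants, the structural point above holds: adding parity multiplies in another factor of MTBF, while adding disks only grows the denominator.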
Sandon Van Ness
2011-Feb-07 13:23 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
I think the risk of data integrity problems and complete volume loss is highest in the following order:

1. 1x RAIDZ (7+1)
2. 2x RAIDZ (3+1)
3. 1x RAIDZ2 (6+2)

Simple raidz certainly is an option with only 8 disks (8 is about the maximum I would go), but to be honest I would feel safer going raidz2. The 2x raidz (3+1) would probably perform the best, but I would prefer the third option (raidz2), as it is better for redundancy. With raidz2, any two disks can fail, and if you hit some unrecoverable read errors during a scrub you have a much better chance of avoiding corruption, thanks to the second copy of parity covering the same data.
Peter Jeremy
2011-Feb-07 21:07 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers at gmail.com> wrote:
> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
> Running this with 1TB disks is not a problem; I just wanted to
> investigate at what disk size the scales would tip.

It's not that simple. Whilst resilver time is proportional to device size, it's far more impacted by the degree of fragmentation of the pool. And there's no 'tipping point' - it's a gradual slope, so it's really up to you to decide where you want to sit on the probability curve.

> I understand RAIDZ2 protects against failures during the rebuild process.

This would be its current primary purpose.

> Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with
> 2TB disks, assuming a worst case of 2 days, that is my 'exposure' time.

Unless this is a write-once pool, you can probably also assume that your pool will get more fragmented over time, so by the time your pool reaches twice its current capacity, it might well take 3 days to rebuild due to the additional fragmentation.

One point I haven't seen mentioned elsewhere in this thread is that all the calculations so far have assumed that drive failures are independent. In practice, this probably isn't true. All HDD manufacturers have their "off" days - where whole batches or models of disks are cr*p and fail unexpectedly early. The WD EARS is simply a demonstration that it's WD's turn to turn out junk. Your best protection against this is to have disks from enough different batches that a batch failure won't take out your pool. PSU, fan and SATA controller failures are likely to take out multiple disks, but it's far harder to include enough redundancy to handle these, and your best approach is probably to have good backups.

> I will be running a hot (or maybe cold) spare, so I don't need to
> factor in the time it takes for a manufacturer to replace the drive.

In which case, the question is more whether an 8-way RAIDZ1 with a hot spare (7+1+1) is better than a 9-way RAIDZ2 (7+2). In the latter case, your "hot spare" is already part of the pool, so you don't lose the time-to-notice plus time-to-resilver before regaining redundancy. The downside is that actively using the "hot spare" may increase the probability of it failing.

--
Peter Jeremy
Richard Elling
2011-Feb-08 00:53 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:
> On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers at gmail.com> wrote:
>> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>> Running this with 1TB disks is not a problem; I just wanted to
>> investigate at what disk size the scales would tip.
>
> It's not that simple. Whilst resilver time is proportional to device
> size, it's far more impacted by the degree of fragmentation of the
> pool. And there's no 'tipping point' - it's a gradual slope, so it's
> really up to you to decide where you want to sit on the probability
> curve.

The "tipping point" won't occur between similar configurations. The tip occurs between different configurations. In particular, if the size of the N+M parity scheme is very large and the resilver times become very, very long (weeks), then an (M-1)-way mirror scheme can provide better performance and dependability. But I consider these to be extreme cases.

>> I understand RAIDZ2 protects against failures during the rebuild process.
>
> This would be its current primary purpose.
>
>> Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with
>> 2TB disks, assuming a worst case of 2 days, that is my 'exposure' time.
>
> Unless this is a write-once pool, you can probably also assume that
> your pool will get more fragmented over time, so by the time your
> pool reaches twice its current capacity, it might well take 3 days
> to rebuild due to the additional fragmentation.
>
> One point I haven't seen mentioned elsewhere in this thread is that
> all the calculations so far have assumed that drive failures are
> independent. In practice, this probably isn't true. All HDD
> manufacturers have their "off" days - where whole batches or models of
> disks are cr*p and fail unexpectedly early. The WD EARS is simply a
> demonstration that it's WD's turn to turn out junk. Your best
> protection against this is to have disks from enough different batches
> that a batch failure won't take out your pool.

The problem with treating failures as correlated is that you cannot get the failure rate information from the vendors. You could guess, or use your own data, but it would not always help you make a better design decision.

> PSU, fan and SATA controller failures are likely to take out multiple
> disks, but it's far harder to include enough redundancy to handle these,
> and your best approach is probably to have good backups.

The top four items that fail most often, in no particular order, are: fans, power supplies, memory, and disks. This is why you will see enterprise-class servers use redundant fans, multiple high-quality power supplies, ECC memory, and some sort of RAID.

>> I will be running a hot (or maybe cold) spare, so I don't need to
>> factor in the time it takes for a manufacturer to replace the drive.
>
> In which case, the question is more whether an 8-way RAIDZ1 with a
> hot spare (7+1+1) is better than a 9-way RAIDZ2 (7+2).

In this case, raidz2 is much better for dependability because the "spare" is already "resilvered." It also performs better, though the dependability gains tend to be bigger than the performance gains.

> In the latter case, your "hot spare" is already part of the pool, so
> you don't lose the time-to-notice plus time-to-resilver before regaining
> redundancy. The downside is that actively using the "hot spare" may
> increase the probability of it failing.

No. The disk failure rate data does not conclusively show that activity causes premature failure. Other failure modes dominate.
 -- richard
Paul Kraus
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Mon, Feb 7, 2011 at 7:53 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:
>> On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers at gmail.com> wrote:
>>> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>>> Running this with 1TB disks is not a problem; I just wanted to
>>> investigate at what disk size the scales would tip.
>>
>> It's not that simple. Whilst resilver time is proportional to device
>> size, it's far more impacted by the degree of fragmentation of the
>> pool. And there's no 'tipping point' - it's a gradual slope, so it's
>> really up to you to decide where you want to sit on the probability
>> curve.
>
> The "tipping point" won't occur between similar configurations. The tip
> occurs between different configurations. In particular, if the size of
> the N+M parity scheme is very large and the resilver times become
> very, very long (weeks), then an (M-1)-way mirror scheme can provide
> better performance and dependability. But I consider these to be
> extreme cases.

Empirically, it seems that resilver time is related to the number of objects as much as (if not more than) the amount of data: zpools (mirrors) with similar amounts of data but radically different numbers of objects take very different amounts of time to resilver. I have NOT (yet) started actually measuring and tracking this; the above is based on casual observation.

P.S. I am measuring the number of objects via `zdb -d`, as that is faster than trying to count files and directories, and I expect it is a much better measure of what the underlying ZFS code is dealing with (a particular dataset may have lots of snapshot data that does not (easily) show up).

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
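For anyone who wants to do the same bookkeeping, a small sketch that tallies per-dataset object counts from `zdb -d` output follows; it assumes each dataset line ends with ", <N> objects", and since zdb output is not a stable interface the parsing may need adjusting for other releases (the default pool name "tank" is only an example):

```python
# Sketch: sum per-dataset object counts reported by `zdb -d <pool>`.
# Assumes dataset lines of the form "Dataset <name> ..., <N> objects";
# zdb output is not a stable interface, so verify the format first.
import re
import subprocess
import sys

def object_counts(pool):
    out = subprocess.run(["zdb", "-d", pool], capture_output=True,
                         text=True, check=True).stdout
    counts = {}
    for line in out.splitlines():
        m = re.search(r"Dataset\s+(\S+).*?(\d+)\s+objects", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

if __name__ == "__main__":
    pool = sys.argv[1] if len(sys.argv) > 1 else "tank"   # example pool name
    counts = object_counts(pool)
    for ds, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%12d  %s" % (n, ds))
    print("%12d  TOTAL" % sum(counts.values()))
```

Tracking these totals alongside resilver times over a few disk replacements would be one way to turn the casual observation above into measured data.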
Nico Williams
2011-Feb-14 13:12 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 14, 2011 6:56 AM, "Paul Kraus" <paul at kraus-haus.org> wrote:
> P.S. I am measuring the number of objects via `zdb -d`, as that is faster
> than trying to count files and directories, and I expect it is a much
> better measure of what the underlying ZFS code is dealing with (a
> particular dataset may have lots of snapshot data that does not (easily)
> show up).

It's faster because: a) no atime updates, b) no ZPL overhead.

Nico
--