User Name
2008-Jul-11 04:14 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
I am building a 14-disk RAID-6 array with 1 TB Seagate AS (non-enterprise) drives. So there will be 14 disks total, 2 of them parity, with 12 TB of space available. My drives have an unrecoverable read error rate of 1 in 10^14 bits.

I am quite scared by my calculations. It appears that if one drive fails and I do a rebuild, I will perform:

13 * 8 * 10^12 = 104,000,000,000,000 reads

But my error rate is smaller:

10^14 = 100,000,000,000,000

So I am (theoretically) guaranteed to lose another drive on the RAID rebuild. Then the calculation for _that_ rebuild is:

12 * 8 * 10^12 = 96,000,000,000,000

So no longer guaranteed, but 96% isn't good. I have looked all over, and these seem to be the accepted calculations, which means if I ever have to rebuild, I'm toast.

But here is the question, the part I am having trouble understanding: the 13 * 8 * 10^12 operations required for the first rebuild... isn't that the number for _the entire array_? Any given 1 TB disk only has 8 * 10^12 bits on it _total_. So why would I ever do more than 8 * 10^12 operations on any one disk? It seems very odd to me that a RAID controller would have to access any given bit more than once to do a rebuild, and the total number of bits on a drive is far below the 10^14 figure.

So I guess my question is: why are we all doing this calculation, where we apply the total operations across an entire array rebuild to a single drive's error rate?

Thanks.
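For what it's worth, the numbers above can be sanity-checked with a few lines of Python. This is only a sketch of the back-of-envelope model in the post: it assumes 8 * 10^12 bits per 1 TB drive, one unrecoverable read error (URE) per 10^14 bits read, and statistically independent errors (a simplification real drives don't honour); the names are illustrative, not from the thread.

# Sanity check of the rebuild numbers above.  Assumptions (illustrative):
# ~8e12 bits per 1 TB drive, 1 unrecoverable read error (URE) per 1e14
# bits read, and statistically independent errors.

BITS_PER_DRIVE = 8e12     # 1 TB ~ 10^12 bytes ~ 8 * 10^12 bits
URE_RATE = 1e-14          # spec-sheet figure: errors per bit read

def expected_ures(surviving_drives):
    """Expected number of UREs while reading every surviving drive in full."""
    return surviving_drives * BITS_PER_DRIVE * URE_RATE

def p_at_least_one_ure(surviving_drives):
    """Probability of at least one URE during the rebuild, treating each
    bit read as an independent trial."""
    bits_read = surviving_drives * BITS_PER_DRIVE
    return 1.0 - (1.0 - URE_RATE) ** bits_read

print(expected_ures(13))         # ~1.04 expected errors across the rebuild
print(p_at_least_one_ure(13))    # ~0.65 -- likely, but not "guaranteed"

The expected-error count of ~1.04 is where the "guaranteed" feeling comes from, but treated as a probability it works out to roughly two-in-three odds of seeing at least one URE during the rebuild, not a certainty.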
Richard Elling
2008-Jul-11 05:47 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
User Name wrote:
> I am building a 14-disk RAID-6 array with 1 TB Seagate AS (non-enterprise) drives. So there will be 14 disks total, 2 of them parity, with 12 TB of space available. My drives have an unrecoverable read error rate of 1 in 10^14 bits.
>
> I am quite scared by my calculations. It appears that if one drive fails and I do a rebuild, I will perform 13 * 8 * 10^12 = 104,000,000,000,000 reads. But my error rate is smaller: 10^14 = 100,000,000,000,000. So I am (theoretically) guaranteed to lose another drive on the RAID rebuild. Then the calculation for _that_ rebuild is 12 * 8 * 10^12 = 96,000,000,000,000. So no longer guaranteed, but 96% isn't good.
>
> I have looked all over, and these seem to be the accepted calculations, which means if I ever have to rebuild, I'm toast.

If you were using RAID-5, you might be concerned. For RAID-6, or at least raidz2, you could recover from an unrecoverable read during the rebuild of one disk.

> But here is the question, the part I am having trouble understanding: the 13 * 8 * 10^12 operations required for the first rebuild... isn't that the number for _the entire array_? Any given 1 TB disk only has 8 * 10^12 bits on it _total_. So why would I ever do more than 8 * 10^12 operations on any one disk?

Actually, ZFS only rebuilds the data. So you need to multiply by the space utilization of the pool, which will usually be less than 100%.

> It seems very odd to me that a RAID controller would have to access any given bit more than once to do a rebuild, and the total number of bits on a drive is far below the 10^14 figure.
>
> So I guess my question is: why are we all doing this calculation, where we apply the total operations across an entire array rebuild to a single drive's error rate?

You might also be interested in this blog:
http://blogs.zdnet.com/storage/?p=162

A couple of things seem to be at work here. I study field data failure rates. We tend to see unrecoverable read failure rates at least an order of magnitude better than the specifications. This is a good thing, but it simply points out that the specifications are often sand-bagged -- they are not a guarantee.

However, you are quite right in your intuition that if you have a lot of bits of data, then you need to pay attention to the bit error rate (BER) of unrecoverable reads on disks. This sort of model can be used to determine a mean time to data loss (MTTDL), as I explain here:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

Perhaps it would help if we changed the math to show the risk as a function of the amount of data given the protection scheme? Hmmm... something like probability of data loss per year for N TBytes with configuration XYZ. Would that be more useful for evaluating configurations?
-- richard
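To make the "probability of data loss per year" suggestion concrete, here is a rough sketch of two textbook-style MTTDL models in the spirit of the blog posts above. It is not claimed to be the exact math from those posts; the MTBF and rebuild-time figures are invented for illustration.

# Rough sketch of two textbook-style MTTDL models for a single-parity
# group, in the spirit of the blog posts above (not their exact math).
# MTBF, rebuild time and drive size below are assumptions for illustration.

MTBF_HOURS = 1_000_000       # per-drive mean time between failures (assumed)
REBUILD_HOURS = 24           # time to rebuild one failed drive (assumed)
N_DRIVES = 14
BITS_PER_DRIVE = 8e12
URE_RATE = 1e-14
HOURS_PER_YEAR = 8766

def mttdl_double_drive_failure():
    """Classic RAID-5 style model: data loss requires a second whole-drive
    failure inside the rebuild window."""
    return MTBF_HOURS ** 2 / (N_DRIVES * (N_DRIVES - 1) * REBUILD_HOURS)

def mttdl_with_ure():
    """Add the unrecoverable-read path: with no parity left during the
    rebuild, a single URE while reading the survivors also loses data."""
    p_recon_fail = min(1.0, (N_DRIVES - 1) * BITS_PER_DRIVE * URE_RATE)
    return MTBF_HOURS / (N_DRIVES * p_recon_fail)

for model in (mttdl_double_drive_failure, mttdl_with_ure):
    print(f"{model.__name__}: ~{model() / HOURS_PER_YEAR:,.0f} years")

The point of the comparison: once the URE path is included, the single-parity MTTDL collapses from tens of thousands of years to single digits, which is exactly the exposure the original post is worried about.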
Ross
2008-Jul-11 07:25 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Without checking your math, I believe you may be confusing the risk of *any* data corruption with the risk of a total drive failure, but I do agree that the calculation should just be for the data on the drive, not the whole array.

My feeling on this, from the various analyses I've read on the web, is that you're reasonably likely to find some corruption on a drive during a rebuild, but RAID-6 protects you from this nicely. From memory, I think the stats were something like a 5% chance of an error on a 500 GB drive, which would mean something like a 10% chance with your 1 TB drives. That would tie in with your figures if you took out the multiplier for the whole RAID's data. Instead of a guaranteed failure, you've calculated around 1 in 10 odds.

So, during any rebuild you have around a 1 in 10 chance of the rebuild encountering *some* corruption, but that's very likely going to be just a few bits of data, which can be easily recovered using RAID-6, and the rest of the rebuild can carry on as normal.

Of course there's always a risk of a second drive failing, which is why we have backups, but I believe that risk is minuscule in comparison, and it is also offset by the ability to regularly scrub your data, which helps to ensure that any problems with drives are caught early on. Early replacement of failing drives means it's far less likely that you'll ever have two fail together.
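The per-drive percentages above line up roughly with the spec-sheet figure. A quick check, again assuming a rate of 1 URE per 10^14 bits read (purely for illustration):

# Quick check of the per-drive odds mentioned above, using an assumed
# URE rate of 1 per 1e14 bits read.
URE_RATE = 1e-14
for label, terabytes in (("500 GB", 0.5), ("1 TB", 1.0)):
    bits = terabytes * 1e12 * 8                 # decimal TB -> bits
    print(f"{label}: ~{bits * URE_RATE:.0%} chance of a URE reading the full drive")

That lands at roughly 4% and 8% rather than 5% and 10%, but it is the same order as the figures remembered here.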
User Name
2008-Jul-11 12:30 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Hello relling,

Thanks for your comments. FWIW, I am building an actual hardware array, so even though I _may_ put ZFS on top of the hardware array's 22 TB "drive" that the OS sees (I may not), I am focusing purely on the controller rebuild.

So, setting aside ZFS for the moment, am I still correct in my intuition that there is no way a _controller_ needs to touch a disk more times than there are bits on the entire disk, and that this calculation people are doing is faulty?

I will check out that blog -- thanks.
Richard Elling
2008-Jul-11 16:06 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
User Name wrote:
> Thanks for your comments. FWIW, I am building an actual hardware array, so even though I _may_ put ZFS on top of the hardware array's 22 TB "drive" that the OS sees (I may not), I am focusing purely on the controller rebuild.
>
> So, setting aside ZFS for the moment, am I still correct in my intuition that there is no way a _controller_ needs to touch a disk more times than there are bits on the entire disk, and that this calculation people are doing is faulty?

I think the calculation is correct, at least for the general case.

At FAST this year there was an interesting paper which tried to measure this exposure in a large field sample by using checksum verifications. I like this paper, and it validates what we see in the field -- the most common failure mode is the unrecoverable read.
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

I should also point out that ZFS is already designed to offer some diversity which should help guard against spatially clustered media failures. Hmmm... another blog topic in my queue...
-- richard
Akhilesh Mritunjai
2008-Jul-11 17:18 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
> Thanks for your comments. FWIW, I am building an actual hardware array, so even though I _may_ put ZFS on top of the hardware array's 22 TB "drive" that the OS sees (I may not), I am focusing purely on the controller rebuild.

Not letting ZFS handle (at least one level of) redundancy is a bad idea. Don't do that!
Bob Friesenhahn
2008-Jul-11 21:05 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
On Fri, 11 Jul 2008, Akhilesh Mritunjai wrote:
>> Thanks for your comments. FWIW, I am building an actual hardware array, so even though I _may_ put ZFS on top of the hardware array's 22 TB "drive" that the OS sees (I may not), I am focusing purely on the controller rebuild.
>
> Not letting ZFS handle (at least one level of) redundancy is a bad
> idea. Don't do that!

Agreed. A further issue to consider is mean time to recover/restore. This has quite a lot to do with actual uptime. For example, if you decide to create two huge 22 TB LUNs and mirror across them, and ZFS needs to resilver one of the LUNs, it will take a *long* time. A good design will try to keep any storage area which needs to be resilvered small enough that it can be restored quickly and the risk of secondary failure is minimized.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
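The resilver-time point can be put in rough numbers. The sustained throughput below is an assumption for illustration only; real resilver speed varies with pool layout, fill level and load.

# Back-of-envelope resilver times for the example above.  The sustained
# throughput is assumed; real resilvers vary with layout, fill and load.
RESILVER_MB_PER_S = 80

def resilver_hours(terabytes):
    megabytes = terabytes * 1e6          # decimal TB -> MB
    return megabytes / RESILVER_MB_PER_S / 3600

print(f"22 TB LUN  : ~{resilver_hours(22):.0f} hours")   # roughly three days
print(f"1 TB drive : ~{resilver_hours(1):.1f} hours")

A three-day resilver window is also a three-day window for a second failure to appear, which is why keeping the resilver domain small matters.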
Ross
2008-Jul-15 05:58 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Hey, I just had a "D'oh!" moment, I'm afraid. I woke up this morning realising my previous post about the chances of failure was completely wrong. You do need to multiply the chance of failure by the number of remaining disks, because you're reading the data off every one of them, and you risk losing data from any one of them.

However, I'm not sure where the 8 is coming from in your calculations. To my mind, the chance of failure on any one drive is:

amount of data read / bits per error = 1 TB / 10^14 ~ 10^12 / 10^14

or a 1 in 100 chance of failure.

So then, once one of your 14 disks fails, you have 13 left, and for raid-z you need to read the data off every single one of them to survive without errors, which means the calculation is now:

number of disks * amount of data read / bits per error

In this case approximately 13/100, or around 1 in 8 odds. So with raid-z you have around a 1 in 8 chance of the rebuild encountering *some kind* of data error.

So your odds calculations weren't far off, but the key point is that you're not calculating entire drive failure here, you're calculating the odds of having a single bit of data fail. Now that bit could be in a vital file, but it could just as easily be in an unimportant file, or even blank space.

And I can also give you the correct math for raid-z2. Keeping in mind that these figures are for a *single piece of data*, not the entire drive, the chance of raid-z2 failing during the rebuild is very small. I agree that the odds of having at least one piece of data fail during the raid-z2 rebuild are reasonably high (1 in 8), but for the rebuild to fail you need two failures in the same place, which means the calculation is for the failure rate for that particular bit, not for every bit on the drive:

number of disks / bits per error

So the chance of your raid-z2 failing during the rebuild is approximately 12 in 10^14, which I think you'll agree are much better odds :D

Ross
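A small sketch of the raid-z versus raid-z2 comparison being made here. Note that it works in bits (1 TB ~ 8 * 10^12 bits), so its single-parity number is about 8x the 1-in-100 per-drive figure above -- the bits-versus-bytes factor that comes up in the next message -- and it is only a rough approximation of this model, not an exact reliability calculation.

# Rough comparison in the spirit of the post above: single parity loses
# data on the first URE during the rebuild; double parity needs a second
# URE at the same stripe position on another surviving drive.
URE_RATE = 1e-14
BITS_PER_DRIVE = 8e12        # working in bits, i.e. including the factor of 8
SURVIVORS = 13               # 14 drives with one already failed

p_ure_per_drive = BITS_PER_DRIVE * URE_RATE               # ~0.08
p_raidz1_loss = min(1.0, SURVIVORS * p_ure_per_drive)     # ~1.0 (saturated)

# Given a URE somewhere, data loss under double parity also needs one of
# the other survivors to fail at that same bit position.
p_overlap = (SURVIVORS - 1) * URE_RATE
p_raidz2_loss = p_raidz1_loss * p_overlap

print(f"raid-z  rebuild: ~{p_raidz1_loss:.2f} chance of some unrecoverable read")
print(f"raid-z2 rebuild: ~{p_raidz2_loss:.1e} chance of an unrecoverable overlap")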
Will Murnane
2008-Jul-15 06:30 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
On Tue, Jul 15, 2008 at 01:58, Ross <myxiplx at hotmail.com> wrote:
> However, I'm not sure where the 8 is coming from in your calculations.
Bits per byte ;)

> In this case approximately 13/100 or around 1 in 8 odds.
Taking the factor of 8 into account, it's around 8 in 8.

Another possible factor to consider in calculations of this nature is that you probably won't get a single bit flipped here or there. If drives take 512-byte sectors and apply Hamming codes to those 512 bytes to get, say, 548 bytes of coded data that are actually written to disk, you need to flip (548-512)/2 = 18 bytes = 144 bits before you cannot correct them from the data you have. Thus, rather than getting one incorrect bit in a particular 4096-bit sector, you're likely to get all good sectors and one that's complete garbage. Unless the manufacturers' specifications account for this, I would say the sector error rate of the drive is about 1 in 4*(10**17). I have no idea whether they account for this or not, but it'd be interesting (and fairly doable) to test: write a 1 TB disk full of known data, then read it and verify. Then repeat until you have seen incorrect sectors a few times for a decent sample size, and store elsewhere what each sector was supposed to be and what it actually was.

Will
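The write-then-verify test described here is straightforward to outline. Everything below is illustrative: the device path is a placeholder, the pattern and chunk size are arbitrary choices, and raw-device I/O on a real system may need aligned, unbuffered access -- treat it as a sketch, not a tested tool. It will also destroy whatever is on the target device.

# Outline of the write-then-verify experiment: fill a disk with a known,
# seed-derived pattern, read it back, and record which chunks differ.
import hashlib

DEV = "/dev/rdsk/c1t0d0"     # PLACEHOLDER device -- contents will be destroyed
CHUNK = 4096                 # comparison granularity in bytes (arbitrary)

def pattern(i):
    """Deterministic per-chunk pattern so nothing has to be stored."""
    return hashlib.sha256(str(i).encode()).digest() * (CHUNK // 32)

def write_pass(dev, n_chunks):
    with open(dev, "wb") as f:
        for i in range(n_chunks):
            f.write(pattern(i))

def verify_pass(dev, n_chunks):
    bad = []
    with open(dev, "rb") as f:
        for i in range(n_chunks):
            if f.read(CHUNK) != pattern(i):
                bad.append(i)           # chunk index that came back wrong
    return bad

# Usage sketch: n = device_size_in_bytes // CHUNK
# write_pass(DEV, n); mismatches = verify_pass(DEV, n)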
Ross Smith
2008-Jul-15 08:38 UTC
[zfs-discuss] FW: please help with raid / failure / rebuild calculations
Bits vs bytes... D'oh! again. It's a good job I don't do these calculations professionally. :-)

> On Tue, Jul 15, 2008 at 01:58, Ross <myxiplx at hotmail.com> wrote:
>> However, I'm not sure where the 8 is coming from in your calculations.
> Bits per byte ;)
Richard Elling
2008-Jul-15 16:23 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Will Murnane wrote:
> [...]
> Unless the manufacturers' specifications account for this, I would say
> the sector error rate of the drive is about 1 in 4*(10**17). I have no
> idea whether they account for this or not, but it'd be interesting (and
> fairly doable) to test: write a 1 TB disk full of known data, then read
> it and verify.

The specification is for unrecoverable reads per bits read. I think most people expect this to be as delivered to the host, which is how we count them. I would expect many, many more recoverable read events.

You can also adjust by the amount of space used in ZFS and the number of copies of the data.
-- richard
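The adjustment mentioned in the last line scales the exposure linearly with how much of the pool is actually allocated. A tiny sketch with the same assumed numbers as earlier in the thread:

# ZFS resilvers only allocated blocks, so scale the expected URE count
# by pool utilization.  (Extra redundancy -- a second parity device or
# copies=2 -- then determines whether a URE found this way is fatal.)
URE_RATE = 1e-14
BITS_PER_DRIVE = 8e12
SURVIVORS = 13

def expected_ures(utilization):
    """Expected UREs during a resilver of a pool that is `utilization` full."""
    return SURVIVORS * BITS_PER_DRIVE * utilization * URE_RATE

for used in (1.0, 0.5, 0.25):
    print(f"{used:>4.0%} full: ~{expected_ures(used):.2f} expected UREs")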