User Name
2008-Jul-11  04:14 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
I am building a 14 disk raid 6 array with 1 TB seagate AS (non-enterprise) drives.

So there will be 14 disks total, 2 of them will be parity, 12 TB space available.

My drives have a BER of 10^14.

I am quite scared by my calculations - it appears that if one drive fails, and I do a rebuild, I will perform:

13*8*10^12 = 104000000000000

reads. But my BER is smaller:

10^14 = 100000000000000

So I am (theoretically) guaranteed to lose another drive on raid rebuild. Then the calculation for _that_ rebuild is:

12*8*10^12 = 96000000000000

So no longer guaranteed, but 96% isn't good.

I have looked all over, and these seem to be the accepted calculations - which means if I ever have to rebuild, I'm toast.

But here is the question - the part I am having trouble understanding:

The 13*8*10^12 operations required for the first rebuild .... isn't that the number for _the entire array_ ? Any given 1 TB disk only has 10^12 bits on it _total_. So why would I ever do more than 10^12 operations on the disk ?

It seems very odd to me that a raid controller would have to access any given bit more than once to do a rebuild ... and the total number of bits on a drive is 10^12, which is far below the 10^14 BER number.

So I guess my question is - why are we all doing this calculation, wherein we apply the total operations across an entire array rebuild to a single drive's BER number ?

Thanks.

This message posted from opensolaris.org
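
[Editor's note] The post compares the number of bits read with 10^14 and reads the ratio as a percentage. A minimal sketch of the usual way to turn those two numbers into a probability follows. It assumes the spec means one unrecoverable read error per 10^14 bits read, and that bit errors are independent, which is a simplification. The function name is illustrative only.

```python
import math

def p_at_least_one_ure(bits_read: float, ber: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error (URE).

    Assumes each bit read fails independently with probability `ber`.
    Computes 1 - (1 - ber)**bits_read in a numerically stable way.
    """
    return -math.expm1(bits_read * math.log1p(-ber))

BITS_PER_DRIVE = 8 * 10**12  # 1 TB drive, 8 bits per byte

# First rebuild reads the 13 surviving drives, the second reads 12.
first_rebuild = p_at_least_one_ure(13 * BITS_PER_DRIVE)
second_rebuild = p_at_least_one_ure(12 * BITS_PER_DRIVE)

print(f"first rebuild:  {first_rebuild:.3f}")
print(f"second rebuild: {second_rebuild:.3f}")
```

Under these assumptions the two rebuilds come out at roughly 65% and 62%, not 100% and 96%, and the figure is the chance of hitting at least one unreadable sector, not the chance of losing a whole drive.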
Richard Elling
2008-Jul-11  05:47 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
User Name wrote:
> I am building a 14 disk raid 6 array with 1 TB seagate AS (non-enterprise) drives.
>
> So there will be 14 disks total, 2 of them will be parity, 12 TB space available.
>
> My drives have a BER of 10^14.
>
> I am quite scared by my calculations - it appears that if one drive fails, and I do a rebuild, I will perform:
>
> 13*8*10^12 = 104000000000000
>
> reads. But my BER is smaller:
>
> 10^14 = 100000000000000
>
> So I am (theoretically) guaranteed to lose another drive on raid rebuild. Then the calculation for _that_ rebuild is:
>
> 12*8*10^12 = 96000000000000
>
> So no longer guaranteed, but 96% isn't good.
>
> I have looked all over, and these seem to be the accepted calculations - which means if I ever have to rebuild, I'm toast.

If you were using RAID-5, you might be concerned. For RAID-6, or at least raidz2, you could recover an unrecoverable read during the rebuild of one disk.

> But here is the question - the part I am having trouble understanding:
>
> The 13*8*10^12 operations required for the first rebuild .... isn't that the number for _the entire array_ ? Any given 1 TB disk only has 10^12 bits on it _total_. So why would I ever do more than 10^12 operations on the disk ?

Actually, ZFS only rebuilds the data. So you need to multiply by the space utilization of the pool, which will usually be less than 100%.

> It seems very odd to me that a raid controller would have to access any given bit more than once to do a rebuild ... and the total number of bits on a drive is 10^12, which is far below the 10^14 BER number.
>
> So I guess my question is - why are we all doing this calculation, wherein we apply the total operations across an entire array rebuild to a single drive's BER number ?

You might also be interested in this blog:
http://blogs.zdnet.com/storage/?p=162

A couple of things seem to be at work here. I study field data failure rates. We tend to see unrecoverable read failure rates at least an order of magnitude better than the specifications. This is a good thing, but simply points out that the specifications are often sand-bagged -- they are not a guarantee. However, you are quite right in your intuition that if you have a lot of bits of data, then you need to pay attention to the bit-error rate (BER) of unrecoverable reads on disks. This sort of model can be used to determine a mean time to data loss (MTTDL) as I explain here:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

Perhaps it would help if we changed the math to show the risk as a function of the amount of data given the protection scheme? hmmm.... something like probability of data loss per year for N TBytes with configuration XYZ. Would that be more useful for evaluating configurations?

-- richard
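
[Editor's note] The two adjustments mentioned in this reply, pool space utilization and a field error rate better than the spec, can be put into numbers. This is a rough sketch only: it assumes independent errors (a Poisson approximation), the function name is made up for illustration, and the 50% utilization and the exact factor of ten are example values, not measurements. The utilization factor applies to a ZFS resilver, which only reads allocated data.

```python
import math

def ure_risk(drives_read, drive_bytes, utilization, bits_per_error):
    """Return (expected URE count, probability of at least one URE).

    drives_read    -- surviving drives that must be read during the rebuild
    drive_bytes    -- capacity of each drive in bytes
    utilization    -- fraction of the pool actually holding data (0..1)
    bits_per_error -- bits read per unrecoverable error, e.g. 1e14
    """
    bits_read = drives_read * drive_bytes * 8 * utilization
    expected = bits_read / bits_per_error
    # Poisson approximation: P(at least one) = 1 - exp(-expected)
    return expected, 1.0 - math.exp(-expected)

# Spec-sheet rate, pool half full (example value).
half_full = ure_risk(13, 1e12, 0.5, 1e14)
# Full pool, but a rate ten times better than the spec sheet (example value).
field_rate = ure_risk(13, 1e12, 1.0, 1e15)

print("half full, spec rate:   expected %.3f, probability %.3f" % half_full)
print("full, 10x better rate:  expected %.3f, probability %.3f" % field_rate)
```

Both adjustments lower the estimate considerably, which is one reason the spec-sheet arithmetic is better read as a pessimistic bound than as a prediction.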
Ross
2008-Jul-11  07:25 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Without checking your math, I believe you may be confusing the risk of *any* data corruption with the risk of a total drive failure, but I do agree that the calculation should just be for the data on the drive, not the whole array.

My feeling on this from the various analyses I've read on the web is that you're reasonably likely to find some corruption on a drive during a rebuild, but raid-6 protects you from this nicely. From memory, I think the stats were something like a 5% chance of an error on a 500GB drive, which would mean something like a 10% chance with your 1TB drives. That would tie in with your figures if you took out the multiplier for the whole raid's data. Instead of a guaranteed failure, you've calculated around 1 in 10 odds.

So, during any rebuild you've around a 1 in 10 chance of the rebuild encountering *some* corruption, but that's very likely going to be just a few bits of data, which can be easily recovered using raid-6, and the rest of the rebuild can carry on as normal.

Of course there's always a risk of a second drive failing, which is why we have backups, but I believe that risk is minuscule in comparison, and also offset by the ability to regularly scrub your data, which helps to ensure that any problems with drives are caught early on. Early replacement of failing drives means it's far less likely that you'll ever have two fail together.
User Name
2008-Jul-11  12:30 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Hello relling,

Thanks for your comments. FWIW, I am building an actual hardware array, so even though I _may_ put ZFS on top of the hardware array's 22TB "drive" that the OS sees (I may not) I am focusing purely on the controller rebuild.

So, setting aside ZFS for the moment, am I still correct in my intuition that there is no way a _controller_ needs to touch a disk more times than there are bits on the entire disk, and that this calculation people are doing is faulty ?

I will check out that blog - thanks.
Richard Elling
2008-Jul-11  16:06 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
User Name wrote:
> Hello relling,
>
> Thanks for your comments. FWIW, I am building an actual hardware array, so even though I _may_ put ZFS on top of the hardware array's 22TB "drive" that the OS sees (I may not) I am focusing purely on the controller rebuild.
>
> So, setting aside ZFS for the moment, am I still correct in my intuition that there is no way a _controller_ needs to touch a disk more times than there are bits on the entire disk, and that this calculation people are doing is faulty ?

I think the calculation is correct, at least for the general case. At FAST this year there was an interesting paper which tried to measure this exposure in a large field sample by using checksum verifications. I like this paper and it validates what we see in the field -- the most common failure mode is unrecoverable read.
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

I should also point out that ZFS is already designed to offer some diversity which should help guard against spatially clustered media failures. hmmm... another blog topic in my queue...

-- richard
Akhilesh Mritunjai
2008-Jul-11  17:18 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
> Thanks for your comments. FWIW, I am building an
> actual hardware array, so even though I _may_ put ZFS
> on top of the hardware array's 22TB "drive" that the
> OS sees (I may not) I am focusing purely on the
> controller rebuild.

Not letting ZFS handle (at least one level of) redundancy is a bad idea. Don't do that!
Bob Friesenhahn
2008-Jul-11  21:05 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
On Fri, 11 Jul 2008, Akhilesh Mritunjai wrote:
>> Thanks for your comments. FWIW, I am building an
>> actual hardware array, so even though I _may_ put ZFS
>> on top of the hardware array's 22TB "drive" that the
>> OS sees (I may not) I am focusing purely on the
>> controller rebuild.
>
> Not letting ZFS handle (at least one level of) redundancy is a bad
> idea. Don't do that!

Agreed. A further issue to consider is mean time to recover/restore. This has quite a lot to do with actual uptime. For example, if you decide to create two huge 22TB LUNs and mirror across them, if ZFS needs to resilver one of the LUNs it will take a *long* time. A good design will try to keep any storage area which needs to be resilvered small enough that it may be restored quickly and risk of secondary failure is minimized.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Ross
2008-Jul-15  05:58 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Hey, I just had a "D'oh!" moment I'm afraid, woke up this morning realising my previous post about the chances of failure was completely wrong.

You do need to multiply the chance of failure by the number of remaining disks, because you're reading the data of every one of them, and you risk losing data from any one of them.

However, I'm not sure where the 8 is coming from in your calculations. To my mind, the chance of failure on any one drive is:

amount of data reads / chance of failure
= 1TB / 10^14
~ 10^12 / 10^14

or a 1 in 100 chance of failure.

So then, once one of your 14 disks fails, you have 13 left, and for raid-z you need to read the data of every single one of them to survive without errors, which means the calculation is now:

no of disks * amount of data reads / chance of failure

In this case approximately 13/100 or around 1 in 8 odds.

So with raid-z you have around a 1 in 8 chance of *some kind* of data error during the rebuild of the raid. So your odds calculations weren't far off, but the key point is that you're not calculating entire drive failure here, you're calculating the odds of having a single bit of data fail. Now that bit could be in a vital file, but it could just as easily be in an unimportant file, or even blank space.

And I can also give you the correct math for raid-z2. Keeping in mind that these figures are for a *single piece of data*, not the entire drive, the chance of raid-z2 failing during the rebuild is very small. I agree that the odds of having at least one piece of data fail during the raid-z2 rebuild are reasonably high (1 in 8), but for the rebuild to fail, you need two failures in the same place, which means the calculation is for the failure rate for that particular bit, not for every bit on the drive:

no of disks / chance of failure

So the chance of your raid-z2 failing during the rebuild is approximately 12 in 10^14. Which I think you'll agree are much better odds :D

Ross
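
[Editor's note] The per-drive and per-array figures in this post can be reproduced in a few lines, which also shows how sensitive they are to the unit. The sketch assumes the drive spec is one error per 10^14 bits read and computes the expected number of errors (reads divided by reads-per-error), which is what the formulas in the post amount to. The variable names are illustrative.

```python
READS_PER_ERROR = 10**14   # drive spec, assumed to be counted in bits
DRIVE_BYTES = 10**12       # 1 TB drive
REMAINING_DISKS = 13       # disks read after one of 14 has failed

# As written in the post: 10^12 reads per drive (a count of bytes).
per_drive_bytes = DRIVE_BYTES / READS_PER_ERROR        # 1 in 100
array_bytes = REMAINING_DISKS * per_drive_bytes        # 13 in 100, about 1 in 8

# Counting bits instead, to match the unit of the spec.
per_drive_bits = 8 * DRIVE_BYTES / READS_PER_ERROR     # 8 in 100
array_bits = REMAINING_DISKS * per_drive_bits          # 104 in 100

print(f"counting bytes: {per_drive_bytes:.2f} per drive, {array_bytes:.2f} per rebuild")
print(f"counting bits:  {per_drive_bits:.2f} per drive, {array_bits:.2f} per rebuild")
```

Counted in bits, the expected number of errors over the rebuild is about one. That is an expected count and not a probability, so it does not by itself mean an error is certain.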
Will Murnane
2008-Jul-15  06:30 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
On Tue, Jul 15, 2008 at 01:58, Ross <myxiplx at hotmail.com> wrote:
> However, I'm not sure where the 8 is coming from in your calculations.

Bits per byte ;)

> In this case approximately 13/100 or around 1 in 8 odds.

Taking into account the factor 8, and it's around 8 in 8.

Another possible factor to consider in calculations of this nature is that you probably won't get a single bit flipped here or there. If drives take 512-byte sectors and apply Hamming codes to those 512 bytes to get, say, 548 bytes of coded data that are actually written to disk, you need to flip (548-512)/2=16 bytes = 128 bits before you cannot correct them from the data you have. Thus, rather than getting one incorrect bit in a particular 4096-bit sector, you're likely to get all good sectors and one that's complete garbage. Unless the manufacturers' specifications account for this, I would say the sector error rate of the drive is about 1 in 4*(10**17).

I have no idea whether they account for this or not, but it'd be interesting (and fairly doable) to test. Write a 1TB disk full of known data, then read it and verify. Then repeat until you have seen incorrect sectors a few times for a decent sample size, and store elsewhere what the sector was supposed to be and what it actually was.

Will
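
[Editor's note] The write-then-verify test described above could be sketched as below. This is an illustration under stated assumptions, not a tested burn-in tool: the helper names are made up, the pattern generator is one arbitrary choice, and it operates on an ordinary file. A real test would need to target the raw device and bypass the operating system's cache, otherwise the read pass may be served from memory instead of the platters.

```python
import hashlib
import os

SECTOR = 512  # bytes per sector, as assumed in the post

def pattern(sector_no):
    """Deterministic 512-byte pattern, so expected data can be regenerated."""
    out = b""
    counter = 0
    while len(out) < SECTOR:
        out += hashlib.sha256(f"{sector_no}:{counter}".encode()).digest()
        counter += 1
    return out[:SECTOR]

def write_known(path, n_sectors):
    """Fill the target with known data, one sector at a time."""
    with open(path, "wb") as f:
        for i in range(n_sectors):
            f.write(pattern(i))
        f.flush()
        os.fsync(f.fileno())  # push the data out of the process and OS buffers

def verify(path, n_sectors):
    """Return (sector number, expected, actual) for every bad sector."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_sectors):
            actual = f.read(SECTOR)
            expected = pattern(i)
            if actual != expected:
                bad.append((i, expected, actual))
    return bad
```

Keeping both the expected and the actual contents of each bad sector, as the post suggests, would show whether failures look like single flipped bits or whole garbage sectors.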
Ross Smith
2008-Jul-15  08:38 UTC
[zfs-discuss] FW: please help with raid / failure / rebuild calculations
bits vs bytes.... D'oh! again. It's a good job I don't do these calculations professionally. :-)
Richard Elling
2008-Jul-15  16:23 UTC
[zfs-discuss] please help with raid / failure / rebuild calculations
Will Murnane wrote:
> On Tue, Jul 15, 2008 at 01:58, Ross <myxiplx at hotmail.com> wrote:
>> However, I'm not sure where the 8 is coming from in your calculations.
>
> Bits per byte ;)
>
>> In this case approximately 13/100 or around 1 in 8 odds.
>
> Taking into account the factor 8, and it's around 8 in 8.
>
> Another possible factor to consider in calculations of this nature is
> that you probably won't get a single bit flipped here or there. If
> drives take 512-byte sectors and apply Hamming codes to those 512
> bytes to get, say, 548 bytes of coded data that are actually written
> to disk, you need to flip (548-512)/2=16 bytes = 128 bits before you
> cannot correct them from the data you have. Thus, rather than getting
> one incorrect bit in a particular 4096-bit sector, you're likely to
> get all good sectors and one that's complete garbage. Unless the
> manufacturers' specifications account for this, I would say the sector
> error rate of the drive is about 1 in 4*(10**17). I have no idea
> whether they account for this or not, but it'd be interesting (and
> fairly doable) to test. Write a 1TB disk full of known data, then
> read it and verify. Then repeat until you have seen incorrect sectors
> a few times for a decent sample size, and store elsewhere what the
> sector was supposed to be and what it actually was.

The specification is for unrecoverable reads per bits read. I think most people expect this to be as delivered to host, which is how we count them. I would expect many, many more recoverable read events.

You can also adjust by the amount of space used in ZFS and the number of copies of the data.

-- richard