Matthew Angelo
2011-Feb-07 02:45 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
I require a new high-capacity 8-disk zpool. The disks I will be purchasing (Samsung or Hitachi) are 2TB with a non-recoverable read error rate of 1 in 10^14 bits. I'm staying clear of WD because their new drives use 4096-byte (Advanced Format) sectors, which don't play nicely with ZFS at the moment.

My question is, how do I determine which of the following zpool and vdev configurations I should run to maximize space whilst mitigating rebuild failure risk?

1. 2x RAIDZ (3+1) vdevs
2. 1x RAIDZ (7+1) vdev
3. 1x RAIDZ2 (7+1) vdev

I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x 2TB disks.

Cheers
Ian Collins
2011-Feb-07 04:18 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On 02/07/11 03:45 PM, Matthew Angelo wrote:
> I require a new high-capacity 8-disk zpool. The disks I will be
> purchasing (Samsung or Hitachi) are 2TB with a non-recoverable read
> error rate of 1 in 10^14 bits.
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?
>
> 1. 2x RAIDZ (3+1) vdevs
> 2. 1x RAIDZ (7+1) vdev
> 3. 1x RAIDZ2 (7+1) vdev

I assume 3 was meant to be 6+2.

A bigger issue than drive error rates is how long a new 2TB drive will take to resilver if one dies. How long are you willing to run without redundancy in your pool?

--
Ian.
Edward Ned Harvey
2011-Feb-07 04:48 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Matthew Angelo
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?
>
> 1. 2x RAIDZ (3+1) vdevs
> 2. 1x RAIDZ (7+1) vdev
> 3. 1x RAIDZ2 (6+2) vdev
>
> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
> 2TB disks.

(Corrected the typo to 6+2 for you.)

Sounds like you have made up your mind already. Nothing wrong with that. You are apparently uncomfortable running with only one disk's worth of redundancy. There is nothing fundamentally wrong with the raidz1 configuration, but the probability of failure is obviously higher.

The question is how you calculate that probability. Because if we're talking about 5e-21 versus 3e-19, you probably don't care about the difference; they're both essentially zero. Well, there's no good answer to that.

With the cited bit error rate, you are only representing the probability of a bit error. You are not representing the probability of a failed drive, and you are not representing the probability of a drive failure within a specified time window. What you really care about is the probability of two drives (or three drives) failing concurrently, in which case you need to model the probability of any one drive failing within a specified time window. And even if you want to model that probability, in reality it isn't linear. The probability of a drive failing between 1yr and 1yr+3hrs is smaller than the probability of the drive failing between 3yr and 3yr+3hrs, because after three years the failure rate will be higher. So after three years, the probability of multiple simultaneous failures is higher.

I recently saw some Seagate data sheets which specified the annual disk failure rate to be 0.3%. Again, this is a linear model representing a nonlinear reality.

Suppose one disk fails: how many weeks does it take to get a replacement onsite under the 3yr limited mail-in warranty?

But then again, after three years you're probably considering this your antique hardware, and all the stuff you care about is on a newer server. Etc.

There's no good answer to your question. You are obviously uncomfortable with a single disk's worth of redundancy. Go with your gut. Sleep well at night. It only costs you $100. You probably have a cell phone with no backups worth more than that in your pocket right now.
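A minimal Python sketch of the time-window calculation described above, assuming a constant (exponential) failure rate, which is exactly the linear simplification cautioned against; the 0.3% annual failure rate and the 48-hour resilver window are placeholder assumptions:

```python
# Sketch: probability of additional drive failures during a resilver window,
# assuming a constant failure rate and independent drives. Real drives fail
# at a rate that rises with age, so treat the numbers as rough estimates.
import math

def p_fail_within(hours, afr=0.003):
    """P(one drive fails within 'hours') for an annual failure rate 'afr'
    (0.003 = 0.3%/year), assuming exponentially distributed lifetimes."""
    rate_per_hour = -math.log(1.0 - afr) / 8760.0   # 8760 hours per year
    return 1.0 - math.exp(-rate_per_hour * hours)

def p_at_least_k_of_n(n, k, p):
    """P(at least k of n independent drives fail), each with probability p."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i)
               for i in range(k, n + 1))

if __name__ == "__main__":
    p1 = p_fail_within(hours=48)   # assumed 2-day exposure window
    # raidz1 (7+1): after one failure, any of the 7 survivors failing loses data.
    print("raidz1: P(>=1 more of 7 fails) = %.2e" % p_at_least_k_of_n(7, 1, p1))
    # raidz2 (6+2): after one failure, two of the 7 survivors must also fail.
    print("raidz2: P(>=2 more of 7 fail) = %.2e" % p_at_least_k_of_n(7, 2, p1))
```

The point of the exercise is the ratio between the two results rather than their absolute values: the double-parity case comes out several orders of magnitude lower even when both numbers look tiny.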
Matthew Angelo
2011-Feb-07 06:22 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
Yes, I did mean 6+2; thank you for fixing the typo.

I'm actually leaning more towards running a simple 7+1 RAIDZ1. Running this with 1TB disks is not a problem; I just wanted to investigate at what disk size the scales would tip. I understand RAIDZ2 protects against failures during the rebuild process. Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks, assuming a worst case of 2 days, that is my 'exposure' time.

For example, I would hazard a confident guess that 7+1 RAIDZ1 with 6TB drives wouldn't be a smart idea. I'm just trying to extrapolate down.

I will be running a hot (or maybe cold) spare, so I don't need to factor in the time it takes for a manufacturer to replace the drive.
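To make the "where do the scales tip" question concrete, here is a minimal sketch of the probability of hitting at least one unrecoverable read error while reading the surviving disks during a 7+1 RAIDZ1 rebuild, using the 1-in-10^14-bit error rate quoted earlier and assuming, as a worst case, a full pool whose surviving disks are read end to end:

```python
# Sketch: chance of at least one unrecoverable read error (URE) while
# resilvering a 7+1 raidz1 vdev, assuming a full pool and independent bit
# errors at the quoted rate of 1 error per 1e14 bits read.
import math

def p_ure_during_rebuild(disk_tb, surviving_disks=7, ber=1e-14):
    bits_read = surviving_disks * disk_tb * 1e12 * 8   # decimal TB -> bits
    return 1.0 - math.exp(-ber * bits_read)            # P(at least one URE)

if __name__ == "__main__":
    for tb in (1, 2, 3, 4, 6):
        print("%d TB disks: P(>=1 URE during rebuild) ~ %4.0f%%"
              % (tb, 100 * p_ure_during_rebuild(tb)))
```

Because ZFS resilvers only allocated data, and a single URE costs one block rather than the whole vdev, these figures overstate the practical risk; they are mostly useful for comparing disk sizes against each other.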
Richard Elling
2011-Feb-07 07:01 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 6, 2011, at 6:45 PM, Matthew Angelo wrote:
> I require a new high-capacity 8-disk zpool. The disks I will be
> purchasing (Samsung or Hitachi) are 2TB with a non-recoverable read
> error rate of 1 in 10^14 bits.
>
> My question is, how do I determine which of the following zpool and
> vdev configurations I should run to maximize space whilst mitigating
> rebuild failure risk?

The MTTDL[2] model will work.
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

As described, this model doesn't scale well for N > 3 or 4, but it will get you in the ballpark. You will also need to know the MTBF from the data sheet, but if you don't have that info, that is OK, because you are asking the right question: given a single drive type, what is the best configuration for preventing data loss?

Finally, to calculate the raidz2 result, you need to know the mean time to recovery (MTTR), which includes the logistical replacement time and the resilver time. Basically, the model calculates the probability of a data loss event during reconstruction. This is different for ZFS than for most other LVMs, because ZFS resilvers only data, and the total data <= disk size.

> 1. 2x RAIDZ (3+1) vdevs
> 2. 1x RAIDZ (7+1) vdev
> 3. 1x RAIDZ2 (7+1) vdev
>
> I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
> 2TB disks.

Double parity will win over single parity. Intuitively, when you add parity you multiply by the MTBF; when you add disks to a set, you change the denominator by a few digits. Obviously multiplication is a good thing, division not so much. In short, raidz2 is the better choice.
 -- richard
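For readers who do not want to work through the linked post, the sketch below shows the general shape of MTTDL-style estimates for single and double parity; the exact formulas in the linked model may differ, and the MTBF, MTTR and error-rate values here are placeholders to be replaced with data-sheet numbers:

```python
# Sketch of textbook MTTDL approximations for an N-disk raidz1/raidz2 vdev,
# assuming constant failure rates and independent failures. Substitute real
# data-sheet MTBF and a realistic MTTR (logistics + resilver time).
HOURS_PER_YEAR = 8760.0

def mttdl_raidz1(n, mtbf_h, mttr_h):
    # Single parity: data is lost if a second disk fails during the MTTR
    # window that follows the first failure.
    return mtbf_h ** 2 / (n * (n - 1) * mttr_h)

def mttdl_raidz2(n, mtbf_h, mttr_h):
    # Double parity: a third failure must land inside two recovery windows.
    return mtbf_h ** 3 / (n * (n - 1) * (n - 2) * mttr_h ** 2)

def expected_ure_per_disk_read(data_tb, ber=1e-14):
    # Expected unrecoverable read errors per disk's worth of data read back,
    # the ingredient that MTTDL[2]-style models add on top of MTBF and MTTR.
    return data_tb * 1e12 * 8 * ber

if __name__ == "__main__":
    n, mtbf, mttr = 8, 1_000_000.0, 48.0   # assumed: 1M-hour MTBF, 2-day MTTR
    print("raidz1 MTTDL ~ %.0f years" % (mttdl_raidz1(n, mtbf, mttr) / HOURS_PER_YEAR))
    print("raidz2 MTTDL ~ %.0f years" % (mttdl_raidz2(n, mtbf, mttr) / HOURS_PER_YEAR))
    print("expected UREs per 2 TB read ~ %.2f" % expected_ure_per_disk_read(2.0))
```

Whatever the exact constants, the structural point above holds: adding parity multiplies in another factor of MTBF, while adding disks only grows the denominator.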
Sandon Van Ness
2011-Feb-07 13:23 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
I think the risk of data integrity problems and complete volume loss is highest in the following order:

1. 1x RAIDZ (7+1)
2. 2x RAIDZ (3+1)
3. 1x RAIDZ2 (6+2)

Simple raidz certainly is an option with only 8 disks (8 is about the maximum I would go), but to be honest I would feel safer going raidz2. The 2x raidz (3+1) would probably perform the best, but I would prefer the third option (raidz2), as it is better for redundancy. With raidz2, any two disks can fail, and if you hit some unrecoverable read errors during a scrub you have a much better chance of avoiding corruption, thanks to the second copy of parity covering the same data.
Peter Jeremy
2011-Feb-07 21:07 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers at gmail.com> wrote:
> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
> Running this with 1TB disks is not a problem; I just wanted to
> investigate at what disk size the scales would tip.

It's not that simple. Whilst resilver time is proportional to device size, it's far more impacted by the degree of fragmentation of the pool. And there's no 'tipping point' - it's a gradual slope, so it's really up to you to decide where you want to sit on the probability curve.

> I understand RAIDZ2 protects against failures during the rebuild process.

This would be its current primary purpose.

> Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with
> 2TB disks, assuming a worst case of 2 days, that is my 'exposure' time.

Unless this is a write-once pool, you can probably also assume that your pool will get more fragmented over time, so by the time your pool reaches twice its current capacity, it might well take 3 days to rebuild due to the additional fragmentation.

One point I haven't seen mentioned elsewhere in this thread is that all the calculations so far have assumed that drive failures are independent. In practice, this probably isn't true. All HDD manufacturers have their "off" days - where whole batches or models of disks are cr*p and fail unexpectedly early. The WD EARS is simply a demonstration that it's WD's turn to turn out junk. Your best protection against this is to have disks from enough different batches that a batch failure won't take out your pool. PSU, fan and SATA controller failures are likely to take out multiple disks, but it's far harder to include enough redundancy to handle these, and your best approach is probably to have good backups.

> I will be running a hot (or maybe cold) spare, so I don't need to
> factor in the time it takes for a manufacturer to replace the drive.

In which case, the question is more whether an 8-way RAIDZ1 with a hot spare (7+1+1) is better than a 9-way RAIDZ2 (7+2). In the latter case, your "hot spare" is already part of the pool, so you don't lose the time-to-notice plus time-to-resilver before regaining redundancy. The downside is that actively using the "hot spare" may increase the probability of it failing.

--
Peter Jeremy
Richard Elling
2011-Feb-08 00:53 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:
> On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers at gmail.com> wrote:
>> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>> Running this with 1TB disks is not a problem; I just wanted to
>> investigate at what disk size the scales would tip.
>
> It's not that simple. Whilst resilver time is proportional to device
> size, it's far more impacted by the degree of fragmentation of the
> pool. And there's no 'tipping point' - it's a gradual slope, so it's
> really up to you to decide where you want to sit on the probability
> curve.

The "tipping point" won't occur between similar configurations. The tip occurs between different configurations. In particular, if the size of the N+M parity scheme is very large and the resilver times become very, very long (weeks), then an (M-1)-way mirror scheme can provide better performance and dependability. But I consider these to be extreme cases.

>> I understand RAIDZ2 protects against failures during the rebuild process.
>
> This would be its current primary purpose.
>
>> Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with
>> 2TB disks, assuming a worst case of 2 days, that is my 'exposure' time.
>
> Unless this is a write-once pool, you can probably also assume that
> your pool will get more fragmented over time, so by the time your
> pool reaches twice its current capacity, it might well take 3 days
> to rebuild due to the additional fragmentation.
>
> One point I haven't seen mentioned elsewhere in this thread is that
> all the calculations so far have assumed that drive failures are
> independent. In practice, this probably isn't true. All HDD
> manufacturers have their "off" days - where whole batches or models of
> disks are cr*p and fail unexpectedly early. The WD EARS is simply a
> demonstration that it's WD's turn to turn out junk. Your best
> protection against this is to have disks from enough different batches
> that a batch failure won't take out your pool.

The problem with treating failures as correlated is that you cannot get the failure rate information from the vendors. You could guess, or use your own data, but it would not always help you make a better design decision.

> PSU, fan and SATA controller failures are likely to take out multiple
> disks, but it's far harder to include enough redundancy to handle these,
> and your best approach is probably to have good backups.

The top four items that fail most often, in no particular order, are: fans, power supplies, memory, and disks. This is why you will see enterprise-class servers use redundant fans, multiple high-quality power supplies, ECC memory, and some sort of RAID.

>> I will be running a hot (or maybe cold) spare, so I don't need to
>> factor in the time it takes for a manufacturer to replace the drive.
>
> In which case, the question is more whether an 8-way RAIDZ1 with a
> hot spare (7+1+1) is better than a 9-way RAIDZ2 (7+2).

In this case, raidz2 is much better for dependability because the "spare" is already "resilvered." It also performs better, though the dependability gains tend to be bigger than the performance gains.

> In the latter case, your "hot spare" is already part of the pool, so
> you don't lose the time-to-notice plus time-to-resilver before regaining
> redundancy. The downside is that actively using the "hot spare" may
> increase the probability of it failing.

No. The disk failure rate data does not conclusively show that activity causes premature failure. Other failure modes dominate.
 -- richard
Paul Kraus
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Mon, Feb 7, 2011 at 7:53 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:
>> On 2011-Feb-07 14:22:51 +0800, Matthew Angelo <bangers at gmail.com> wrote:
>>> I'm actually leaning more towards running a simple 7+1 RAIDZ1.
>>> Running this with 1TB disks is not a problem; I just wanted to
>>> investigate at what disk size the scales would tip.
>>
>> It's not that simple. Whilst resilver time is proportional to device
>> size, it's far more impacted by the degree of fragmentation of the
>> pool. And there's no 'tipping point' - it's a gradual slope, so it's
>> really up to you to decide where you want to sit on the probability
>> curve.
>
> The "tipping point" won't occur between similar configurations. The tip
> occurs between different configurations. In particular, if the size of
> the N+M parity scheme is very large and the resilver times become
> very, very long (weeks), then an (M-1)-way mirror scheme can provide
> better performance and dependability. But I consider these to be
> extreme cases.

Empirically, it seems that resilver time is related to the number of objects as much as (if not more than) the amount of data: zpools (mirrors) with similar amounts of data but radically different numbers of objects take very different amounts of time to resilver. I have NOT (yet) started actually measuring and tracking this; the above is based on casual observation.

P.S. I am measuring the number of objects via `zdb -d`, as that is faster than trying to count files and directories, and I expect it is a much better measure of what the underlying ZFS code is dealing with (a particular dataset may have lots of snapshot data that does not (easily) show up).

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
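For anyone who wants to do the same bookkeeping, a small sketch that tallies per-dataset object counts from `zdb -d` output follows; it assumes each dataset line ends with ", <N> objects", and since zdb output is not a stable interface the parsing may need adjusting for other releases (the default pool name "tank" is only an example):

```python
# Sketch: sum per-dataset object counts reported by `zdb -d <pool>`.
# Assumes dataset lines of the form "Dataset <name> ..., <N> objects";
# zdb output is not a stable interface, so verify the format first.
import re
import subprocess
import sys

def object_counts(pool):
    out = subprocess.run(["zdb", "-d", pool], capture_output=True,
                         text=True, check=True).stdout
    counts = {}
    for line in out.splitlines():
        m = re.search(r"Dataset\s+(\S+).*?(\d+)\s+objects", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

if __name__ == "__main__":
    pool = sys.argv[1] if len(sys.argv) > 1 else "tank"   # example pool name
    counts = object_counts(pool)
    for ds, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%12d  %s" % (n, ds))
    print("%12d  TOTAL" % sum(counts.values()))
```

Tracking these totals alongside resilver times over a few disk replacements would be one way to turn the casual observation above into measured data.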
Nico Williams
2011-Feb-14 13:12 UTC
[zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 14, 2011 6:56 AM, "Paul Kraus" <paul at kraus-haus.org> wrote:
> P.S. I am measuring the number of objects via `zdb -d`, as that is faster
> than trying to count files and directories, and I expect it is a much
> better measure of what the underlying ZFS code is dealing with (a
> particular dataset may have lots of snapshot data that does not (easily)
> show up).

It's faster because: a) no atime updates, b) no ZPL overhead.

Nico
--