Most discussions I have seen about RAID 5/6 and why it stops "working" seem to base their conclusions solely on single-drive characteristics and statistics. It seems to me there is a missing component in the discussion of drive failures: the real-world context of a system that lives in an environment shared by all the system components. For instance, the video of disks slowing down when they are yelled at is a good visual example of the negative effect of vibration on drives: http://www.youtube.com/watch?v=tDacjrSCeq4

I thought the Google and CMU papers talked about a surprisingly high (higher than expected) rate of multiple failures of drives "nearby" each other, but I couldn't find it when I re-skimmed the papers just now.

What are people's experiences with multiple drive failures? Given that we often use same-brand/model/batch drives (even though we are not supposed to), the same enclosure, the same rack, etc. for a given RAID 5/6/z1/z2/z3 system, should we be paying more attention to harmonics, vibration/isolation, and non-intuitive system-level statistics that might be inducing close-proximity drive failures, rather than just throwing more parity drives at the problem? What if our enclosure and environmental factors push the system-level odds of multiple drive failures so far beyond the single-drive failure statistics everyone uses that they essentially negate the positive effect of adding parity drives?

I realize this issue is not usually addressed because there is too much variability in environments, etc., but I thought it would be interesting to see if anyone has experienced much in the way of multiple drive failures in close time proximity.
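A rough way to put numbers on that last question is a minimal Monte Carlo sketch like the one below. The scenario and every rate in it are assumptions chosen only for illustration: an 8-disk raidz2 vdev, a 3% annual failure probability per drive, a 0.5% annual chance of a shared enclosure/power event, and a 50% chance that any given drive dies when such an event happens.

    import random

    N_DISKS = 8           # disks in one raidz2 vdev (assumed)
    PARITY = 2            # raidz2 survives two failed disks
    P_DRIVE = 0.03        # assumed annual failure probability per drive
    P_EVENT = 0.005       # assumed annual chance of a shared event (PSU, vibration, ...)
    P_KILL = 0.5          # assumed chance a given drive dies when the shared event hits
    TRIALS = 500_000

    def failures_in_one_year(correlated):
        event = correlated and random.random() < P_EVENT
        failed = 0
        for _ in range(N_DISKS):
            dead = random.random() < P_DRIVE          # independent wear-out/defect failure
            if event and random.random() < P_KILL:    # extra deaths from the shared event
                dead = True
            failed += dead
        return failed

    for correlated in (False, True):
        losses = sum(failures_in_one_year(correlated) > PARITY for _ in range(TRIALS))
        label = "with a common-cause event" if correlated else "independent failures only"
        print(f"{label}: ~{losses / TRIALS:.1e} chance per year of exceeding parity")

With these made-up numbers, the shared-event term alone contributes several times the data-loss probability of the purely independent model, which is exactly the "negating the parity" effect asked about above.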
Richard Elling
2010-Mar-20 20:25 UTC
[zfs-discuss] sympathetic (or just multiple) drive failures
On Mar 19, 2010, at 7:07 PM, zfs ml wrote:

> Most discussions I have seen about RAID 5/6 and why it stops "working"
> seem to base their conclusions solely on single-drive characteristics and
> statistics. It seems to me there is a missing component in the discussion
> of drive failures: the real-world context of a system that lives in an
> environment shared by all the system components - for instance, the video
> of disks slowing down when they are yelled at is a good visual example of
> the negative effect of vibration on drives.
> http://www.youtube.com/watch?v=tDacjrSCeq4
>
> I thought the Google and CMU papers talked about a surprisingly high
> (higher than expected) rate of multiple failures of drives "nearby" each
> other, but I couldn't find it when I re-skimmed the papers now.
>
> What are people's experiences with multiple drive failures? Given that we
> often use same-brand/model/batch drives (even though we are not supposed
> to), the same enclosure, the same rack, etc. for a given RAID 5/6/z1/z2/z3
> system, should we be paying more attention to harmonics, vibration/isolation
> and non-intuitive system-level statistics that might be inducing
> close-proximity drive failures rather than just throwing more parity drives
> at the problem?

Yes :-)

Or to put this another way, when you have components in a system that are very reliable, the system failures become dominated by failures that are not directly attributable to the components. This is fallout from the notion of "synergy," or the whole being greater than the sum of the parts.

    synergy (noun): the interaction or cooperation of two or more
    organizations, substances, or other agents to produce a combined effect
    greater than the sum of their separate effects.

> What if our enclosure and environmental factors increase the system-level
> statistics for multiple drive failures beyond the (used by everyone)
> single-drive failure statistics to the point where it is essentially
> negating the positive effect of adding parity drives?

Statistical studies and reliability predictions for components do not take into account causes such as factory contamination, environment, shipping/handling events, etc. The math is a lot easier if you can forget about such things.

> I realize this issue is not addressed because there is too much variability
> in the environments, etc., but I thought it would be interesting to see if
> anyone has experienced much in terms of close-time-proximity, multiple
> drive failures.

I see this on occasion. However, the cause is rarely attributed to a bad batch of drives. More common are power supplies, HBA firmware, cables, the Pepsi syndrome, or similar.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
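The point about the math being easier without common causes can be made concrete with a closed-form version of the same toy comparison sketched earlier: the usual independent-binomial model, and the same model with a single common-cause term bolted on. The numbers (8-disk raidz2, 3% per-drive annual failure probability, 0.5% annual shared-event rate, 50% kill probability per drive during the event) are again illustrative assumptions, not measurements.

    from math import comb

    n, k = 8, 2            # 8-disk raidz2: data loss needs more than k failed drives (assumed)
    p = 0.03               # assumed per-drive annual failure probability
    p_event = 0.005        # assumed annual rate of a shared, enclosure-level event
    p_kill = 0.5           # assumed per-drive death probability given the event

    def p_more_than_k(prob):
        """Probability that more than k of n independent drives fail at rate prob."""
        return sum(comb(n, i) * prob**i * (1 - prob)**(n - i) for i in range(k + 1, n + 1))

    independent = p_more_than_k(p)
    # During a shared event a drive dies either on its own or because of the event.
    p_during_event = 1 - (1 - p) * (1 - p_kill)
    with_common_cause = (1 - p_event) * independent + p_event * p_more_than_k(p_during_event)

    print(f"independent-only model : {independent:.2e} per year")
    print(f"plus one common cause  : {with_common_cause:.2e} per year")

The independent term shrinks roughly like p cubed as drives get more reliable, while the common-cause term stays roughly proportional to the event rate, so the latter quickly dominates. That is why a prediction built only from per-drive statistics can look much better than the system it describes.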
Bob Friesenhahn
2010-Mar-20 20:52 UTC
[zfs-discuss] sympathetic (or just multiple) drive failures
On Fri, 19 Mar 2010, zfs ml wrote:

> same enclosure, same rack, etc. for a given RAID 5/6/z1/z2/z3 system,
> should we be paying more attention to harmonics, vibration/isolation and
> non-intuitive system-level statistics that might be inducing close-proximity
> drive failures rather than just throwing more parity drives at the problem?

Yes. Perfect symmetry is: a) wonderful, or b) evil?

What is the meaning of "forklift impalement"? What is the standard spacing between forklift tines? What is the height of a Hoover vacuum cleaner handle?

Many things that logical engineers do are excessively "Michelangelo" when "Picasso" is what is needed.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Erik Trimble
2010-Mar-20 23:14 UTC
[zfs-discuss] sympathetic (or just multiple) drive failures
Richard Elling wrote:

> I see this on occasion. However, the cause is rarely attributed to a bad
> batch of drives. More common are power supplies, HBA firmware, cables,
> the Pepsi syndrome, or similar.
> -- richard

Mmmm. Pepsi Syndrome. I take it this is similar to the Coke addiction many of my keyboards have displayed, going to great lengths to make sure that I pour at least a half can into them at the least convenient time?

Also, see the related disease, C>N>S (Coke through Nose onto Screen).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Svein Skogen
2010-Mar-20 23:17 UTC
[zfs-discuss] sympathetic (or just multiple) drive failures
On 21.03.2010 00:14, Erik Trimble wrote:

> Richard Elling wrote:
>> I see this on occasion. However, the cause is rarely attributed to a bad
>> batch of drives. More common are power supplies, HBA firmware, cables,
>> the Pepsi syndrome, or similar.
>> -- richard
>
> Mmmm. Pepsi Syndrome. I take it this is similar to the Coke addiction
> many of my keyboards have displayed, going to great lengths to make sure
> that I pour at least a half can into them at the least convenient time?
>
> Also, see the related disease, C>N>S (Coke through Nose onto Screen).

Not to mention the "sysadmin having a bad day, tower front door dented" syndrome. ;)

//Svein
--
Sending mail from a temporarily set up workstation, as my primary W500
is off for service. PGP not installed.
Bill Sommerfeld
2010-Mar-21 00:38 UTC
[zfs-discuss] sympathetic (or just multiple) drive failures
On 03/19/10 19:07, zfs ml wrote:

> What are people's experiences with multiple drive failures?

1985-1986. DEC RA81 disks. Bad glue that degraded at the disk's operating temperature. Head crashes. No more need be said.

 - Bill