Below is another paper on drive failure analysis, this one won best paper
at USENIX:

http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html

What I found most interesting was the idea that drives don't fail outright
most of the time. They can slow down operations and slowly die.

With this behavior in mind, I had an idea for a new feature in ZFS: if a
disk fitness test were available to verify disk read/write and performance,
future drive problems could be avoided.

Some example tests:
- full disk read
- 8kb r/w iops
- 1mb r/w iops
- raw throughput

Since one disk may be different from the others, I thought a comparison
between two presumably similar disks would be useful.

The command would be something like:

	zpool dft c1t0d0 c1t1d0

Or:

	zpool dft all

I think this would be a great feature, as only ZFS can do fitness tests on
live, running disks behind the scenes. With the ability to compare
individual disk performance, not only will you find bad disks, it's
entirely possible you'll find misconfigurations (such as bad connections)
as well.

And yes, I do know about SMART. SMART can pre-indicate a disk failure.
However, I've run SMART on drives with bearings that were gravel, and they
passed SMART even though I knew the 10k drive was running at about 3k rpm
due to the bearings.

-----
Gregory Shaw, IT Architect
ITCTO Group, Sun Microsystems Inc.          Phone: (303) 272-8817 (x78817)
500 Eldorado Blvd, UBRM02-157               greg.shaw at sun.com (work)
Broomfield, CO 80021                        shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've won." - Linus Torvalds
Gregory Shaw wrote:
> Below is another paper on drive failure analysis, this one won best
> paper at USENIX:
>
> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
>
> What I found most interesting was the idea that drives don't fail
> outright most of the time. They can slow down operations and slowly die.

Yes, this is what my data shows, too. You are most likely to see an
unrecoverable read which leads to a retry (slow response symptom).

> With this behavior in mind, I had an idea for a new feature in ZFS:
>
> If a disk fitness test were available to verify disk read/write and
> performance, future drive problems could be avoided.
>
> Some example tests:
> - full disk read
> - 8kb r/w iops
> - 1mb r/w iops
> - raw throughput

Some problems can be seen by doing a simple sequential read and comparing
it to historical data. It depends on the failure mode, though.

> Since one disk may be different from the others, I thought a comparison
> between two presumably similar disks would be useful.
>
> The command would be something like:
> 	zpool dft c1t0d0 c1t1d0
> Or:
> 	zpool dft all
>
> I think this would be a great feature, as only ZFS can do fitness tests
> on live, running disks behind the scenes.

I like the concept, but don't see why ZFS would be required.

> With the ability to compare individual disk performance, not only will
> you find bad disks, it's entirely possible you'll find
> misconfigurations (such as bad connections) as well.

A few years ago we looked at unusual changes in response time as a
leading indicator, but I don't recall the details as to why we dropped
the effort. Perhaps we should take a look again?

> And yes, I do know about SMART. SMART can pre-indicate a disk failure.
> However, I've run SMART on drives with bearings that were gravel, and
> they passed SMART even though I knew the 10k drive was running at about
> 3k rpm due to the bearings.

ditto.
 -- richard
On Feb 21, 2007, at 4:59 PM, Richard Elling wrote:
>> With this behavior in mind, I had an idea for a new feature in ZFS:
>> If a disk fitness test were available to verify disk read/write
>> and performance, future drive problems could be avoided.
>> Some example tests:
>> - full disk read
>> - 8kb r/w iops
>> - 1mb r/w iops
>> - raw throughput
>
> Some problems can be seen by doing a simple sequential read and
> comparing it to historical data. It depends on the failure mode, though.

I agree. Having this feature could provide that history.

>> Since one disk may be different from the others, I thought a
>> comparison between two presumably similar disks would be useful.
>> The command would be something like:
>> 	zpool dft c1t0d0 c1t1d0
>> Or:
>> 	zpool dft all
>> I think this would be a great feature, as only ZFS can do fitness
>> tests on live, running disks behind the scenes.
>
> I like the concept, but don't see why ZFS would be required.

I'm thinking of production systems. Since you can't evacuate the disk,
ZFS can do read/write tests on the unused portion of the disk. I don't
think that would be possible via another solution, such as SVM/UFS.

>> With the ability to compare individual disk performance, not only
>> will you find bad disks, it's entirely possible you'll find
>> misconfigurations (such as bad connections) as well.
>
> A few years ago we looked at unusual changes in response time as a
> leading indicator, but I don't recall the details as to why we dropped
> the effort. Perhaps we should take a look again?

More information is good in my book. Anything that can tell me that
things aren't quite right translates into more uptime.

>> And yes, I do know about SMART. SMART can pre-indicate a disk failure.
>> However, I've run SMART on drives with bearings that were gravel, and
>> they passed SMART even though I knew the 10k drive was running at
>> about 3k rpm due to the bearings.
>
> ditto.
> -- richard
On Wed, Feb 21, 2007 at 03:35:06PM -0700, Gregory Shaw wrote:
> Below is another paper on drive failure analysis, this one won best
> paper at USENIX:
>
> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
>
> What I found most interesting was the idea that drives don't fail
> outright most of the time. They can slow down operations and slowly die.

Seems like there are two pieces you're suggesting here:

1. Some sort of background process to proactively find errors on disks
   in use by ZFS. This will be accomplished by a background scrubbing
   option, dependent on the block-rewriting work Matt and Mark are
   working on. This will allow something like "zpool set scrub=2weeks",
   which will tell ZFS to "scrub my data at an interval such that all
   data is touched over a 2 week period". This will test reading from
   every block and verifying checksums. Stressing write failures is a
   little more difficult.

2. Distinguish "slow" drives from "normal" drives and proactively mark
   them faulted. This shouldn't require an explicit "zpool dft", as
   we should be watching the response times of the various drives and
   keeping this as a statistic. We want to incorporate this information
   to allow better allocation amongst slower and faster drives.
   Determining that a drive is "abnormally slow" is much more difficult,
   though it could theoretically be done if we had some basis - either
   historical performance for the same drive or comparison to identical
   drives (manufacturer/model) within the pool. While we've thought
   about these same issues, there is currently no active effort to keep
   track of these statistics or do anything with them.

These two things combined should avoid the need for an explicit fitness
test.

Hope that helps,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Feb 21, 2007, at 5:20 PM, Eric Schrock wrote:
> On Wed, Feb 21, 2007 at 03:35:06PM -0700, Gregory Shaw wrote:
>> Below is another paper on drive failure analysis, this one won best
>> paper at USENIX:
>>
>> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
>>
>> What I found most interesting was the idea that drives don't fail
>> outright most of the time. They can slow down operations and slowly die.
>
> Seems like there are two pieces you're suggesting here:
>
> 1. Some sort of background process to proactively find errors on disks
>    in use by ZFS. This will be accomplished by a background scrubbing
>    option, dependent on the block-rewriting work Matt and Mark are
>    working on. This will allow something like "zpool set scrub=2weeks",
>    which will tell ZFS to "scrub my data at an interval such that all
>    data is touched over a 2 week period". This will test reading from
>    every block and verifying checksums. Stressing write failures is a
>    little more difficult.

I was thinking of something similar to a scrub. An ongoing process
seemed too intrusive. I'd envisioned a cron job, similar to a scrub (or
defrag), that could be run periodically to show any differences between
disk performance over time.

> 2. Distinguish "slow" drives from "normal" drives and proactively mark
>    them faulted. This shouldn't require an explicit "zpool dft", as
>    we should be watching the response times of the various drives and
>    keeping this as a statistic. We want to incorporate this information
>    to allow better allocation amongst slower and faster drives.
>    Determining that a drive is "abnormally slow" is much more difficult,
>    though it could theoretically be done if we had some basis - either
>    historical performance for the same drive or comparison to identical
>    drives (manufacturer/model) within the pool. While we've thought
>    about these same issues, there is currently no active effort to keep
>    track of these statistics or do anything with them.

I thought this would be very difficult to determine, as a slow disk
could be a transient problem.

Me, I like tools that give me information I can work with. Fully
automated systems always seem to cause more problems than they solve.
For instance, if I have a drive on a PC using a shared IDE bus, is it
the disk that is slow, or the connection method? It's obviously the
second, but finding that programmatically will be very difficult.

I like the idea of a dft for testing a disk in a subjective manner. One
benefit of this could be an objective performance test baseline for
disks and arrays.

Btw, it does help. :-)

> These two things combined should avoid the need for an explicit fitness
> test.
>
> Hope that helps,
>
> - Eric
On 2/22/07, Gregory Shaw <Greg.Shaw at sun.com> wrote:
>
> I was thinking of something similar to a scrub. An ongoing process
> seemed too intrusive. I'd envisioned a cron job, similar to a scrub (or
> defrag), that could be run periodically to show any differences between
> disk performance over time.
...
> I thought this would be very difficult to determine, as a slow disk
> could be a transient problem.
>
> Me, I like tools that give me information I can work with. Fully
> automated systems always seem to cause more problems than they solve.

If the stats are publishable, then something like Cacti or any
monitoring tool should give most admins enough to spot potential issues.

Nicholas
All,
    I think dtrace could be a viable option here: a cron job that runs a
dtrace script on a regular basis, times a series of reads, and then
provides that info to Cacti or rrdtool. It's not quite the
one-size-fits-all that the OP was looking for, but if you want trends,
this should get 'em.

$0.02

Regards,
TJ Easter

On 2/21/07, Nicholas Lee <emptysands at gmail.com> wrote:
> On 2/22/07, Gregory Shaw <Greg.Shaw at sun.com> wrote:
> >
> > I was thinking of something similar to a scrub. An ongoing process
> > seemed too intrusive. I'd envisioned a cron job, similar to a scrub
> > (or defrag), that could be run periodically to show any differences
> > between disk performance over time.
> ...
> > I thought this would be very difficult to determine, as a slow disk
> > could be a transient problem.
> >
> > Me, I like tools that give me information I can work with. Fully
> > automated systems always seem to cause more problems than they solve.
>
> If the stats are publishable, then something like Cacti or any
> monitoring tool should give most admins enough to spot potential issues.
>
> Nicholas

--
"Being a humanist means trying to behave decently without expectation
of rewards or punishment after you are dead." -- Kurt Vonnegut
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x31185D8E
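For what it's worth, a rough sketch of the kind of script TJ describes
might look like the following. It is untested, and the 60-second interval
and output format are arbitrary assumptions; it simply uses the io
provider to keep average and worst-case read service times per device,
in a form a wrapper could append to a log for rrdtool or Cacti:

    #!/usr/sbin/dtrace -s
    /* Sketch: per-device read latency, printed every 60s for trending. */
    #pragma D option quiet

    io:::start
    /args[0]->b_flags & B_READ/
    {
            /* remember when each read was issued */
            start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
    }

    io:::done
    /start[args[0]->b_edev, args[0]->b_blkno]/
    {
            this->delta = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
            @avg[args[1]->dev_statname] = avg(this->delta);
            @peak[args[1]->dev_statname] = max(this->delta);
            start[args[0]->b_edev, args[0]->b_blkno] = 0;
    }

    tick-60sec
    {
            /* one line per device: average and worst read time, in ns */
            printa("%s avg=%@d max=%@d\n", @avg, @peak);
            trunc(@avg);
            trunc(@peak);
    }

Running it from cron for a bounded window (e.g. "dtrace -s readlat.d
-c 'sleep 300'", where readlat.d is whatever you name the script) and
graphing the per-device averages over weeks would give the kind of trend
data being discussed here.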
Correct me if I'm wrong, but FMA seems like a more appropriate tool to
track disk errors.

--
Just me,
Wire ...

On 2/22/07, TJ Easter <tjeaster at gmail.com> wrote:
> All,
>     I think dtrace could be a viable option here: a cron job that runs
> a dtrace script on a regular basis, times a series of reads, and then
> provides that info to Cacti or rrdtool. It's not quite the
> one-size-fits-all that the OP was looking for, but if you want trends,
> this should get 'em.
>
> $0.02
>
> Regards,
> TJ Easter
Richard Elling <Richard.Elling at Sun.COM> wrote:
> > If a disk fitness test were available to verify disk read/write and
> > performance, future drive problems could be avoided.
> >
> > Some example tests:
> > - full disk read
> > - 8kb r/w iops
> > - 1mb r/w iops
> > - raw throughput
>
> Some problems can be seen by doing a simple sequential read and comparing
> it to historical data. It depends on the failure mode, though.

Something that people often forget about is the bearings. Sometimes the
disk writes too early, assuming that the head is already on track; the
worn-out bearing, however, causes a track-following problem....

For this reason, you need to run a random write test on old disks.
sformat includes such a test...

Jörg

--
 EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
        js at cs.tu-berlin.de                (uni)
        schilling at fokus.fraunhofer.de     (work)  Blog: http://schily.blogspot.com/
 URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
On Wed, Feb 21, 2007 at 04:20:58PM -0800, Eric Schrock wrote:
> Seems like there are two pieces you're suggesting here:
>
> 1. Some sort of background process to proactively find errors on disks
>    in use by ZFS. This will be accomplished by a background scrubbing
>    option, dependent on the block-rewriting work Matt and Mark are
>    working on. This will allow something like "zpool set scrub=2weeks",
>    which will tell ZFS to "scrub my data at an interval such that all
>    data is touched over a 2 week period". This will test reading from
>    every block and verifying checksums. Stressing write failures is a
>    little more difficult.

I got the impression that testing free disk space was also desired.

> 2. Distinguish "slow" drives from "normal" drives and proactively mark
>    them faulted. This shouldn't require an explicit "zpool dft", as
>    we should be watching the response times of the various drives and
>    keeping this as a statistic. We want to incorporate this information
>    to allow better allocation amongst slower and faster drives.
>    Determining that a drive is "abnormally slow" is much more difficult,
>    though it could theoretically be done if we had some basis - either
>    historical performance for the same drive or comparison to identical
>    drives (manufacturer/model) within the pool. While we've thought
>    about these same issues, there is currently no active effort to keep
>    track of these statistics or do anything with them.

I would imagine that "slow" as in "long average seek times" should be
relatively easy to detect, whereas "slow" as in "low bandwidth" might be
harder (since I/O bandwidth might depend on characteristics of the
device path and how saturated it is).

Are long average seek times an indication of trouble?

Nico
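On the seek-time side of this, the io provider doesn't expose seek times
directly, but a crude proxy is the distance between consecutive requests
sent to each device. A rough, untested sketch (not an existing tool) that
quantizes that distance per device is below; it won't say how long the
seeks took, but it would show whether a drive that looks slow is simply
being asked to seek more than its peers:

    #!/usr/sbin/dtrace -s
    /* Sketch: per-device distribution of distance between consecutive I/Os. */
    #pragma D option quiet

    io:::start
    /last[args[1]->dev_statname] != 0/
    {
            /* absolute distance, in blocks, from the previous request */
            this->dist = args[0]->b_blkno > last[args[1]->dev_statname] ?
                args[0]->b_blkno - last[args[1]->dev_statname] :
                last[args[1]->dev_statname] - args[0]->b_blkno;
            @seek[args[1]->dev_statname] = quantize(this->dist);
    }

    io:::start
    {
            last[args[1]->dev_statname] = args[0]->b_blkno;
    }

    END
    {
            printa(@seek);
    }

Comparing the resulting distributions against per-device latency numbers
would at least separate "slow because it is seeking a lot" from "slow for
no visible reason".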
Eric Schrock wrote:
> 1. Some sort of background process to proactively find errors on disks
>    in use by ZFS. This will be accomplished by a background scrubbing
>    option, dependent on the block-rewriting work Matt and Mark are
>    working on. This will allow something like "zpool set scrub=2weeks",
>    which will tell ZFS to "scrub my data at an interval such that all
>    data is touched over a 2 week period".

Obviously, scrubbing and correcting "hard" errors that result in ZFS
checksum errors is very beneficial. However, it won't address the case
of "soft" errors, where the disk returns correct data but observes some
problems reading it. There are at least two good reasons to pay
attention to these "soft" errors:

a) Preemptive detection and rewriting of partially defective but
   still correctable sectors may prevent future data loss. Thus, it
   improves the perceived reliability of disk drives, which is
   especially important in the JBOD case (including a single-drive JBOD).

b) It is not uncommon for such successful reads of partially defective
   media to happen only after several retries. It is somewhat unfortunate
   that there is no simple way to tell the drive how many times to retry.
   Firmware in ATA/SATA drives, used predominantly in single-disk PCs,
   will typically make a heroic effort to retrieve the data. It will
   make numerous attempts to reposition the actuator, recalibrate the
   head current, etc. This can take up to 20-40 seconds! Such a strategy
   is reasonable for a desktop PC, but if it happens in a busy enterprise
   file server it results in a temporary availability loss (the drive
   freezes for up to 20-40 seconds every time you try to read that
   sector). This strategy also makes no sense if the RAID group in which
   the drive participates has redundant data elsewhere, which is why
   SCSI/FC drives give up after a few retries.

One can detect (and repair) such problematic areas on disk by monitoring
the SMART counters during scrubbing, and/or by monitoring physical read
timings (looking for abnormally slow ones).

-- Olaf
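As a crude illustration of the read-timing approach Olaf mentions, a
dtrace sketch along the following lines (untested, and the one-second
threshold is an arbitrary assumption) would log any physical read whose
service time looks like one of those multi-second retry episodes:

    #!/usr/sbin/dtrace -s
    /* Sketch: log physical reads that take longer than 1 second. */
    #pragma D option quiet

    io:::start
    /args[0]->b_flags & B_READ/
    {
            rstart[args[0]->b_edev, args[0]->b_blkno] = timestamp;
    }

    io:::done
    /rstart[args[0]->b_edev, args[0]->b_blkno] &&
        (timestamp - rstart[args[0]->b_edev, args[0]->b_blkno]) > 1000000000/
    {
            printf("%Y %s: block %d took %d ms\n", walltimestamp,
                args[1]->dev_statname, args[0]->b_blkno,
                (timestamp - rstart[args[0]->b_edev, args[0]->b_blkno]) / 1000000);
    }

    io:::done
    {
            /* clean up the bookkeeping for every completed I/O */
            rstart[args[0]->b_edev, args[0]->b_blkno] = 0;
    }

A sector that consistently shows up here but still returns good data is
exactly the kind of "soft" error that a checksum-only scrub would never
flag.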
On Thu, Feb 22, 2007 at 10:45:04AM -0800, Olaf Manczak wrote:
>
> Obviously, scrubbing and correcting "hard" errors that result in ZFS
> checksum errors is very beneficial. However, it won't address the case
> of "soft" errors, where the disk returns correct data but observes some
> problems reading it. There are at least two good reasons to pay
> attention to these "soft" errors:
>
> a) Preemptive detection and rewriting of partially defective but
>    still correctable sectors may prevent future data loss. Thus, it
>    improves the perceived reliability of disk drives, which is
>    especially important in the JBOD case (including a single-drive JBOD).

These types of soft errors will be logged, managed, and (eventually)
diagnosed by the SCSI FMA work currently in development. If the SCSI DE
diagnoses a disk as faulty, then the ZFS agent will be able to respond
appropriately.

> b) It is not uncommon for such successful reads of partially defective
>    media to happen only after several retries. It is somewhat unfortunate
>    that there is no simple way to tell the drive how many times to retry.
>    Firmware in ATA/SATA drives, used predominantly in single-disk PCs,
>    will typically make a heroic effort to retrieve the data. It will
>    make numerous attempts to reposition the actuator, recalibrate the
>    head current, etc. This can take up to 20-40 seconds! Such a strategy
>    is reasonable for a desktop PC, but if it happens in a busy enterprise
>    file server it results in a temporary availability loss (the drive
>    freezes for up to 20-40 seconds every time you try to read that
>    sector). This strategy also makes no sense if the RAID group in which
>    the drive participates has redundant data elsewhere, which is why
>    SCSI/FC drives give up after a few retries.
>
> One can detect (and repair) such problematic areas on disk by monitoring
> the SMART counters during scrubbing, and/or by monitoring physical read
> timings (looking for abnormally slow ones).

Solaris currently has a disk monitoring FMA module that is specific to
Thumper (x4500) and monitors only the most basic information (overtemp,
self-test failure, predictive failure). I have separated this out into a
common FMA transport module which will bring this functionality to all
platforms (though support for SCSI devices will depend on the
aforementioned SCSI FMA portfolio). This should be putback soon. Future
work could expand this beyond the simple indicators into more detailed
analysis of various counters.

All of this is really a common FMA problem, not a ZFS-specific one. All
that is needed in ZFS is an agent actively responding to external
diagnoses. I am laying the groundwork for this as part of the ongoing
ZFS/FMA work mentioned in other threads. For more information on ongoing
FMA work, I recommend visiting the FMA discussion forum.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Feb 22, 2007, at 11:55 AM, Eric Schrock wrote:

[ ... ]

>> b) It is not uncommon for such successful reads of partially defective
>>    media to happen only after several retries. It is somewhat unfortunate
>>    that there is no simple way to tell the drive how many times to retry.
>>    Firmware in ATA/SATA drives, used predominantly in single-disk PCs,
>>    will typically make a heroic effort to retrieve the data. It will
>>    make numerous attempts to reposition the actuator, recalibrate the
>>    head current, etc. This can take up to 20-40 seconds! Such a strategy
>>    is reasonable for a desktop PC, but if it happens in a busy enterprise
>>    file server it results in a temporary availability loss (the drive
>>    freezes for up to 20-40 seconds every time you try to read that
>>    sector). This strategy also makes no sense if the RAID group in which
>>    the drive participates has redundant data elsewhere, which is why
>>    SCSI/FC drives give up after a few retries.
>>
>> One can detect (and repair) such problematic areas on disk by monitoring
>> the SMART counters during scrubbing, and/or by monitoring physical read
>> timings (looking for abnormally slow ones).
>
> Solaris currently has a disk monitoring FMA module that is specific to
> Thumper (x4500) and monitors only the most basic information (overtemp,
> self-test failure, predictive failure). I have separated this out into a
> common FMA transport module which will bring this functionality to all
> platforms (though support for SCSI devices will depend on the
> aforementioned SCSI FMA portfolio). This should be putback soon. Future
> work could expand this beyond the simple indicators into more detailed
> analysis of various counters.
>
> All of this is really a common FMA problem, not a ZFS-specific one. All
> that is needed in ZFS is an agent actively responding to external
> diagnoses. I am laying the groundwork for this as part of the ongoing
> ZFS/FMA work mentioned in other threads. For more information on ongoing
> FMA work, I recommend visiting the FMA discussion forum.
>
> - Eric

I disagree. Originally, I asked for the following:

- Objective performance reporting in a simple-to-parse format (similar
  to scrub)
- The ability to schedule non-data-intrusive disk tests to verify disk
  performance.
- The ability to compare two similar disks for performance.

In the above, you've taken proactive capabilities and turned them into
failure mitigation, i.e., reactive capabilities. From the paper, the
problem isn't outright disk failure, but disk performance degradation.
I asked for the above to easily determine whether a disk is performing
similarly to others, or may be degrading.

The need for ZFS to do this is two-fold:

1. ZFS can write to the disk non-intrusively. Any subsystem outside of
   the native filesystem will be able to execute read tests only, which
   is only part of the analysis.

2. If the command is available at the zfs (or pool) level, it becomes an
   easy method of diagnosis. When you must "roll your own" via script or
   dtrace, the objectivity goes away and comparisons between systems
   become increasingly difficult.

My concern with moving this exclusively into FMA has to do with focus.
I've found that most fault mitigation systems concentrate on just that:
faults. Performance degradation isn't treated as a fault, and usually
falls out of any fault management system as a "we'd like to do that, but
we've got bigger things to do."