http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery

Is there a way, other than buying enterprise (RAID-specific) drives, to use normal drives in an array? Does anyone have any success stories regarding a particular model? The TLER setting cannot be edited on newer drives from Western Digital, unfortunately. Are there some settings in ZFS that can be used to compensate for this? -- This message posted from opensolaris.org
Nathan wrote:
> http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery
>
> Is there a way, other than buying enterprise (RAID-specific) drives, to use normal drives in an array?
>
> Does anyone have any success stories regarding a particular model?
>
> The TLER setting cannot be edited on newer drives from Western Digital, unfortunately. Are there some settings in ZFS that can be used to compensate for this?

What is the problem you are trying to solve that makes you think you need this or a similar feature? -- Darren J Moffat
Sorry, I probably didn't make myself exactly clear. Basically, drives without particular TLER settings drop out of RAID randomly.

* Error Recovery - This is called various things by various manufacturers (TLER, ERC, CCTL). In a desktop drive, the goal is to do everything possible to recover the data. In an enterprise drive, the goal is to ALWAYS return SOMETHING within the timeout period; if the data can't be recovered within that time, let the RAID controller reconstruct it. (See the Wikipedia article.)

Does this happen in ZFS? Maybe it is particular to hardware RAID controllers. -- This message posted from opensolaris.org
http://www.stringliterals.com/?p=77 This guy talks about it too under "Hard Drives". -- This message posted from opensolaris.org
Yeah, this is my main concern with moving from my cheap Linux server with no redundancy to ZFS RAID on OpenSolaris; I don't really want to have to pay twice as much to buy the 'enterprise' disks which appear to be exactly the same drives with a flag set in the firmware to limit read retries, but I also don't want to lose all my data because a sector fails and the drive hangs for a minute trying to relocate it, causing the file system to fall over.

I haven't found a definitive answer as to whether this will kill a ZFS RAID like it kills traditional hardware RAID or whether ZFS will recover after the drive stops attempting to relocate the sector. At least with a single drive setup the OS will eventually get an error response and the other files on the disk will be readable when I copy them over to a new drive. -- This message posted from opensolaris.org
Mark Grant wrote:
> Yeah, this is my main concern with moving from my cheap Linux server with no redundancy to ZFS RAID on OpenSolaris; I don't really want to have to pay twice as much to buy the 'enterprise' disks which appear to be exactly the same drives with a flag set in the firmware to limit read retries, but I also don't want to lose all my data because a sector fails and the drive hangs for a minute trying to relocate it, causing the file system to fall over.

So use the same cheap hardware you used on Linux.

> I haven't found a definitive answer as to whether this will kill a ZFS RAID like it kills traditional hardware RAID or whether ZFS will recover after the drive stops attempting to relocate the sector. At least with a single drive setup the OS will eventually get an error response and the other files on the disk will be readable when I copy them over to a new drive.

A combination of ZFS and FMA on OpenSolaris means it will recover. How long the timeouts actually are will depend on many factors - not just the hard drive and its firmware. -- Darren J Moffat
Mark Grant wrote:
> Yeah, this is my main concern with moving from my cheap Linux server with no redundancy to ZFS RAID on OpenSolaris; I don't really want to have to pay twice as much to buy the 'enterprise' disks which appear to be exactly the same drives with a flag set in the firmware to limit read retries, but I also don't want to lose all my data because a sector fails and the drive hangs for a minute trying to relocate it, causing the file system to fall over.
>
> I haven't found a definitive answer as to whether this will kill a ZFS RAID like it kills traditional hardware RAID or whether ZFS will recover after the drive stops attempting to relocate the sector. At least with a single drive setup the OS will eventually get an error response and the other files on the disk will be readable when I copy them over to a new drive.

I don't think ZFS does any timing out. It's up to the drivers underneath to time out and send an error back to ZFS - only they know what's reasonable for a given disk type and bus type. So I guess this may depend on which drivers you are using. I don't know what the timeouts are, but I have observed them to be long in some cases when things do go wrong and timeouts and retries are triggered. -- Andrew
Mark Grant wrote:
> Yeah, this is my main concern with moving from my cheap Linux server with no redundancy to ZFS RAID on OpenSolaris; I don't really want to have to pay twice as much to buy the 'enterprise' disks which appear to be exactly the same drives with a flag set in the firmware to limit read retries, but I also don't want to lose all my data because a sector fails and the drive hangs for a minute trying to relocate it, causing the file system to fall over.
>
> I haven't found a definitive answer as to whether this will kill a ZFS RAID like it kills traditional hardware RAID or whether ZFS will recover after the drive stops attempting to relocate the sector. At least with a single drive setup the OS will eventually get an error response and the other files on the disk will be readable when I copy them over to a new drive.

The issue is excessive error recovery times INTERNAL to the hard drive. So the worst case scenario is that ZFS marks the drive as "bad" during a write, causing the zpool to be degraded. It's not going to lose your data. It just may cause a "premature" marking of a drive as bad.

None of this kills a RAID (ZFS, traditional SW RAID, or HW RAID). It doesn't cause data corruption. The issue is sub-optimal disk fault determination.

If you suspect that the drive really isn't bad, you can simply re-add it back into the zpool and have it resilvered, which should take considerably less time than a full-drive resilver. That said, if your drive really is taking 10-15 seconds to remap bad sectors, maybe you _should_ replace it.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
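To make Erik's "re-add it back into the zpool" step concrete, here is a minimal sketch; "tank" and the c1t3d0 device name are only examples, and whether you want clear, online, or replace depends on how the drive was marked:

   # see which device ZFS considers degraded or faulted
   zpool status -x

   # clear the error counters; ZFS resilvers only the data written while
   # the device was out, which is much quicker than a full resilver
   zpool clear tank c1t3d0

   # if the device had been explicitly offlined, bring it back instead
   zpool online tank c1t3d0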
From what I remember, the problem with the hardware RAID controller is that the long delay before the drive responds causes the drive to be dropped from the RAID, and then if you get another error on a different drive while trying to repair the RAID, that disk is also marked failed and your whole filesystem is gone even though most of the data is still readable on the disks; odds are you could have recovered 100% of the data using what is still readable on the complete set of drives, since the bad sectors on the two failed drives probably wouldn't be in the same place. The end result is worse than not using RAID because you lose everything rather than just the files with bad sectors (though if you're using mirroring rather than parity then you could presumably recover most of the data eventually).

Certainly if the disk was taking that long to respond I'd be replacing it ASAP, but ASAP may not be fast enough if a second drive has bad sectors too. And I have seen a consumer SATA drive repeatedly lock up a system for a minute doing retries when there was no indication at all beforehand that the drive had problems. -- This message posted from opensolaris.org
On Dec 10, 2009, at 8:36 AM, Mark Grant wrote:
> From what I remember, the problem with the hardware RAID controller is that the long delay before the drive responds causes the drive to be dropped from the RAID, and then if you get another error on a different drive while trying to repair the RAID, that disk is also marked failed and your whole filesystem is gone even though most of the data is still readable on the disks; odds are you could have recovered 100% of the data using what is still readable on the complete set of drives, since the bad sectors on the two failed drives probably wouldn't be in the same place. The end result is worse than not using RAID because you lose everything rather than just the files with bad sectors (though if you're using mirroring rather than parity then you could presumably recover most of the data eventually).
>
> Certainly if the disk was taking that long to respond I'd be replacing it ASAP, but ASAP may not be fast enough if a second drive has bad sectors too. And I have seen a consumer SATA drive repeatedly lock up a system for a minute doing retries when there was no indication at all beforehand that the drive had problems.

For the Solaris sd(7d) driver, the default timeout is 60 seconds with 3 or 5 retries, depending on the hardware. Whether you notice this at the application level depends on other factors: reads vs writes, etc. You can tune this, of course, and you have access to the source. -- richard
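For reference, the global sd timeout Richard mentions can be shortened with an /etc/system tunable. This is only a sketch - the sd_io_time name should be verified against the sd source for your build, and a shorter timeout applies to every sd device in the box:

   * /etc/system: cut the per-command timeout used by sd(7d) from the
   * default 60 seconds to 10 seconds (takes effect at next boot)
   set sd:sd_io_time = 10

   * drives attached through the ssd driver have a matching tunable
   set ssd:ssd_io_time = 10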
Thanks, sounds like it should handle all but the worst faults OK then; I believe the maximum retry timeout is typically set to about 60 seconds in consumer drives. -- This message posted from opensolaris.org
> Mark Grant wrote:
> I don't think ZFS does any timing out. It's up to the drivers underneath to timeout and send an error back to ZFS - only they know what's reasonable for a given disk type and bus type.

I think that is the issue. By my reading, many (if not most) consumer drives don't put any internal time limit on how long they will continue trying to read a sector. In other words, if the drive encounters a read error, it may sit there forever trying to read the bad block without actually reporting an error. This behavior is supposedly preferred (in consumer drives) because in a single drive setup, most users would rather have the drive try indefinitely to read a sector in the hopes of getting their data back rather than having less chance of recovery if the error recovery times out and the sector is marked as bad. Unless the controller, driver, or higher layers of the software stack time out a read request, the drive can hang the system. This issue has been discussed on the AVForums among other places.

Richard -- This message posted from opensolaris.org
I'm also planning on building a home file server using ZFS, and this issue has also come to my attention during my research. I'm afraid that I'm a complete ZFS/NAS/RAID newbie, so honestly half the things discussed in this thread went over my head. :) For a complete newbie, can someone simply answer the following: will using non-enterprise level drives affect ZFS like it affects hardware RAID? -- This message posted from opensolaris.org
On Fri, 11 Dec 2009, Bob wrote:
> For a complete newbie, can someone simply answer the following: will using non-enterprise level drives affect ZFS like it affects hardware RAID?

Yes.

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Thanks. Any alternatives, other than using enterprise-level drives? -- This message posted from opensolaris.org
On Fri, 11 Dec 2009, Bob wrote:
> Thanks. Any alternatives, other than using enterprise-level drives?

You can of course use normal consumer drives. Just don't expect them to recover from a read error very quickly.

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Most manufacturers have a utility available that sets this behavior. For WD drives, it's called WDTLER.EXE. You have to make a bootable USB stick to run the app, but it is simple to change the setting to the enterprise behavior. -- This message posted from opensolaris.org
Note you don't get the better vibration control and other improvements the enterprise drives have. So it's not exactly that easy. :) -- This message posted from opensolaris.org
Actually, recent batches of WD drives don't let you change the TLER setting anymore, which is why I was concerned about this. -- This message posted from opensolaris.org
Been lurking for about a week and a half and this is my first post...

--- bfriesen at simple.dallas.tx.us wrote:
> On Fri, 11 Dec 2009, Bob wrote:
>> Thanks. Any alternatives, other than using enterprise-level drives?
> You can of course use normal consumer drives. Just don't expect them to recover from a read error very quickly.

Any way to tell ZFS that these drives are of "lower quality" and shouldn't be kicked out as faulted so quickly? I personally set up my home server to use OpenSolaris so I could have ZFS safeguard my data. I am willing to trade away performance for more stability, and less "yes, that drive is perfectly fine" type management.

I'm also willing to have more redundancy and less storage with the same number of drives, but that has to wait until I have enough unused drives to set up a new pool with the new options (either raidz3 or full mirroring) and copy over, as there is no method to make this change in place.
On Dec 13, 2009, at 11:28 PM, Yaverot wrote:
> Been lurking for about a week and a half and this is my first post...
>
> --- bfriesen at simple.dallas.tx.us wrote:
>> On Fri, 11 Dec 2009, Bob wrote:
>>> Thanks. Any alternatives, other than using enterprise-level drives?
>> You can of course use normal consumer drives. Just don't expect them to recover from a read error very quickly.
>
> Any way to tell ZFS that these drives are of "lower quality" and shouldn't be kicked out as faulted so quickly? I personally set up my home server to use OpenSolaris so I could have ZFS safeguard my data. I am willing to trade away performance for more stability, and less "yes, that drive is perfectly fine" type management.

FMA (not ZFS, directly) looks for a number of failures over a period of time. By default that is 10 failures in 10 minutes. If you have an error that trips on TLER, the best it can see is 2-3 failures in 10 minutes. The symptom you will see is that when these long timeouts happen, they take a long time because, by default, the drive will be reset and the I/O retried after 60 seconds.

> I'm also willing to have more redundancy and less storage with the same number of drives, but that has to wait until I have enough unused drives to set up a new pool with the new options (either raidz3 or full mirroring) and copy over, as there is no method to make this change in place.

This is a good idea anyway :-) -- richard
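To watch what the diagnosis engine is actually seeing while one of these slow drives acts up, the stock FMA tools are enough. A sketch only - exact output and module names vary by build:

   # raw error telemetry, including ZFS io/checksum ereports as they arrive
   fmdump -eV

   # counters for the ZFS diagnosis engine itself
   fmstat -m zfs-diagnosis

   # anything that has actually been diagnosed as a fault
   fmadm faulty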
> FMA (not ZFS, directly) looks for a number of failures over a period of time. By default that is 10 failures in 10 minutes. If you have an error that trips on TLER, the best it can see is 2-3 failures in 10 minutes. The symptom you will see is that when these long timeouts happen, they take a long time because, by default, the drive will be reset and the I/O retried after 60 seconds.

That's very good news. I'm trying to get the stuff together to set up my zfs server, and I'm also perfectly willing to trade slower operation and more disks to get zfs' scrubbing and other operations.

The recent discovery that WD has decided to up its prices in a back-door manner by making sure that the DIY RAID folks can't modify TLER on cheaper drives was a real slap in the face, potentially more than doubling the price of storage. I've dealt with the MBA mentality before, and I don't like it. >:-| This discovery was bad enough to almost put me off building a server entirely, with the apparent options of paying 100% more for the disks or having the array suffer 100% data loss on any significant read/write error.

So let me be sure I understand. If I'm using solaris/zfs, I can use FMA to set the level of retries/time to be waited if I get a disk error before taking a disk out of the array. Is that correct? If it is, and that can be set to allow an array of disks to tolerate most instances of read/write errors without corrupting an entire array, then I'm back on with the server scheme.

The whole point of going to solaris/zfs is background scrubbing for me. I'm willing for it to be slow - however slow it is, it's much faster than finding the backup DVDs in the closet, pilfering through them to find the right one, then finding out the DVD set has bit-rot too.

I apologize for the baby-simple questions. I'm reading documentation as hard as I can, but there's a world of difference between reading documentation and understanding and using the tools described. -- This message posted from opensolaris.org
How can you set these values in FMA?

Yours
Markus Kovero
On Dec 14, 2009, at 10:18 AM, Markus Kovero wrote:
> How can you set these values in FMA?

UTSL:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c#775

Standard caveats for adjusting timeouts apply. -- richard
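For anyone who wants to experiment despite those caveats: the thresholds are ordinary fmd module properties, so something along these lines should work. This is a sketch only - the property names (io_N, io_T, checksum_N, checksum_T) are read out of zfs_de.c and may differ in your build:

   # /usr/lib/fm/fmd/plugins/zfs-diagnosis.conf
   # tolerate 20 I/O errors in 20 minutes before faulting a vdev
   setprop io_N 20
   setprop io_T 20min
   # and be similarly lenient about checksum errors
   setprop checksum_N 20
   setprop checksum_T 20min

   # then restart the fault manager so the module rereads its properties:
   #   svcadm restart svc:/system/fmd:default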
>>>>> "n" == Nathan <nathan at passivekid.com> writes:n> http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery This sounds silly. Does it actually work for you? It seems like comparing 7 seconds to the normal 30 seconds would be useless. Instead you want to compare (7 seconds * n levels * of cargo-cult retry in OS storage stack) to the 0.01 seconds it normally takes to read a sector. 3 orders of magnitude difference here is what makes slowly-failing drives useless, not the tiny difference between 7 and 30. A smart feature would be ``mark unreadable blocks in the drive''s onboard DRAM read cache and fail them instantly without an attempt on the medium, to work around broken OS storage stacks that can''t distinguish between cabling errors and drive reports and keep uselessly banging away on dead sectors as errors slowly propogate up an `abstracted'' stack,'''' and ``spend at most 30 seconds out of every 2000 seconds on various degraded error recovery gymnastics. If your time budget''s spent, toss up an error immediately, NO THINKING, immediately after the second time the platter rotated while the head should have been over the data, no matter where the head actually was or what you got or how certain you are the data is there unharmed if you can just recover head servo.'''' but I doubt the EE''s are smart enough to put that feature on the table. actually it''s probably not so much EE''s are dumb as that they assume OS designers can implement such policies in their drivers instead of needing them pushed down to the drive. which is, you know, a pretty reasonable (albeit wrong) assumption. The most interesting thing on that wikipedia page is that freebsd geom is already using a 4-second timeout. Once you''ve done that, I''m not sure if it matters whether the drive signals error by sending an error packet, or signals error by sending nothing for >4 seconds---just so long as you HEAR the signal and REACT. n> Basically drives without particular TLER settings drop out of n> RAID randomly. well...I would guess they''ll drop out whenever they hit a recoverable error. :) Maybe the modern drives are so crappy, this is happening so often, that it seems ``random''''. With these other cards, do the drives ``go back in'''' to the RAID when they start responding to commands again? n> Does this happen in ZFS? No. Any timeouts in ZFS are annoyingly based on the ``desktop'''' storage stack underneath it which is unaware of redundancy and of the possibility of reading data from elsewhere in a redundant stripe rather than waiting 7, 30, or 180 seconds for it. ZFS will bang away on a slow drive for hours, bringing the whole system down with it, rather than read redundant data from elsewhere in the stripe, so you don''t have to worry about drives dropping out randomly. Every last bit will be squeezed from the first place ZFS tried to read it, even if this takes years. however you will get all kinds of analysis and log data generated during those years (assuming the system stays up enough to write the logs which it probably won''t: http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSFailmodeProblem ) Maybe it''s getting better, but there''s a fundamental philosophical position of what piece of code''s responsible for what sort of blocking all this IMHO. -------------- next part -------------- A non-text attachment was scrubbed... 
> Thanks, sounds like it should handle all but the worst faults OK then; I believe the maximum retry timeout is typically set to about 60 seconds in consumer drives.

Are you sure about this? I thought these consumer level drives would try indefinitely to carry out their operations. Even Samsung's white paper on CCTL RAID error recovery says it could take a minute or longer (see the "Desktop Unsuccessful Error Recovery" diagram):
http://www.samsung.com/global/business/hdd/learningresource/whitepapers/LearningResource_CCTL.html -- This message posted from opensolaris.org
On Thu, Dec 31 at 2:14, Willy wrote:
>> Thanks, sounds like it should handle all but the worst faults OK then; I believe the maximum retry timeout is typically set to about 60 seconds in consumer drives.
>
> Are you sure about this? I thought these consumer level drives would try indefinitely to carry out their operations. Even Samsung's white paper on CCTL RAID error recovery says it could take a minute or longer (see the "Desktop Unsuccessful Error Recovery" diagram):
> http://www.samsung.com/global/business/hdd/learningresource/whitepapers/LearningResource_CCTL.html

Depends very much on the firmware and the error type. Each vendor will have their own trade-secret approaches to solving this issue based on their own failure rates and expected usages.

--eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
I'm in full overthink/overresearch mode on this issue, preparatory to ordering disks for my OS/zfs NAS build. So bear with me. I've been reading manuals and code, but it's hard for me to come up to speed on a new OS quickly.

The question(s) underlying this thread seem to be:
(1) Does zfs raidz/raidz2/etc have the same issue with long recovery times as RAID5? That being dropping a drive from the array because it experiences an error and recovery that lasts longer than the controller (zfs/OS/device driver stack in this case) waits for an error message?
and
(2) Can non "raid edition" drives be set to have shorter error recovery for raid use?

On (1), I pick out the following answers:

=============================================
From Miles Nordin:
n> Does this happen in ZFS?
No. Any timeouts in ZFS are annoyingly based on the ``desktop'' storage stack underneath it, which is unaware of redundancy and of the possibility of reading data from elsewhere in a redundant stripe rather than waiting 7, 30, or 180 seconds for it. ZFS will bang away on a slow drive for hours, bringing the whole system down with it, rather than read redundant data from elsewhere in the stripe, so you don't have to worry about drives dropping out randomly. Every last bit will be squeezed from the first place ZFS tried to read it, even if this takes years.

=============================================
From Darren J Moffat:
A combination of ZFS and FMA on OpenSolaris means it will recover. How long the timeouts actually are will depend on many factors - not just the hard drive and its firmware.

=============================================
From Erik Trimble:
The issue is excessive error recovery times INTERNAL to the hard drive. So the worst case scenario is that ZFS marks the drive as "bad" during a write, causing the zpool to be degraded. It's not going to lose your data. It just may cause a "premature" marking of a drive as bad. None of this kills a RAID (ZFS, traditional SW RAID, or HW RAID). It doesn't cause data corruption. The issue is sub-optimal disk fault determination.

=============================================
From Richard Elling:
For the Solaris sd(7d) driver, the default timeout is 60 seconds with 3 or 5 retries, depending on the hardware. Whether you notice this at the application level depends on other factors: reads vs writes, etc. You can tune this, of course, and you have access to the source.

=============================================
From dubslick:
Are you sure about this? I thought these consumer level drives would try indefinitely to carry out their operations. Even Samsung's white paper on CCTL RAID error recovery says it could take a minute or longer

=============================================
From Bob Friesen:
> For a complete newbie, can someone simply answer the following: will using non-enterprise level drives affect ZFS like it affects hardware RAID?
Yes.

=============================================

So from a group of knowledgeable people I get answers all the way from "no problem, it'll just work, may take a while though" to "...using non-enterprise raid drives will affect zfs just like it does hardware raid", that being to unnecessarily drop out a disk, and thereby expose the array to failure from a second read/write fault on another disk. Most of the votes seem to be in the "no problem" range. But beyond me trying to learn all the source code, is there any way to tell how it will really react?
My issue is this: I *want* the attributes of consumer-level drives other than the infinite retries. I want slow spin speed for low vibration and low power consumption, and am willing to deal with the slower transfer/access speeds to get it. I can pay for (but resent being forced to!) raid-rated drives, but I don't like the extra power consumption needed to get them to be very fast in access and transfers. I'm fine with whipping in a new drive when one of the existing ones gets flaky.

I find that I may be in the curious position of being forced to pay twice the price and expend twice the power to get drives that have many features I don't want or need and don't have what I do need, except for the one issue which may (infrequently!) tear up whatever data I have built. ... maybe...

On question (2), I believe that my research has led to the following: Drives which support the SMART Command Transport spec, which is many newer disks, appear to allow setting timeouts on read/write operations completing. However, this setting appears not to persist beyond a power cycle. Is there any good reason there can't be a driver added to the boot sequence that reads a file listing which drives need to be SCT-set to have timeouts which are shorter than infinite (one of the issues from above) and also short enough to meet the needs of returning errors in a timely manner, so that there is not a huge window for a second fault to corrupt a zfs array?

Forgive me if I'm being too literal here. Think of me as the town idiot asking questions. 8-) -- This message posted from opensolaris.org
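Something close to this can already be scripted at boot with recent smartmontools, provided the drive firmware still honours SCT Error Recovery Control. A sketch only - the device path is an example, the values are tenths of a second, and on some controllers an explicit -d type option is needed:

   # limit read and write error recovery to 7.0 seconds; the setting is
   # volatile, so it has to be reapplied after every power cycle
   smartctl -l scterc,70,70 /dev/rdsk/c0t1d0

   # confirm what the drive actually accepted
   smartctl -l scterc /dev/rdsk/c0t1d0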
On Thu, 31 Dec 2009, R.G. Keen wrote:
> I'm in full overthink/overresearch mode on this issue, preparatory to ordering disks for my OS/zfs NAS build. So bear with me. I've been reading manuals and code, but it's hard for me to come up to speed on a new OS quickly.
>
> The question(s) underlying this thread seem to be:
> (1) Does zfs raidz/raidz2/etc have the same issue with long recovery times as RAID5? That being dropping a drive from the array because it experiences an error and recovery that lasts longer than the controller (zfs/OS/device driver stack in this case) waits for an error message?
> and
> (2) Can non "raid edition" drives be set to have shorter error recovery for raid use?

I like the nice and short answer from this "Bob Friesen" fellow the best. :-)

I have heard that some vendor's drives can be re-flashed or set to use short timeouts. Some vendors don't like this, so they are trying to prohibit it, or doing so may invalidate the warranty. Unless things have changed (since a couple of years ago when I last looked), there are some vendors (e.g. Seagate) who offer "enterprise" SATA drives with only a small surcharge over astonishingly similar desktop SATA drives. The only actual difference seems to be the firmware which is loaded on the drive. Check out the Barracuda ES.2 series.

It does not really matter what Solaris or ZFS does if the drive essentially locks up when it is trying to recover a bad sector.

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> On Thu, 31 Dec 2009, Bob Friesenhahn wrote:
> I like the nice and short answer from this "Bob Friesen" fellow the best. :-)

It was succinct, wasn't it? 8-)

Sorry - I pulled the attribution from the ID, not the signature which was waiting below. DOH!

When you say:
> It does not really matter what Solaris or ZFS does if the drive essentially locks up when it is trying to recover a bad sector.

I'd have to say that it depends. If Solaris/zfs/etc. is restricted to actions which consist of marking the disk semi-permanently bad and continuing, yes, it amounts to the same thing: it opens a yawning chasm of "one more error and you're dead," until the array can be serviced and un-degraded. At least I think it does, based on what I've read, anyway.

However, if OS/S/zfs/etc. performs an appropriate fire drill up to and including logging the issues, quiescing the array, and annoying the operator, then it closes up the sudden-death window. This gives the operator of the array a chance to do something about it, such as swapping in a spare and starting rebuilding/resilvering/etc.

Given the largish aggregate monetary value to RAIDZ builders of sidestepping the doubled cost of raid-specialized drives, it occurs to me that having a special set of actions for desktop-ish drives might be a good idea. Something like a fix-the-failed repair mode which pulls all recoverable data off the purportedly failing drive and onto a new spare to avoid a monster resilvering and the associated vulnerable time to a second or third failure.

Viewed in that light, exactly what OS/S/zfs does on a long extended reply from a disk, and exactly what can be done to minimize the time when the array runs in a degraded mode where the next step loses the data, seems to be a really important issue. Well, OK, it does to me because my purpose here is getting to background scrubbing of errors in the disks. Other things might be more important to others. 8-)

And the question might be moot if the SMART SCT architecture in desktop drives lets you do a power-on hack to shorten the reply-failed time for better raid operation. That's actually the solution I'd like to see in a perfect world - I get back to a redundant array of INEXPENSIVE disks, and I can pick those disks to be big and slow/low power instead of fast/high power.

I'd welcome any enlightened speculation on this. I do recognize that I'm an idiot on these matters compared to people with actual experience. 8-) -- This message posted from opensolaris.org
On Thu, 31 Dec 2009, R.G. Keen wrote:
> Given the largish aggregate monetary value to RAIDZ builders of sidestepping the doubled cost of raid-specialized drives, it occurs to me that having a special set of actions for desktop-ish drives might be a good idea. Something like a fix-the-failed repair mode which pulls all recoverable data off the purportedly failing drive and onto a new spare to avoid a monster resilvering and the associated vulnerable time to a second or third failure.

The problem is that a "desktop-ish drive" may single-mindedly focus on reading the bad data while otherwise responding as if it is alive. So everything just waits a long time while the OS sends new requests to the drive (which are received) but the OS does not get the requested data back. To make matters worse, the OS might send another request for the same data, the drive gives up on the last request, and then proceeds with the new request for the same bad data.

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Dec 31, 2009, at 6:14 PM, R.G. Keen wrote:
>> On Thu, 31 Dec 2009, Bob Friesenhahn wrote:
>> I like the nice and short answer from this "Bob Friesen" fellow the best. :-)
> It was succinct, wasn't it? 8-)
>
> Sorry - I pulled the attribution from the ID, not the signature which was waiting below. DOH!
>
> When you say:
>> It does not really matter what Solaris or ZFS does if the drive essentially locks up when it is trying to recover a bad sector.
> I'd have to say that it depends. If Solaris/zfs/etc. is restricted to actions which consist of marking the disk semi-permanently bad and continuing, yes, it amounts to the same thing: it opens a yawning chasm of "one more error and you're dead," until the array can be serviced and un-degraded. At least I think it does, based on what I've read, anyway.

Some nits: disks aren't marked as semi-bad, but if ZFS has trouble with a block, it will try to not use the block again. So there are two levels of recovery at work: whole device and block.

The "one more and you're dead" is really N errors in T time. For disks which don't return when there is an error, you can reasonably expect that T will be a long time (multiples of 60 seconds) and therefore the N in T threshold will not be triggered.

The term "degraded" does not have a consistent definition across the industry. See the zpool man page for the definition used for ZFS. In particular, DEGRADED != FAULTED

> However, if OS/S/zfs/etc. performs an appropriate fire drill up to and including logging the issues, quiescing the array, and annoying the operator, then it closes up the sudden-death window. This gives the operator of the array a chance to do something about it, such as swapping in a spare and starting rebuilding/resilvering/etc.

Issues are logged, for sure. If you want to monitor them proactively, you need to configure SNMP traps for FMA.

> Given the largish aggregate monetary value to RAIDZ builders of sidestepping the doubled cost of raid-specialized drives, it occurs to me that having a special set of actions for desktop-ish drives might be a good idea. Something like a fix-the-failed repair mode which pulls all recoverable data off the purportedly failing drive and onto a new spare to avoid a monster resilvering and the associated vulnerable time to a second or third failure.

It already does this, as long as there are N errors in T time. There is room for improvement here, but I'm not sure how one can set a rule that would explicitly take care of the I/O never returning from a disk while a different I/O to the same disk returns. More research required here...

> Viewed in that light, exactly what OS/S/zfs does on a long extended reply from a disk, and exactly what can be done to minimize the time when the array runs in a degraded mode where the next step loses the data, seems to be a really important issue.

Once the state changes to DEGRADED, the admin must zpool clear the errors to return the state to normal. Make sure your definition of degraded matches.

> Well, OK, it does to me because my purpose here is getting to background scrubbing of errors in the disks. Other things might be more important to others. 8-)
>
> And the question might be moot if the SMART SCT architecture in desktop drives lets you do a power-on hack to shorten the reply-failed time for better raid operation.
> That's actually the solution I'd like to see in a perfect world - I get back to a redundant array of INEXPENSIVE disks, and I can pick those disks to be big and slow/low power instead of fast/high power.

In my experience, disk drive firmware quality and feature sets vary widely. I've got a bunch of scars from shaky firmware and I even got a new one a few months ago. So perhaps one day the disk vendors will perfect their firmware? :-)

> I'd welcome any enlightened speculation on this. I do recognize that I'm an idiot on these matters compared to people with actual experience. 8-)

So you want some scars too? :-) -- richard
> On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
> Some nits: disks aren't marked as semi-bad, but if ZFS has trouble with a block, it will try to not use the block again. So there are two levels of recovery at work: whole device and block.

Ah. I hadn't found that yet.

> The "one more and you're dead" is really N errors in T time.

I'm interpreting this as "OS/S/zfs/drivers will not mark a disk as failed until it returns N errors in T time," which means - check me on this - that to get a second failed disk, the time to get a second real-or-fake failed disk is T, where T is the time a second soft-failing disk may happen while the system is balled up in worrying about the first disk not responding in T time.

This is based on a paper I read online about the increasing need for raidz3 or similar over raidz2 or similar because throughput from disks has not increased concomitantly with their size; this leads to increasing times to recover from first failures using the stored checking data in the array to rebuild. The notice-an-error time plus the rebuild-the-array time is the window in which losing another disk, soft or hard, will lead to the inability to resilver the array.

> For disks which don't return when there is an error, you can reasonably expect that T will be a long time (multiples of 60 seconds) and therefore the N in T threshold will not be triggered.

The scenario I had in mind was two disks ready to fail, either soft (long time to return data) or hard (bang! That sector/block or disk is not coming back, period). The first fails and starts trying to recover in desktop-disk fashion, maybe taking hours. This leaves the system with no error report (i.e. the N-count is zero) and the T-timer ticking. Meanwhile the array is spinning. The second fragile disk is going to hit its own personal pothole at some point soon in this scenario.

What happens next is not clear to me. Is OS/S/zfs going to suspend disk operations until it finally does hear from first failing disk 1, based on N still being at 0 because the disk hasn't reported back yet? Or will the array continue with other operations, noting that the operation involving failing disk 1 has not completed, and either stack another request on failing disk 1, or access failing disk 2 and get its error too at some point? Or both?

If the timeout is truly N errors in T time, and N is never reported back because the disk spends some hours retrying, then it looks like this is a zfs hang if not a system hang. If there is a timeout of some kind which takes place even if N never gets over 0, that would at least unhang the file system/system, but it opens you to the second failing disk fault having occurred, and you're in for another of either hung-forever or failed-array in the case of raidz.

> The term "degraded" does not have a consistent definition across the industry.

Of course not! 8-) Maybe we should use "depraved" 8-)

> See the zpool man page for the definition used for ZFS. In particular, DEGRADED != FAULTED

> Issues are logged, for sure. If you want to monitor them proactively, you need to configure SNMP traps for FMA.

Ok, can deal with that.

> It already does this, as long as there are N errors in T time.

OK, I can work that one out. I'm still puzzled on what happens with the "N=0 forever" case.
The net result on that one seems to be that you need raid-specific disks to get some kind of timeout to happen at the disk level ever (depending on the disk firmware, which, as you note later, is likely to have been written by a junior EE as his first assignment 8-) )

> There is room for improvement here, but I'm not sure how one can set a rule that would explicitly take care of the I/O never returning from a disk while a different I/O to the same disk returns. More research required here...

Yep. I'm thinking that it might be possible to do a policy-based setup section for an array where you could select one of a number of rule-sets for what to do, based on your experience and/or paranoia about the disks in your array. I had good luck with that in a primitive whole-machine hardware diagnosis system I worked with at one point in the dim past. Kind of "if you can't do the right/perfect thing, then ensure that *something* happens."

One of the rules scenarios might be "if one seek to a disk never returns and other actions to that disk do work, then halt the pending action(s) to the disk and/or array, increment N, restart that disk or the entire array as needed, and retry that action in a diagnostic loop, which decides whether it's a soft fail, hard block fail, or hard disk fail" and then take the proper action based on the diagnostic. Or it could be "map that disk out and run diagnostics on it while the hot spare is swapped in" based on whether there's a hot spare or not.

But yes, some thought is needed. I always tend to pick the side of "let the user/admin pick the way they want to fail" which may not be needed or wanted.

> Once the state changes to DEGRADED, the admin must zpool clear the errors to return the state to normal. Make sure your definition of degraded matches.

I still like "depraved"... 8-)

> In my experience, disk drive firmware quality and feature sets vary widely. I've got a bunch of scars from shaky firmware and I even got a new one a few months ago. So perhaps one day the disk vendors will perfect their firmware? :-)

Yep - see "junior EE as disk firmware programmer" above.

> So you want some scars too? :-)

Probably. It's nice to find someone else who uses the scars analogy. I was just this Christmas pointing out the assortment of thin, straight scars on my hands to a nephew to whom I gave a new knife for the holiday.

Another way to put it is that experience is what you have left after you've forgotten their name.

R.G. -- This message posted from opensolaris.org
On Jan 1, 2010, at 8:11 AM, R.G. Keen wrote:
>> On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
>> Some nits: disks aren't marked as semi-bad, but if ZFS has trouble with a block, it will try to not use the block again. So there are two levels of recovery at work: whole device and block.
> Ah. I hadn't found that yet.
>
>> The "one more and you're dead" is really N errors in T time.
> I'm interpreting this as "OS/S/zfs/drivers will not mark a disk as failed until it returns N errors in T time," which means - check me on this - that to get a second failed disk, the time to get a second real-or-fake failed disk is T, where T is the time a second soft-failing disk may happen while the system is balled up in worrying about the first disk not responding in T time.

Perhaps I am not being clear. If a disk is really dead, then there are several different failure modes that can be responsible. For example, if a disk does not respond to selection, then it is diagnosed as failed very quickly. But that is not the TLER case. The TLER case is when the disk cannot read from media without error, so it will continue to retry... perhaps forever or until reset. If a disk does not complete an I/O operation in (default) 60 seconds (for the sd driver), then it will be reset and the I/O operation retried. If a disk returns bogus data (failed ZFS checksum), then the N in T algorithm may kick in. I have seen this failure mode many times.

> This is based on a paper I read online about the increasing need for raidz3 or similar over raidz2 or similar because throughput from disks has not increased concomitantly with their size; this leads to increasing times to recover from first failures using the stored checking data in the array to rebuild. The notice-an-error time plus the rebuild-the-array time is the window in which losing another disk, soft or hard, will lead to the inability to resilver the array.

A similar observation is that the error rate (errors/bit) has not changed, but the number of bits continues to increase.

>> For disks which don't return when there is an error, you can reasonably expect that T will be a long time (multiples of 60 seconds) and therefore the N in T threshold will not be triggered.
> The scenario I had in mind was two disks ready to fail, either soft (long time to return data) or hard (bang! That sector/block or disk is not coming back, period). The first fails and starts trying to recover in desktop-disk fashion, maybe taking hours.

Yes, this is the case for TLER. The only way around this is to use disks that return failures when they occur.

> This leaves the system with no error report (i.e. the N-count is zero) and the T-timer ticking. Meanwhile the array is spinning. The second fragile disk is going to hit its own personal pothole at some point soon in this scenario.
>
> What happens next is not clear to me. Is OS/S/zfs going to suspend disk operations until it finally does hear from first failing disk 1, based on N still being at 0 because the disk hasn't reported back yet? Or will the array continue with other operations, noting that the operation involving failing disk 1 has not completed, and either stack another request on failing disk 1, or access failing disk 2 and get its error too at some point? Or both?

ZFS issues I/O in parallel.
However, that does not prevent an application or ZFS metadata transactions from waiting on a sequence of I/O.

> If the timeout is truly N errors in T time, and N is never reported back because the disk spends some hours retrying, then it looks like this is a zfs hang if not a system hang.

The drivers will retry and fail the I/O. By default, for SATA disks using the sd driver, there are 5 retries of 60 seconds. After 5 minutes, the I/O will be declared failed and that info is passed back up the stack to ZFS, which will start its recovery. This is why the T part of N in T doesn't work so well for the TLER case.

> If there is a timeout of some kind which takes place even if N never gets over 0, that would at least unhang the file system/system, but it opens you to the second failing disk fault having occurred, and you're in for another of either hung-forever or failed-array in the case of raidz.

I don't think the second disk scenario adds value to this analysis.

>> The term "degraded" does not have a consistent definition across the industry.
> Of course not! 8-) Maybe we should use "depraved" 8-)
>
>> See the zpool man page for the definition used for ZFS. In particular, DEGRADED != FAULTED
>
>> Issues are logged, for sure. If you want to monitor them proactively, you need to configure SNMP traps for FMA.
> Ok, can deal with that.
>
>> It already does this, as long as there are N errors in T time.
> OK, I can work that one out. I'm still puzzled on what happens with the "N=0 forever" case. The net result on that one seems to be that you need raid-specific disks to get some kind of timeout to happen at the disk level ever (depending on the disk firmware, which, as you note later, is likely to have been written by a junior EE as his first assignment 8-) )

As above, there is no forever case. But some folks get impatient after a few minutes :-)

>> There is room for improvement here, but I'm not sure how one can set a rule that would explicitly take care of the I/O never returning from a disk while a different I/O to the same disk returns. More research required here...
> Yep. I'm thinking that it might be possible to do a policy-based setup section for an array where you could select one of a number of rule-sets for what to do, based on your experience and/or paranoia about the disks in your array. I had good luck with that in a primitive whole-machine hardware diagnosis system I worked with at one point in the dim past. Kind of "if you can't do the right/perfect thing, then ensure that *something* happens."
>
> One of the rules scenarios might be "if one seek to a disk never returns and other actions to that disk do work, then halt the pending action(s) to the disk and/or array, increment N, restart that disk or the entire array as needed, and retry that action in a diagnostic loop, which decides whether it's a soft fail, hard block fail, or hard disk fail" and then take the proper action based on the diagnostic. Or it could be "map that disk out and run diagnostics on it while the hot spare is swapped in" based on whether there's a hot spare or not.

The diagnosis engines and sd driver are open source :-)
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/scsi/targets/sd.c

> But yes, some thought is needed.
> I always tend to pick the side of "let the user/admin pick the way they want to fail" which may not be needed or wanted.

Interesting. If you have thoughts along this line, fm-discuss or driver-discuss can be a better forum than zfs-discuss (ZFS is a consumer of time-related failure notifications). -- richard
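Related to the five-retries-of-60-seconds behaviour described above: newer sd drivers also accept per-device overrides in sd.conf, which avoids changing the timeout for every disk in the system. A sketch only - the vendor/product string must match your drives exactly (the vendor field is padded to 8 characters), and the available tunable names should be checked against sd(7D) for your build:

   # /kernel/drv/sd.conf
   # drop the number of retries after a command timeout for these drives
   sd-config-list = "ATA     WDC WD10EADS", "retries-timeout:1";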
> Richard Elling wrote:
> Perhaps I am not being clear. If a disk is really dead, then there are several different failure modes that can be responsible. For example, if a disk does not respond to selection, then it is diagnosed as failed very quickly. But that is not the TLER case. The TLER case is when the disk cannot read from media without error, so it will continue to retry... perhaps forever or until reset. If a disk does not complete an I/O operation in (default) 60 seconds (for the sd driver), then it will be reset and the I/O operation retried.

I suspect you're being clear - it's just that I'm building the runway ahead of the plane as I take off. 8-)

So one case is the disk hits an error on a sector read, and retries, potentially for a long, long time. The SD driver waits 60 seconds, then resets the disk if one hasn't changed the timeout. Presumably the I/O operation is retried for the portion of the read that's on that disk. (...er, that's what I think would happen, anyway; is that right?) What happens then depends on whether the disk in question does return good on the retried operation.

In the case of (a) retry gives a good result: does the driver/zfs mark the block as problematic and move it to a different physical sector? Or just note that there was a problem there? This is what I'd call the soft-error-once scenario.

(b) retry takes more than 60 seconds again: I'm not clear what the driver does here. N in T? Two tries and I remap you? Semi-infinite loop of retries? This is what I'd call the soft-error-forever scenario. And this is the error that would distinguish between TLER/ERC/CCTL disks and desktop disks.

> If a disk returns bogus data (failed ZFS checksum), then the N in T algorithm may kick in. I have seen this failure mode many times.

This is yet another error, not related to TLER/ERC/CCTL. In this case, the disk returns data that is wrong. In my limited understanding, this is what would happen in a scrub operation, where a soft error has happened, or where incorrect data has been correctly written on the disk. The disk does not detect an error and simply go off into the tall grass counting its toes forever, but instead promptly returns bad data. Both desktop and raid version disks would be OK in zfs with this error, in that the error would be handled by the error paths I already (if dimly!) comprehend.

> A similar observation is that the error rate (errors/bit) has not changed, but the number of bits continues to increase.

Yes. The paper notes that the bit error rate has improved by two orders of magnitude, but the number of bits has kept slightly ahead. The killer is that the time required to fill a replacement disk with new, correct data has distinctly not kept pace with BER or capacity. That leads to long repair operations and increases the time available and therefore the probability of a successive failure happening.

> >> For disks which don't return when there is an error, you can reasonably expect that T will be a long time (multiples of 60 seconds) and therefore the N in T threshold will not be triggered.
> > The scenario I had in mind was two disks ready to fail, either soft (long time to return data) or hard (bang! That sector/block or disk is not coming back, period). The first fails and starts trying to recover in desktop-disk fashion, maybe taking hours.
> Yes, this is the case for TLER. The only way around this is to use disks that return failures when they occur.

OK.
From the above suppositions, if we had a desktop (infinitely long retry on fail) disk and a soft-fail error in a sector, then the disk would effectively hang each time the sector was accessed. This would lead to
(1) ZFS->SD-> disk read of failing sector
(2) disk does not reply within 60 seconds (default)
(3) disk is reset by SD
(4) operation is retried by SD(?)
(5) disk does not reply within 60 seconds (default)
(6) disk is reset by SD ?

then what? If I'm reading you correctly, the following string of events happens:

> The drivers will retry and fail the I/O. By default, for SATA disks using the sd driver, there are 5 retries of 60 seconds. After 5 minutes, the I/O will be declared failed and that info is passed back up the stack to ZFS, which will start its recovery. This is why the T part of N in T doesn't work so well for the TLER case.

Hmmm... actually, it may be just fine for my personal wants. If I had a desktop drive which went unresponsive for 60 seconds on an I/O soft error, then the timeout would be five minutes. At that time, zfs would... check me here... mark the block as failed, and try to relocate the block on the disk. If that worked fine, the previous sectors would be marked as unusable, and work goes on, but with the actions noted in the logs. If the relocation didn't work, eventually zfs(?) SD(?) would decide that the disk was unusable, and ... yelp for help?... start rebuilding? roll in the hot spares?... send in the clowns?

I want zfs for background scrubbing and am only minimally worried about speed and throughput. So taking five minutes to recover from a disk failure is not a big issue to me. I just want to not lose bits once I put them into the zfs bucket.

Again, I apologize for the Ned-and-the-first-reader questions. I'm trying to locate what happens in the manual and code, but I'm kind of building the runway ahead of the plane taking off. I really very much appreciate your taking time to help me understand.

> I don't think the second disk scenario adds value to this analysis.

Only that it is the motivator for wanting recovery to be as short as possible. If each disk has a bit error rate of one bit per X seconds/days/years, then the probability of losing data can be expressed as a function of how long the array spends between the occurrence of the first error and the time until you have resolved the issue back to full redundancy. This would be the time to do any rebuild/resilvering/reintroduction of disks to get back to stable operation.

> The diagnosis engines and sd driver are open source :-)

Yeah... all you gotta do is be able to read and comprehend OS and drive source code in the language. I'm working on that. 8-)

> Interesting. If you have thoughts along this line, fm-discuss or driver-discuss can be a better forum than zfs-discuss (ZFS is a consumer of time-related failure notifications).

I'll get to that one day. Thanks again. R.G. -- This message posted from opensolaris.org
On Sat, Jan 2, 2010 at 4:07 PM, R.G. Keen <keen at geofex.com> wrote:> OK. From the above suppositions, if we had a desktop (infinitely > long retry on fail) disk and a soft-fail error in a sector, then the > disk would effectively hang each time the sector was accessed. > This would lead to > (1) ZFS->SD-> disk read of failing sector > (2) disk does not reply within 60 seconds (default) > (3) disk is reset by SD > (4) operation is retried by SD(?) > (5) disk does not reply within 60 seconds (default) > (6) disk is reset by SD ? > > then what? If I''m reading you correctly, the following string of > events happens: > >> The drivers will retry and fail the I/O. By default, for SATA >> disks using the sd driver, there are 5 retries of 60 seconds. >> After 5 minutes, the I/O will be declared failed and that info >> is passed back up the stack to ZFS, which will start its >> recovery. ?This is why the T part of N in T doesn''t work so >> well for the TLER case. > > Hmmm... actually, it may be just fine for my personal wants. > If I had a desktop drive which went unresponsive for 60 seconds > on an I/O soft error, then the timeout would be five minutes. > at that time, zfs would... check me here... mark the block as > failed, and try to relocate the block on the disk. If that worked > fine, the previous sectors would be marked as unusable, and > work goes on, but with the actions noted in the logs.We use Seagate Barracuda ES.2 1TB disks and every time the OS starts to bang on a region of the disk with bad blocks (which essentially degrades the performance of the whole pool) we get a call from our clients complaining about NFS timeouts. They usually last for 5 minutes but I''ve seen it last for a whole hour while the drive is slowly dying. Off-lining the faulty disk fixes it. I''m trying to find out how the disks'' firmware is programmed (timeouts, retries, etc) but so far nothing in the official docs. In this case the disk''s retry timeout seem way too high for our needs and I believe a timeout limit imposed by the OS would help. -- Giovanni P. Tirloni
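For the archives, the off-lining Giovanni mentions is just the stock zpool workflow; the pool and device names below are examples, so substitute whatever zpool status reports for your own setup:

   zpool status -x                    # show only pools with problems, and which device is complaining
   zpool offline tank c5t3d0          # stop sending I/O to the slow or dying disk
   zpool replace tank c5t3d0 c5t6d0   # later, resilver onto a replacement drive

This at least gets the NFS clients responsive again while the bad drive is dealt with.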
Giovanni Tirloni <tirloni at gmail.com> wrote:> We use Seagate Barracuda ES.2 1TB disks and every time the OS starts > to bang on a region of the disk with bad blocks (which essentially > degrades the performance of the whole pool) we get a call from our > clients complaining about NFS timeouts. They usually last for 5 > minutes but I''ve seen it last for a whole hour while the drive is > slowly dying. Off-lining the faulty disk fixes it. > > I''m trying to find out how the disks'' firmware is programmed > (timeouts, retries, etc) but so far nothing in the official docs. In > this case the disk''s retry timeout seem way too high for our needs and > I believe a timeout limit imposed by the OS would help. Did you upgrade the firmware last spring? There is a known bug in the firmware that may let them go into alzheimer mode. Jörg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On Mon, Jan 4, 2010 at 3:51 PM, Joerg Schilling <Joerg.Schilling at fokus.fraunhofer.de> wrote:> Giovanni Tirloni <tirloni at gmail.com> wrote: > >> We use Seagate Barracuda ES.2 1TB disks and every time the OS starts >> to bang on a region of the disk with bad blocks (which essentially >> degrades the performance of the whole pool) we get a call from our >> clients complaining about NFS timeouts. They usually last for 5 >> minutes but I''ve seen it last for a whole hour while the drive is >> slowly dying. Off-lining the faulty disk fixes it. >> >> I''m trying to find out how the disks'' firmware is programmed >> (timeouts, retries, etc) but so far nothing in the official docs. In >> this case the disk''s retry timeout seem way too high for our needs and >> I believe a timeout limit imposed by the OS would help. > > Did you upgrade the firmware last spring? > > There is a known bug in the firmware that may let them go into alzheimer mode.No, as their "serial number check utility" was not returning any upgrades for the disks I checked... but now I see in the forums that they released some new versions. Thanks for the heads up. I''ll give it a try and hopefully we can see some improvement here. -- Giovanni P. Tirloni
One reason I was so interested in this issue was the double-price of "raid enabled" disks. However, I realized that I am doing the initial proving, not production - even if personal - of the system I''m building. So for that purpose, an array of smaller and cheaper disks might be good. In the process of looking at that, I found that geeks.com has Seagate 750GB Barracuda ES2 drives for $58 each if you''ll put up with them being "factory recertified" and only warrantied for six months. Not great, I don''t trust "refurbished" or "recertified" anything with archival data; but it''s a test system. So I grabbed six of them for the initial build. This gives me a way to compare them against "desktop" systems in an array. May take a while but I can dig some of the issues out. -- This message posted from opensolaris.org
Wow, that is cheap for an "enterprise class" drive. A little over 1/3 of the reviews at newegg rated this drive as very poor http://www.newegg.com/Product/ProductReview.aspx?Item=N82E16822148295 Hopefully, they''ve fixed whatever issues with your drives :-) Be sure to do the firmware update http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207963 I''m about to setup a home NAS based on the Intel SS4200 SOHO system and want to try out ZFS. I''ve also been debating between green/desktop drives vs expensive enterprise class drives. -- This message posted from opensolaris.org
Well, there had to be some reason that they had enough of them come back to run a "recertifying" program. 8-) I rather expected something of that sort; thanks for doing the homework for me! I appreciate the help. I probably won''t ever trust these drives; they were just convenient for the test system, and may have the advantage (?!) of more failures to try out the beauties of zfs. The aggregate cost was only a little more than the drives I''d have otherwise used, and offered a chance to try E-drives versus normal. -- This message posted from opensolaris.org
On Wed, 6 Jan 2010, R.G. Keen wrote:> > I probably won''t ever trust these drives; they were just convenient > for the test system, and may have the advantage (?!) of more > failures to try out the beauties of zfs.The drives are probably just fine. Most likely Seagate "unbricked" them and installed new firmware. If they work without fail for a few weeks, then they are likely equivalent to a new drive. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
To those concerned about this issue, there is a patched version of smartmontools that enables the querying and setting of TLER/ERC/CCTL values (well, except for recent desktop drives from Western Digital). It''s available here: http://www.csc.liv.ac.uk/~greg/projects/erc/ Unfortunately, smartmontools has limited SATA drive support in opensolaris, and you cannot query or set the values. I''m looking into booting into Linux, setting the values, and then rebooting into opensolaris since the settings will survive a warm reboot (but not a powercycle). -- This message posted from opensolaris.org
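For reference, under Linux the query-and-set step looks something like this, assuming the patched smartctl uses the same syntax later adopted by stock smartmontools for SCT ERC, and with /dev/sda standing in for whichever disk is being configured:

   smartctl -l scterc /dev/sda          # query the current read/write error recovery timers
   smartctl -l scterc,70,70 /dev/sda    # limit both to 7.0 seconds (the units are tenths of a second)

After that, a warm reboot into opensolaris should carry the setting across, until the next power cycle as noted above.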
On Wed, Jan 20, 2010 at 10:04:34AM -0800, Willy wrote:> To those concerned about this issue, there is a patched version of > smartmontools that enables the querying and setting of TLER/ERC/CCTL > values (well, except for recent desktop drives from Western > Digital). [Joining together two recent threads...] Can you (or anyone else on the list) confirm if this works with the samsung drives discussed here recently? (HD154UI and the 2TB version) I''ve been a regular purchaser of WD drives for some time, and they have been good to me. However, I found this recent change disturbing and annoying; now that I realise it is actually against the standards I''m even more annoyed. It''s coming time to purchase another batch of disks, so I have begun paying closer attention again recently. WD may try to "force" customers to buy the more expensive drives, but find instead that their customers choose another drive manufacturer altogether. Users of RAID (to whom this change matters) are, by definition, purchasers of larger numbers of drives. I was also interested in the 4k-sector WD "advanced format" WD-EARS drives, but if they have the same limitation, and the Samsung drives allow ERC, my choice is made. > It''s available here: > http://www.csc.liv.ac.uk/~greg/projects/erc/ > > Unfortunately, smartmontools has limited SATA drive support in > opensolaris, and you cannot query or set the values. I''m looking into > booting into Linux, setting the values, and then rebooting into > opensolaris since the settings will survive a warm reboot (but not a > powercycle). This clearly needs to be fixed and is a project worth someone taking on! Any volunteers? -- Dan.
+1 I agree 100% I have a website whose ZFS Home File Server articles are read around 1 million times a year, and so far I have recommended Western Digital drives wholeheartedly, as I have found them to work flawlessly within my RAID system using ZFS. With this recent action by Western Digital of disabling the ability to time-limit the error reporting period, thus effectively forcing consumer RAID users to buy their RAID-version drives at 50%-100% price premium, I have decided not to use Western Digital drives any longer, and have explained why here: http://breden.org.uk/2009/05/01/home-fileserver-a-year-in-zfs/ (look in the Drives section) Like yourself, I too am searching for consumer-priced drives where it''s still possible to set the error reporting period. I''m also looking at the Samsung models at the moment -- either the HD154UI 1.5TB drive or the HD203WI 2TB drives... and if it''s possible to set the error reporting time then these will be my next purchase. They have quite good user ratings at newegg.com... If WD lose money over this, they might rethink their strategy. Until then, bye bye WD. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/ -- This message posted from opensolaris.org
And I agree as well. WD was about to get upwards of $500-$700 of my money, and is now getting zero; this issue alone has moved me to look harder at other drives. I''m sure a WD rep would tell us about how there are extra unseen goodies in the RE line. Maybe. -- This message posted from opensolaris.org
Thanks! Yep, I was about to buy six or so WD15EADS or WD15EARS drives, but it looks like I will not be ordering them now. The bad news is that after looking at the Samsungs it seems that they too have no way of changing the error reporting time in the ''desktop'' drives. I hope I''m wrong though. I refuse to pay silly money for ''raid editions'' of these drives. -- This message posted from opensolaris.org
I have 4 of the HD154UI Samsung Ecogreens, and was able to set the error reporting time using HDAT2. The settings would survive a warm reboot, but not a powercycle. I too would like to thank you for your blog. It provided a lot of guidance for me in setting up OS and ZFS for my home NAS. -- This message posted from opensolaris.org
>>>>> "dc" == Daniel Carosone <dan at geek.com.au> writes: >>>>> "w" == Willy <willy.mene at gmail.com> writes: >>>>> "sb" == Simon Breden <sbreden at gmail.com> writes:First of all, I''ve been so far assembling vdev stripes from different manufacturers, such that one manufacturer can have a bad batch or firmware bug killing all their drives at once without losing my pool. Based on recent drive problems I think this is a really wise idea. w> http://www.csc.liv.ac.uk/~greg/projects/erc/ dead link? w> Unfortunately, smartmontools has limited SATA drive support in w> opensolaris, and you cannot query or set the values. also the driver stack is kind of a mess with different mid-layers depending on which SATA low-level driver you use, and many proprietary no-source low-level drivers, neither of which you have to deal with on Linux. Maybe in a decade it will get better if the oldest driver we have to deal with is AHCI, but yes smartmontools vs. uscsi still needs fixing! w> I have 4 of the HD154UI Samsung Ecogreens, and was able to set w> the error reporting time using HDAT2. The settings would w> survive a warm reboot, but not a powercycle. after stfw this seems to be some MesS-DOS binary-only tool. Maybe you can run it in virtualbox and snoop on its behavior---this worked for me with Wine and a lite-on RPC tool. At least on Linux you can for example run CD burning programs from within Wine---it is that good. sb> RAID-version drives at 50%-100% price premium, I have decided sb> not to use Western Digital drives any longer, and have sb> explained why here: sb> http://breden.org.uk/2009/05/01/home-fileserver-a-year-in-zfs/ IMHO it is just a sucker premium because the feature is worthless anyway. From the discussion I''ve read here, the feature is designed to keep drives which are *reporting failures* to still be considered *GOOD*, and to not drop out of RAIDsets in RAID-on-a-card implementations with RAID-level timeouts <60seconds. It is a workaround for huge modern high-BER drives and RAID-on-card firmware that''s (according to some person''s debateable idea) not well-matched to the drive. Of course they are going to sell it as this big valuable enterprise optimisation, but at its root it has evloved as a workaround for someone else''s broken (from WD POV) software. The solaris timeout, because of m * n * o multiplicative layered speculative retry nonsense, is 60 seconds or 180 seconds or many hours, so solaris is IMHO quite broken in this regard but also does not benefit from the TLER workaround: the long-TLER drives will not drop out of RAIDsets on ZFS even if they report an error now and then. What''s really needed for ZFS or RAID in general is (a) for drives to never spend more than x% of their time attempting recovery, so they don''t effectively lose ALL the data on a partially-damaged drive by degrading performance to the point it would take n years to read out what data they''re able to deliver and (b) RAID-level smarts to dispatch reads for redundant data when a drive becomes slow without reporting failure, and to diagnose drives as failed based on statistical measurements of their speed. TLER does not deliver (a) because reducing error retries to 5 seconds is still 10^3 slowdown instead of 10^4 and thus basically no difference, and the hard drive can never do (b) it''s a ZFS-level feature. so my question is, have you actually found cases where ZFS needs TLER adjustments, or are you just speculating and synthesizing ideas from a mess of whitepaper marketing blurbs? 
Because a 7-second-per-read drive will fuck your pool just as badly as a 70-second-per-read drive: you''re going to have to find and unplug it before the pool will work again.
Thanks for your reply Miles. I think I understand your points, but unfortunately my historical knowledge of the need for TLER etc solutions is lacking. How I''ve understood it to be (as generic as possible, but possibly inaccurate as a result): 1. In simple non-RAID single drive ''desktop'' PC scenarios where you have one drive, if your drive is experiencing read/write errors, as this is the only drive you have, and therefore you have no alternative redundant source of data to help with required reconstruction/recovery, you REALLY NEED your drive to try as hard as possible to recover the data, therefore a long ''deep recovery'' process may be kicked off to try to fix/recover the problematic data being read/written. 2. Historically, with hardware RAID arrays, where redundant data *IS* available, you really DON''T want any drive with trivial occasional block read errors to be kicked from the array, so the idea was to have drives experiencing read errors report quickly to the hardware RAID controller that there''s a problem, so that the hardware RAID controller can then quickly reconstruct the missing data by using the redundant parity data. 3. With ZFS, I think you''re saying that if, for example, there''s a block read error, then even with a RAID EDITION (TLER) drive, you''re still looking at a circa 7 second delay before the error is reported to ZFS, and if you''re using a cheapo standard non-RAID edition drive then you''re looking at a likely circa 60/70 second delay before ZFS is notified. Either way, you say that ZFS won''t kick the drive, yes? And worst case is that depending on arbitrary ''unknowns'' relating to the particular drive''s firmware chemistry/storage stack, relating to the storage array''s responsiveness, ''some time'' could be ''mucho time'' if you''re unlucky. And to summarise, you don''t see any point in spending a high premium on RAID-edition drives if using with ZFS, yes? And also, you don''t think that using non-RAID edition drives presents a significant additional data loss risk? Cheers, Simon http://breden.org.uk/2009/05/01/home-fileserver-a-year-in-zfs/ -- This message posted from opensolaris.org
Thanks a lot. I''d looked at SO many different RAID boxes and never had a good feeling about them from the point of view of data safety, so that when I read the ''A Conversation with Jeff Bonwick and Bill Moore - The future of file systems'' article (http://queue.acm.org/detail.cfm?id=1317400), I was convinced that ZFS sounded like what I needed, and thought I''d try to help others see how good ZFS was and how to make their own home systems that work. Publishing the notes as articles had the side-benefit of allowing me to refer back to them when I was reinstalling a new SXCE build etc afresh... :) It''s good to see that you''ve been able to set the error reporting time using HDAT2 for your Samsung HD154UI drives, but it is a pity that the change does not persist through cold starts. From a brief look, it looks like the utility runs under DOS, so I wonder if it would be possible to convert the code into C and run it immediately after OpenSolaris has booted? That would seem a reasonable automated workaround. I might take a little look at the code. However, the big questions still remain: 1. Does ZFS benefit from shorter error reporting times? 2. Does having shorter error reporting times provide any significant data safety through, for example, preventing ZFS from kicking a drive from a vdev? Those are the questions I would like to hear somebody give an authoritative answer to. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/ -- This message posted from opensolaris.org
On Fri, Jan 22, 2010 at 04:12:48PM -0500, Miles Nordin wrote:> w> http://www.csc.liv.ac.uk/~greg/projects/erc/ > > dead link?Works for me - this is someone who''s written patches for smartctl to set this feature; these are standardised/documented commands, no reverse engineering of DOS tools required.> IMHO it is just a sucker premium because the feature is worthless > anyway.There are two points here: - is the feature worth paying the premium for "raid edition" drives, assuming it''s the main difference between them? If there are other differences, they have to factor into the assessment. For me and others here, the answer is clearly "no". - for two otherwise comparable drives, for comparable price, would I choose the one with this feature? That''s a very different question, and for me and others here, the answer is clearly "yes".> From the discussion I''ve read here, the feature is designed > to keep drives which are *reporting failures* to still be considered > *GOOD*, and to not drop out of RAIDsets in RAID-on-a-card > implementations with RAID-level timeouts <60seconds.No. It is designed to make drives report errors at all, and within predictable time limits, rather than going non-responsive for indeterminate times and possibly reporting an error eventually. The rest of the response process, whether from a raid card or zfs+driver stack, and whether based on timeouts or error reports, is a separate issue. (more on which, below) Consider that a drive that goes totally failed and unresponsive can only be removed by timeout; this lets you tell the difference between failure modes, and know what''s a sensible timeout to consider the drive really-failed.> The solaris timeout, because of m * n * o multiplicative layered > speculative retry nonsense, is 60 seconds or 180 seconds or many > hours, so solaris is IMHO quite broken in this regard but also does > not benefit from the TLER workaround: the long-TLER drives will not > drop out of RAIDsets on ZFS even if they report an error now and then.There''s enough truth in here to make an interesting rant, as always with Miles. I did enjoy it, because I do share the frustration. However, the key point is that concrete reported errors are definitive events to which zfs can respond, rather than relying on timeouts, however abstract or hidden or layered or frustrating. Making the system more deterministic is worthwhile.> What''s really needed for ZFS or RAID in general is (a) for drives to > never spend more than x% of their time attempting recoverySure. When you find where we can buy such a drive, please let us all know.> Because a 7-second-per-read drive will fuck your pool just as badly as > a 70-second-per-read drive: you''re going to have to find and unplug it > before the pool will work again.Agreed, to a point. If the drive repeatedly reports errors, zfs can and will respond by taking it offline. Even if it doesn''t and you have to manually take it offline, at least you will know which drive is having difficulty. -- Dan. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 194 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100124/b28806ca/attachment.bin>
On Jan 23, 2010, at 5:06 AM, Simon Breden wrote:> Thanks a lot. > > I''d looked at SO many different RAID boxes and never had a good feeling about them from the point of view of data safety, so that when I read the ''A Conversation with Jeff Bonwick and Bill Moore - The future of file systems'' article (http://queue.acm.org/detail.cfm?id=1317400), I was convinced that ZFS sounded like what I needed, and thought I''d try to help others see how good ZFS was and how to make their own home systems that work. Publishing the notes as articles had the side-benefit of allowing me to refer back to them when I was reinstalling a new SXCE build etc afresh... :) > > It''s good to see that you''ve been able to set the error reporting time using HDAT2 for your Samsung HD154UI drives, but it is a pity that the change does not persist through cold starts. > > From a brief look, it looks like the utility runs under DOS, so I wonder if it would be possible to convert the code into C and run it immediately after OpenSolaris has booted? That would seem a reasonable automated workaround. I might take a little look at the code. > > However, the big questions still remain: > 1. Does ZFS benefit from shorter error reporting times?

In general, any system which detects and acts upon faults would like to detect faults sooner rather than later.

> 2. Does having shorter error reporting times provide any significant data safety through, for example, preventing ZFS from kicking a drive from a vdev?

On Solaris, ZFS doesn''t kick out drives, FMA does. You can see the currently loaded diagnosis engines using "pfexec fmadm config":

MODULE                VERSION STATUS  DESCRIPTION
cpumem-retire         1.1     active  CPU/Memory Retire Agent
disk-transport        1.0     active  Disk Transport Agent
eft                   1.16    active  eft diagnosis engine
ext-event-transport   0.1     active  External FM event transport
fabric-xlate          1.0     active  Fabric Ereport Translater
fmd-self-diagnosis    1.0     active  Fault Manager Self-Diagnosis
io-retire             2.0     active  I/O Retire Agent
sensor-transport      1.1     active  Sensor Transport Agent
snmp-trapgen          1.0     active  SNMP Trap Generation Agent
sysevent-transport    1.0     active  SysEvent Transport Agent
syslog-msgs           1.0     active  Syslog Messaging Agent
zfs-diagnosis         1.0     active  ZFS Diagnosis Engine
zfs-retire            1.0     active  ZFS Retire Agent

Diagnosis engines relevant to ZFS include:
disk-transport: diagnose SMART reports
fabric-xlate: translate PCI, PCI-X, PCI-E, and bridge reports
zfs-diagnosis: notifies FMA when checksum, IO, and probe failure errors are found by ZFS activity. It also properly handles errors as a result of device removal.
zfs-retire: manages hot spares for ZFS pools
io-retire: retires a device which was diagnosed as faulty (NB may happen at next reboot)
snmp-trapgen: you do configure SNMP traps, right? :-)

Drivers, such as sd/ssd, can send FMA telemetry which will feed the diagnosis engines.

> Those are the questions I would like to hear somebody give an authoritative answer to.

This topic is broader than ZFS. For example, a disk which has both a UFS and ZFS file system could be diagnosed by UFS activity and retired, which would also affect the ZFS pool that uses the disk. Similarly, the disk-transport agent can detect overtemp errors, for which a retirement is a corrective action. For more info, visit the FMA community: http://hub.opensolaris.org/bin/view/Community+Group+fm/ As for an "authoritative answer," UTSL. http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common -- richard
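A few more commands are handy for watching that machinery do its thing alongside "pfexec fmadm config"; nothing exotic, just the standard observability tools:

   pfexec fmdump -eV | less      # raw error telemetry (ereports) coming up from the drivers
   pfexec fmstat                 # per-module statistics for the loaded diagnosis engines
   pfexec fmadm faulty           # resources FMA has actually diagnosed as faulty or retired
   zpool status -x               # the ZFS view of any pools that are not healthy

Between the ereports and "fmadm faulty" you can usually see which timeout or checksum events led FMA to give up on a drive.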
> In general, any system which detects and acts upon > faults would like > to detect faults sooner rather than later. Yes, it makes sense. I think my main concern was about loss - in question 2. > > 2. Does having shorter error reporting times > provide any significant data safety through, for > example, preventing ZFS from kicking a drive from a > vdev? > > On Solaris, ZFS doesn''t kick out drives, FMA does. Thanks for the info. I''ll take a look at those links to gain a better understanding of when a drive gets kicked. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/ -- This message posted from opensolaris.org
>>>>> "sb" == Simon Breden <sbreden at gmail.com> writes:sb> 1. In simple non-RAID single drive ''desktop'' PC scenarios sb> where you have one drive, if your drive is experiencing sb> read/write errors, as this is the only drive you have, and sb> therefore you have no alternative redundant source of data to sb> help with required reconstruction/recovery, you REALLY NEED sb> your drive to try as much as possible to try to recover this sounds convincing to fetishists of an ordered world where egg-laying mammals do not exist, but it''s utter rubbish. As drives go bad they return errors frequently, and they don''t succeed in recovering them. They do not encounter, like, one or two errors per day under general use most of which are recoverable in 7 < x < 60 seconds: this just does not happen except in your dreams. Good drives have zero UNC errors in the smartctl -a logs, and the conditional probability of soon-failure on a drive that''s experienced just one UNC error is much higher than the regular probability of soon-failure. Once a drive for which you have no backup/mirror/whatever is returning errors, the remedy is not to wait longer. This does not work, basically ever. The remedy is to shut down the OS, copy the failing drive onto a good one with ''dd conv=noerror,sync'', fsck, and read back your data (with a bunch of zeroes inserted for unreadable blocks). Depending on how bad the drive is, you''ll have to use a smaller or larger block size: the reason is, most unreadable areas are larger than 1 sector, but the drive is so imbecillic if you read single sectors it will reinvoke its bogus retry timer for each and every sector within the same contiguous unreadable region: it has NO MEMORY for the fact that it already tried to read that area and failed. 60 seconds * <normal # of bad sectors> for a failing/pissed-off drive is generally somewhere between 3 days and forever, so you have to watch progress and start over with larger bs= if you are not on target to finish the dd within three days, because the drive will get worse and worse, so larger bs= (meaning, not bothering trying to read data that you would have been able to read) will get your data off the drive before it fails more completely and thus actually rescue *more*. Anyway, these drives, once they''ve gone bad their behavior is very stupid and nothing like this imaginary world that''s been pitched to you by these bogan electrical engineers who apparently have no experience using their own product. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100125/053c1176/attachment.bin>
> this sounds convincing to fetishists of an ordered > world where > egg-laying mammals do not exist, but it''s utter > rubbish. Very insightful! :) > As drives go bad they return errors frequently, and... Yep, so have good regular backups, and move quickly once probs start. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/ -- This message posted from opensolaris.org
On Mon, Jan 25, 2010 at 05:36:35PM -0500, Miles Nordin wrote:> >>>>> "sb" == Simon Breden <sbreden at gmail.com> writes: > > sb> 1. In simple non-RAID single drive ''desktop'' PC scenarios > sb> where you have one drive, if your drive is experiencing > sb> read/write errors, as this is the only drive you have, and > sb> therefore you have no alternative redundant source of data to > sb> help with required reconstruction/recovery, you REALLY NEED > sb> your drive to try as much as possible to try to recover > > this sounds convincing to fetishists of an ordered world where > egg-laying mammals do not exist, but it''s utter rubbish.There''s a family of platypus in the creek just down the bike path from my house. They''re returning thanks in large part to the removal of rubbish sources upstream.> As drives go bad they return errors frequently, and they don''t succeed > in recovering them.Typically, once a sector fails to read, this is true for that sector (or range of sectors). However, if the sector is overwritten and can be remapped, drives often continue working flawlessly for a long time thereafter. I have two of the original, infamous, glass-platter deathstars. Both were completely unusable, and would disappear into a bottomless pit of endlessly unsuccessful resets and read attempts on just about any attempt to get data off them. However, a write scrub with random data and write cache turned off allowed all the bad sectors to remap, and completely recovered them. They saw many years of use - as scratch space, written often, but with non-critical data. The machine isn''t used much anymore, but I still expect that they work fine today if I turn it on. I left the write cache off, because these and some other drives seemed not to detect and remap errors on write otherwise. They seem to only verify writes in that case, and otherwise rely on the sector already having been found bad and put on the pending list by prior reads, whether from the host or background self-test. The self-test are often also dumb, in that they will stop at the first error. I''ve seen this from several vendors, but won''t assert that it is universal or even still common for current drives.> Depending on how bad the drive is, you''ll have to use a smaller or > larger block sizeYes. I hate to press the point, but this is another area where CCTL/etc is useful - you can more quickly narrow down to the specific problem sectors.> Anyway, these drives, once they''ve gone bad their behavior is very > stupid and nothing like this imaginary world that''s been pitched toAgain, it depends on the behaviour you care about: trying to recover your only copy of crucial data, or getting back to a servicable state by remapping on overwrite. My experience with the latter is positive, and zfs users should need no convincing that relying on the former is crazy, regardless of the specific idiocy of the specific drive in specific (non-)recovery circumstances. Any incidence of errors is a concern to which you should respond, whether with "increased vigilance" or "immediate replacement" depends on your own preferences and paranoia. New drives will often hit a few of these errors in early use, before the weak sectors are found and abandoned, and then work flawlessly thereafter. Burn-in tests and scrubs are well worthwhile. Once they''ve run out of remapped sectors, or have started consistently producing errors, then they''re cactus. Do pay attention to the smart error counts and predictors. Or they can just fail in other strange ways. 
I have one that just works very very very slowly, for every single read request regardless of size or location, even if they''re all returned correctly in the end. The best practices of regular scrubs and sufficiently redundant pools and separate backups stand, in spite of and indeed because of such idiocy. -- Dan.
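For what it is worth, the write scrub Dan describes is easy to reproduce under Linux; the device name below is an example, and this wipes the disk, so it is only for drives whose contents you have already written off:

   hdparm -W0 /dev/sdb                        # turn the drive write cache off first
   dd if=/dev/urandom of=/dev/sdb bs=1M oflag=direct
   smartctl -A /dev/sdb                       # afterwards, check the reallocated and pending sector counts

Whether the drive deserves any further trust after that is a separate question, per the rest of this thread.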
After following this topic for the last few days, with nearly everybody contributing to it, I think it''s time to add a new factor. Vibration. First, some proof of how sensitive modern drives are: http://blogs.sun.com/brendan/entry/unusual_disk_latency Most "enterprise" drives also contain circuitry to handle the vibration resulting from multi-drive setups. Resonance, for example, is avoided by adjusting the spindle speed. Enterprise chassis with drive sleds contain mechanical dampening. In a typical SOHO case all drives are screwed to a shared chassis. The vibration problem is much worse in such a setup. The "green" drives with their lower spindle speeds reduce this effect. Personally I''ve had good experience with the WD RE2 1TB green drives. I''ve an 8-drive pool with these and have seen no problems to date. On another system with 6 Seagate consumer drives I''ve already lost two drives. Both were running 24/7 for almost two years. I would like to thank the people who brought to my attention that the TLER and idle timeout CAN be configured on some drives. Although I''m only interested in these options when they survive a power cycle. Yes, the enterprise drives are expensive, but so is my time and data. Regards, Frederik -- This message posted from opensolaris.org
On Tue, 26 Jan 2010, F. Wessels wrote:> The "green" drives with their lower spindle speeds reduce this > effect. I don''t agree that lowering the spindle speed necessarily reduces resonance. Reducing resonance is accomplished by assuring that chassis components do not resonate (produce standing waves) at a harmonic of the frequency that the rotating media produces. Adjusting the size, length, and weight of chassis components to avoid the harmonics alleviates the problem. A chassis designed for high-speed drives may not do so well with slow-speed drives, and vice-versa. Anyone who has played with audio frequency sweeps and a large subwoofer soon becomes familiar with resonance and that the lower frequencies often cause more problems than the higher ones. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>>>>> "dc" == Daniel Carosone <dan at geek.com.au> writes:dc> There''s a family of platypus in the creek just down the bike dc> path from my house. yeah, happy australiaday. :) What I didn''t understand in school is that egg layers like echidnas are not exotic but are pettingzoo/farm/roadkill type animals. IMHO there''s a severe taxonomic bias among humans, like a form of OCD, that brings us ridiculous things like ``the seven layer OSI model'''' and Tetris and belief in ``RAID edition'''' drives. dc> Typically, once a sector fails to read, this is true for that dc> sector (or range of sectors). However, if the sector is dc> overwritten and can be remapped, drives often continue working dc> flawlessly for a long time thereafter. While I don''t doubt you had that experience (I''ve had it too), I was mostly thinking of the google paper: http://labs.google.com/papers/disk_failures.html They focus on temperature, which makes sense because it''s $$$: spend on cooling, or on replacing drives? and tehy find even >45C does not increase failures until the third year, so basically just forget about it, and forget also about MTBF estimates based on silly temperature timewarp claims and pay attention to their numbers instead. But the interesting result for TLER/ERC is on page 7 figure 7, where you see within the first two years the effect of reallocation on expected life is very pronounced, and they say ``after their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical thereshold for this parameter also ''1''.'''' It also says drives which fail the ''smartctl -t long'' test (again, this part of smartctl is broken on solaris :( plz keep in the back of your mind :), which checks that every sector on the medium is readable, are ``39 times more likely to fail within 60 days than drives without scan errors.'''' so...this suggests to me that read errors are not so much things that happen from time to time even with good drives, and therefore there is not much point in trying to write data into an unreadable sector (to remap it) or to worry about squeezing one marginal sector out of an unredundant desktop drive (the drive''s bad---warn OS, recover data, replace it). One of the things that''s known to cause bad sectors is high-flying writes, and all the google-studied drives were in data centers, so some of this might not be true of laptop drives that get knocked around a fair bit. dc> Once they''ve run out of remapped sectors, or have started dc> consistently producing errors, then they''re cactus. Do pay dc> attention to the smart error counts and predictors. yes, well, you can''t even read these counters on Solaris because smartctl doesn''t make it through the SATA stack, so ``do pay attention to'''' isn''t very practical advice. but if you have Linux, the advice of the google paper is to look at the remapped sector count (is it zero, or more than zero?), and IIRC that sometimes the ``seek error rate'''' can be compared among identical model drives but is useless otherwise. The ``overall health assessment'''' is obviously useless, but hopefully I don''t need to tell anyone that. The ''smartctl -t long'' test is my favorite, but it''s proactive. Anyway the main result I''m interested here is what I just said, that unreadable sectors are not a poisson process. 
They''re strong indicators of drives about to fail, ``the critical threshold is ''1'' '''', and not things around which you can usefully plan cargocult baroque spaghetti rereading strategies. dc> The best practices of regular scrubs and sufficiently dc> redundant pools and separate backups stand, in spite of and dc> indeed because of such idiocy. OK, but the new thing that I''m arguing is that TLER/ERC is a completely useless adaptation to a quirk of RAID card firmware and has nothing to do with ZFS, nor with best RAID practices in general. I''m not certain this statement is true, but from what I''ve heard so far that''s what I think.
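Putting those two checks into commands (again under Linux, since as noted the SATA pass-through this needs is still lacking on solaris; /dev/sda is a placeholder):

   smartctl -A /dev/sda | egrep "Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable"
   smartctl -t long /dev/sda        # full surface read test, runs inside the drive in the background
   smartctl -l selftest /dev/sda    # come back later and read the result

A non-zero reallocated or pending count, or a failed long self-test, is the replace-it-now signal the google paper is talking about.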
@Bob, yes you''re completely right. This kind of engineering is what you get when buying a 2540 for example. All parts are nicely matched. When you build your own whitebox the parts might not match. But that wasn''t my point. Vibration, in the drive and excited by the drive, increases with the spindle speed. Despite fluid bearings or other measures, the platters are always imperfect. And at some point this imbalance can no longer be compensated for. Result: vibration. The amount of energy stored in the platters increases, I presume with the square of the spindle speed. So at higher speeds the effect will get greater. Now back to the resonance. If all drives are vibrating AND in sync then nice standing waves will ripple through your chassis. I''ve seen this, ages ago, in the extreme on arrays where the drives were synced by an external clock signal. This was specialty hardware with a HIPPI interface. Certain modern drives have circuitry to prevent this. Still, preventing the vibrations in the first place is easier with lower speeds; a non-revolving disk will emit zero vibrations. It''s mechanically easier at 5400rpm than at 15000rpm. No vibrations equals no drive-induced resonance. Back to the topic. Since TLER/ERC/CCTL drives usually have this feature as well, and I know the difference between the drives with and without it, I thought it would be relevant for the discussion. Regards, Frederik -- This message posted from opensolaris.org
Good observation. It seems that I''m only keeping ahead of the folks in this forum by running as hard as I can. I just bought the sheet aluminum for making my drive cages. I''m going for the drives-in-a-cage setup, but I''m also floating each drive on vinyl (and hence dissipative, not resonant) vibration dampers per drive. Stock item at McMaster-Carr. Lets each drive not only float from the chassis, but gets you both spring isolation and dissipative isolation from the supporting member. I''ll see if I can take some pictures. A seven-drive cage version of an ATX case/corpse is donating itself for a drilling template for the drive mounting holes. This is complemented by my lucky purchase from craigslist of two 4U rackmount cases for $25. The cages in these are rubber mounted to the outer case, which will help further damp feed-in of air-borne vibration picked up from the large flat panels of the case. The vinyl dampers @ 4/drive will both help keep the individual drive''s vibration in and the other drives'' vibration out, while dissipating it as heat. R.G. -- This message posted from opensolaris.org
On the subject of vibrations when using multiple drives in a case (tower), I''m using silicone grommets on all the drive screws to isolate vibrations. This does seem to greatly reduce the vibrations reaching the chassis, and makes the machine a lot quieter, and so I would expect that this minimises the vibrations transferred between drives via the chassis. In turn I would expect that this greatly reduces errors related to high vibration levels when reading and writing: less vertical head movement, leading to less variation in write signal strength. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/ -- This message posted from opensolaris.org
On 2010-Jan-27 05:38:57 +0800, "F. Wessels" <wessels147 at yahoo.com> wrote:>But that wasn''t my point. Vibration, in the drive and excited by the >drive, increases with the spindle speed. There''s also vibration caused by head actuator movements. This is unlikely to suffer from resonance amplification but may be higher amplitude than spindle-related vibration. And finally, there''s vibration from the various fans in the case. -- Peter Jeremy