Karl Pielorz
2008-Sep-08 08:50 UTC
[zfs-discuss] ZFS Failing Drive procedure (mirrored pairs) - did I mess this up?
Hi All,

I run ZFS (a version 6 pool) under FreeBSD. Whilst I realise this changes a
*whole heap* of things, I'm more interested in whether I did anything wrong
when I had a recent drive failure...

One of a mirrored pair of drives on the system started failing, badly
(confirmed by 'hard' read & write errors logged to the console). ZFS also
started showing errors, and the machine started hanging, waiting for I/Os to
complete (which is how I noticed it).

How many errors does a drive have to throw before it's considered "failed"
by ZFS? Mine had got to about 30-40 [not a huge amount], but was making the
system unusable, so I manually attached another hot-spare drive to the
'good' device left in that mirrored pair.

However, ZFS was still trying to read data off the failing drive - this
pushed the re-silver time up to 755 hours, whilst the number of errors in
the next forty minutes or so got to around 300. Not wanting my data
unprotected for 755-odd hours (and fearing the number was just going up and
up) I did:

  zpool detach vol ad4

('ad4' was the failing drive.)

This hung all I/O on the pool :( - I waited 5 hours, and then decided to
reboot.

After the reboot the pool came back OK (with 'ad4' removed), and the
re-silver continued and completed in half an hour.

Thinking about it - perhaps I should have detached ad4 (the failing drive)
before attaching another device? My thinking at the time was that I didn't
know how badly failed the drive was, and obviously removing what might have
been 200Gb of 'perfectly' accessible data from a mirrored pair, prior to
re-silvering to a replacement, didn't sit right.

I'm hoping ZFS shouldn't have hung when I later decided to fix the
situation and remove ad4?

-Kp
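P.S. For reference, the rough sequence was along these lines - 'vol' and
'ad4' are the real pool and failing drive, while 'ad2' and 'ad6' are just
placeholder names here for the surviving mirror member and the spare:

  # check which device is throwing errors and the state of the mirror
  zpool status -v vol

  # attach the spare to the surviving half of the mirrored pair
  # ('ad2' = surviving member, 'ad6' = spare - placeholder names)
  zpool attach vol ad2 ad6

  # re-silver starts onto the spare; later, remove the failing drive
  zpool detach vol ad4

It was that last detach which hung everything.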
Richard Elling
2008-Sep-08 14:30 UTC
[zfs-discuss] ZFS Failing Drive procedure (mirrored pairs) - did I mess this up?
Karl Pielorz wrote:
> Hi All,
>
> I run ZFS (a version 6 pool) under FreeBSD. Whilst I realise this changes a
> *whole heap* of things, I'm more interested in whether I did anything wrong
> when I had a recent drive failure...
>
> One of a mirrored pair of drives on the system started failing, badly
> (confirmed by 'hard' read & write errors logged to the console). ZFS also
> started showing errors, and the machine started hanging, waiting for I/Os
> to complete (which is how I noticed it).
>
> How many errors does a drive have to throw before it's considered "failed"
> by ZFS? Mine had got to about 30-40 [not a huge amount], but was making the
> system unusable, so I manually attached another hot-spare drive to the
> 'good' device left in that mirrored pair.
>
> However, ZFS was still trying to read data off the failing drive - this
> pushed the re-silver time up to 755 hours, whilst the number of errors in
> the next forty minutes or so got to around 300. Not wanting my data
> unprotected for 755-odd hours (and fearing the number was just going up
> and up) I did:
>
>   zpool detach vol ad4
>
> ('ad4' was the failing drive.)
>
> This hung all I/O on the pool :( - I waited 5 hours, and then decided to
> reboot.

This seems like a reasonable process to follow; I would have done much the
same.

> After the reboot the pool came back OK (with 'ad4' removed), and the
> re-silver continued and completed in half an hour.

There are failure modes that disks can get into which seem to be solved by
a power-on reset. I had one of these just last week :-(. We would normally
expect a soft reset to clear the cobwebs, but that was not my experience.

> Thinking about it - perhaps I should have detached ad4 (the failing drive)
> before attaching another device? My thinking at the time was that I didn't
> know how badly failed the drive was, and obviously removing what might
> have been 200Gb of 'perfectly' accessible data from a mirrored pair, prior
> to re-silvering to a replacement, didn't sit right.
>
> I'm hoping ZFS shouldn't have hung when I later decided to fix the
> situation and remove ad4?

[caveat: I've not examined the FreeBSD ZFS port; the following presumes the
FreeBSD port is similar to the Solaris port]

ZFS does not have its own timeouts for this sort of problem. It relies on
the underlying device drivers to manage their timeouts. So there was not
much you could do at the ZFS level other than detach the disk.

-- richard
Miles Nordin
2008-Sep-08 17:34 UTC
[zfs-discuss] ZFS Failing Drive procedure (mirrored pairs) - did I mess this up?
>>>>> "kp" == Karl Pielorz <kpielorz_lst at tdx.co.uk> writes:kp> Thinking about it - perhaps I should have detached ad4 (the kp> failing drive) before attaching another device? no, I think ZFS should be fixed. 1. the procedure you used is how hot spares are used, so anyone who says it''s wrong for any reason is using hindsight bias. 2. Being able to pull data off a failing-but-not-fully-gone drive is something a good storage subsystem should be able to do. I might not expect it of LVM2 or of crappy raid-on-a-card, but I would definitely expect it from Netapp/EMC/Hitachi. 3. Also sometimes one is confused about which drive is failing because of crappy controllers and controller drivers, so by-the-book recovery procedures shouldn''t have to involve ad-hoc detaching. though my experience with software raid other than ZFS is the same---the whole job is about having the Fu to know what to unplug to make the rickety system stable again. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080908/8d192b16/attachment.bin>
Bob Friesenhahn
2008-Sep-08 17:54 UTC
[zfs-discuss] ZFS Failing Drive procedure (mirrored pairs) - did I mess this up?
On Mon, 8 Sep 2008, Miles Nordin wrote:

> no, I think ZFS should be fixed.
>
> 1. the procedure you used is how hot spares are used, so anyone who
>    says it's wrong for any reason is using hindsight bias.
>
> 2. Being able to pull data off a failing-but-not-fully-gone drive is
>    something a good storage subsystem should be able to do. I might
>    not expect it of LVM2 or of crappy raid-on-a-card, but I would
>    definitely expect it from Netapp/EMC/Hitachi.

Please describe (in detail) how ZFS can be improved to be able to retrieve
data from a failing drive (which might take minutes to return a read error
due to "consumer" drive firmware) in a reasonable amount of time. I look
forward to your response.

Thanks,
Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Tuomas Leikola
2008-Sep-08 18:40 UTC
[zfs-discuss] ZFS Failing Drive procedure (mirrored pairs) - did I mess this up?
On Mon, Sep 8, 2008 at 8:54 PM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:

>> 2. Being able to pull data off a failing-but-not-fully-gone drive is
>>    something a good storage subsystem should be able to do. I might
>>    not expect it of LVM2 or of crappy raid-on-a-card, but I would
>>    definitely expect it from Netapp/EMC/Hitachi.
>
> Please describe (in detail) how ZFS can be improved to be able to
> retrieve data from a failing drive (which might take minutes to return
> a read error due to "consumer" drive firmware) in a reasonable amount
> of time. I look forward to your response.

During resilver? Use a heuristic to decide that one drive is more suspect
than the others, and try to issue only "leaf" data requests to that drive.
When its queue is full, the other drive can happily churn away. If the
suspect drive still hangs when the resilver would otherwise complete, issue
the same reads to the other disk(s) to get the data, and forget about it.

This way you won't stall the resilver unnecessarily (once you notice one
drive being slowish), and you still have the suspect drive around if there
are bad blocks on the other drives (unlike most software raid systems). Not
perfect, but better than an ignorant round-robin read balance.

- Tuomas
Karl Pielorz
2008-Sep-08 19:46 UTC
[zfs-discuss] ZFS Failing Drive procedure (mirrored pairs) - did I mess this up?
--On 08 September 2008 07:30 -0700 Richard Elling
<Richard.Elling at Sun.COM> wrote:

> This seems like a reasonable process to follow; I would have done
> much the same.

> [caveat: I've not examined the FreeBSD ZFS port; the following
> presumes the FreeBSD port is similar to the Solaris port]
> ZFS does not have its own timeouts for this sort of problem.
> It relies on the underlying device drivers to manage their
> timeouts. So there was not much you could do at the ZFS level
> other than detach the disk.

Ok, I'm glad I'm finally getting the hang of ZFS, and 'did the right
thing(tm)'.

Is there any tunable in ZFS that will tell it "if you get more than x/y/z
read, write or checksum errors, detach the drive as 'failed'"? Maybe on a
per-drive basis?

It'd probably need some way for the admin to override it (i.e. force it to
be ignored), for those times where you either have to, or for a drive you
know will at least stand a chance of reading the rest of the surface 'past'
the errors.

This would probably be set quite low for 'consumer' grade drives, and
moderately higher for 'enterprise' drives that don't "go out to lunch" for
extended periods while seeing if they can recover a block. You could even
default it to 'infinity' if that's what the current level is.

It'd certainly have saved me a lot of time if, once the number of errors on
the drive had passed a relatively low figure, it had just ditched the
drive...

One other random thought occurred to me when this happened - if I detach a
drive, does ZFS have to update some metadata on *all* the drives in that
pool (including the one I've detached) for it to know it's been detached?
(if that makes sense).

That might explain why the 'detach' I issued just hung (if it had to update
metadata on the drive I was removing, it probably got caught in the wash of
failing I/O timing out on that device).

-Karl
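P.S. The closest thing I can see to doing that by hand, rather than via a
tunable, is taking the suspect drive offline instead of detaching it -
something like the following, using my pool/device names, though I didn't
try this route at the time:

  # stop ZFS issuing I/O to the suspect drive, but keep it as a pool member
  zpool offline vol ad4

  # bring it back later if it turns out to be readable after all
  zpool online vol ad4

That stops reads and writes going to the drive without discarding its copy
of the data - but it's still a manual step, not the error-count threshold
I'm after.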
Richard Elling
2008-Sep-08 23:37 UTC
[zfs-discuss] ZFS Failing Drive procedure (mirrored pairs) - did I mess this up?
Karl Pielorz wrote:
>
> --On 08 September 2008 07:30 -0700 Richard Elling
> <Richard.Elling at Sun.COM> wrote:
>
>> This seems like a reasonable process to follow; I would have done
>> much the same.
>
>> [caveat: I've not examined the FreeBSD ZFS port; the following
>> presumes the FreeBSD port is similar to the Solaris port]
>> ZFS does not have its own timeouts for this sort of problem.
>> It relies on the underlying device drivers to manage their
>> timeouts. So there was not much you could do at the ZFS level
>> other than detach the disk.
>
> Ok, I'm glad I'm finally getting the hang of ZFS, and 'did the right
> thing(tm)'.
>
> Is there any tunable in ZFS that will tell it "if you get more than
> x/y/z read, write or checksum errors, detach the drive as 'failed'"?
> Maybe on a per-drive basis?

This is the function of one or more diagnosis engines in Solaris. Not all
errors are visible to ZFS, so it makes sense to diagnose the error where it
is visible -- usually at the device driver level.

> It'd probably need some way for the admin to override it (i.e. force it
> to be ignored), for those times where you either have to, or for a drive
> you know will at least stand a chance of reading the rest of the surface
> 'past' the errors.
>
> This would probably be set quite low for 'consumer' grade drives, and
> moderately higher for 'enterprise' drives that don't "go out to lunch"
> for extended periods while seeing if they can recover a block. You could
> even default it to 'infinity' if that's what the current level is.
>
> It'd certainly have saved me a lot of time if, once the number of errors
> on the drive had passed a relatively low figure, it had just ditched the
> drive...

In Solaris, this is implemented through the FMA diagnosis engines, which
communicate with interested parties such as ZFS. At present the variables
aren't really tunable, per se, but you can see the values in the source.
For example, the ZFS diagnosis engine is:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c

> One other random thought occurred to me when this happened - if I detach
> a drive, does ZFS have to update some metadata on *all* the drives in
> that pool (including the one I've detached) for it to know it's been
> detached? (if that makes sense).

Yes.

> That might explain why the 'detach' I issued just hung (if it had to
> update metadata on the drive I was removing, it probably got caught in
> the wash of failing I/O timing out on that device).

Yes, I believe this is consistent with what you saw.

-- richard
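P.S. On Solaris, what those diagnosis engines are seeing and deciding can be
inspected with the standard FMA tools - roughly along these lines (the exact
output will vary from system to system):

  # dump the error telemetry (ereports) the diagnosis engines have been fed
  fmdump -e

  # show any resources FMA has diagnosed as faulty (e.g. a failing disk)
  fmadm faulty

  # per-module statistics, including the zfs-diagnosis engine
  fmstat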
The lop-sided mirror stuff I'm going to RFE today would probably do it too:
http://www.opensolaris.org/jive/thread.jspa?threadID=70811&tstart=0

If ZFS realised that a drive was returning results much slower than normal,
it could try reading off the other drive instead. That would allow the
resilver to run from the good drive.

The zpool detach failing is a problem though. I would hope that under
Solaris FMA would have spotted the problem and faulted that drive, but I
still feel that ZFS should be double-checking stuff like this so that we
don't get these situations where the whole pool hangs.

--
This message posted from opensolaris.org