Christian Heßmann
2010-Mar-03 23:46 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
Hello guys,

I've already written this on the FreeBSD forums, but so far the feedback is not so great - seems the FreeBSD guys aren't that keen on ZFS. I have some hopes you'll be more experienced with this kind of error:

I have a ZFS pool comprised of two 3-disk RAIDs (raidz1) which I've recently moved from OS X to FreeBSD (8 stable).

One hard disk failed last weekend with lots of shouting, SMART messages and even a kernel panic. I attached a new disk and started the replacement. Unfortunately, about 20% into the replacement, a second disk in the same RAID showed signs of misbehaviour by giving me read errors. The resilvering did finish, though, and it left me with only three broken files according to zpool status:

[root@camelot /]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED   137     0     0
          raidz1       ONLINE       0     0     0
            ad17p2     ONLINE       0     0     0
            ad18p2     ONLINE       0     0     0
            ad20p2     ONLINE       0     0     0
          raidz1       DEGRADED   326     0     0
            replacing  DEGRADED     0     0     0
              ad16p2   OFFLINE      2  169K     6
              ad4p2    ONLINE       0     0     0  839G resilvered
            ad14p2     ONLINE       0     0     0  5.33G resilvered
            ad15p2     ONLINE     418     0     0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

        tank/DVD:<0x9cd>
        tank/DVD@20100222225100:/Memento.m4v
        tank/DVD@20100222225100:/Payback.m4v
        tank/DVD@20100222225100:/TheManWhoWasntThere.m4v

I have the feeling the problems on ad15p2 are related to a cable issue, since it doesn't have any SMART errors, is quite a new drive (3 months old) and was IMHO sufficiently "burned in" by repeatedly filling it to the brim and checking the contents (via ZFS). So I'd like to switch off the server, replace the cable and do a scrub afterwards to make sure it doesn't produce additional errors.

Unfortunately, although it says the resilvering completed, I can't detach ad16p2 (the first faulted disk) from the system:

[root@camelot /]# zpool detach tank ad16p2
cannot detach ad16p2: no valid replicas

To be honest, I don't know how to proceed now. It feels like my system is in a very unstable state right now, with a replacement not yet finished and errors on two drives in one RAID-Z1.

I deleted the affected files, but I have about 20 snapshots of this filesystem and think these files are in most of them, since they're quite old.

So, what should I do now? Delete all snapshots? Move all other files from this filesystem to a new filesystem and destroy the old filesystem? Try to export and import the pool? Is it even safe to reboot the machine right now?

I got one response in the FreeBSD forum telling me I should reboot the machine and do a scrub afterwards; it should then detect that it doesn't need the old disk anymore. I am a bit reluctant to do that, to be honest...

Any help would be appreciated.

Thank you.

Christian
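For readers who want to pull the misbehaving devices out of a status listing like the one in this post, here is a minimal Python sketch. The helper names `parse_count` and `devices_with_errors` are hypothetical, the sample text is the config table from the post with its indentation simplified, and treating the "K" suffix as a multiple of 1024 is an assumption, so abbreviated counters are only approximate.

```python
import re

# zpool abbreviates large counters (e.g. "169K"). Treating the suffixes as
# powers of 1024 is an assumption; the resulting numbers are approximate.
SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Config table copied from the post above.
SAMPLE_CONFIG = """\
NAME           STATE     READ WRITE CKSUM
tank           DEGRADED   137     0     0
  raidz1       ONLINE       0     0     0
    ad17p2     ONLINE       0     0     0
    ad18p2     ONLINE       0     0     0
    ad20p2     ONLINE       0     0     0
  raidz1       DEGRADED   326     0     0
    replacing  DEGRADED     0     0     0
      ad16p2   OFFLINE      2  169K     6
      ad4p2    ONLINE       0     0     0  839G resilvered
    ad14p2     ONLINE       0     0     0  5.33G resilvered
    ad15p2     ONLINE     418     0     0  5.33G resilvered
"""


def parse_count(text):
    """Convert a counter such as '418' or '169K' to an (approximate) integer."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"not a counter: {text!r}")
    number, suffix = match.groups()
    return int(float(number) * SUFFIX[suffix])


def devices_with_errors(config_text):
    """Return (name, state, read, write, cksum) for rows with non-zero counters."""
    flagged = []
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # header row or a line that is too short to be a device row
        try:
            read, write, cksum = [parse_count(f) for f in fields[2:5]]
        except ValueError:
            continue  # columns 3-5 are not counters, so this is not a device row
        if read or write or cksum:
            flagged.append((fields[0], fields[1], read, write, cksum))
    return flagged


if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE_CONFIG):
        print(row)
```

On the sample, this flags the pool itself, the degraded raidz1, the old disk ad16p2 and ad15p2, which matches the two problem drives described in the post.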
Mark J Musante
2010-Mar-04 00:01 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
It looks like you're running into a DTL (dirty time log) issue. ZFS believes that ad16p2 has some data on it that hasn't been copied off yet, and it's not considering the fact that the disk is part of a raidz group and is being replaced by ad4p2.

There is a CR on this, http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724 but what's viewable in the bug database is pretty minimal.

If you haven't made a backup yet (or at least done a complete snapshot and generated a send stream from it), my advice would be to do that now. Then reboot and see if that clears the DTL enough to let you do the detach.
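Mark's advice to take a complete snapshot and generate a send stream before rebooting can be sketched as a dry run in Python. This only builds the command strings and runs nothing; the helper name `backup_commands`, the snapshot name `pre-reboot` and the target path `/backup/tank.zfs` are hypothetical, and whether `zfs send -R` is available depends on the ZFS version in use, so treat the flags as an assumption to check against the local man page.

```python
import shlex


def backup_commands(pool, snapshot_name, stream_path):
    """Build (but do not run) the commands for a recursive snapshot plus send stream.

    Assumes the installed zfs supports 'snapshot -r' and 'send -R'.
    """
    snap = shlex.quote(f"{pool}@{snapshot_name}")
    return [
        # Recursive snapshot of every filesystem in the pool.
        f"zfs snapshot -r {snap}",
        # Replication stream of that snapshot, written to a file that should
        # live on storage outside the affected pool.
        f"zfs send -R {snap} > {shlex.quote(stream_path)}",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for command in backup_commands("tank", "pre-reboot", "/backup/tank.zfs"):
        print(command)
```

Printing the commands first, rather than executing them, leaves room to review them before touching a pool that is already degraded.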
Bob Friesenhahn
2010-Mar-04 01:57 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
On Thu, 4 Mar 2010, Christian Heßmann wrote:

> I've already written this on the FreeBSD forums, but so far, the feedback is
> not so great - seems FreeBSD guys aren't that keen on ZFS. I have some hopes

I see lots and lots of zfs traffic on the discussion list "freebsd-fs@freebsd.org". This is where the FreeBSD filesystem developers hang out.

>   raidz1       DEGRADED   326     0     0
>     replacing  DEGRADED     0     0     0
>       ad16p2   OFFLINE      2  169K     6
>       ad4p2    ONLINE       0     0     0  839G resilvered
>     ad14p2     ONLINE       0     0     0  5.33G resilvered
>     ad15p2     ONLINE     418     0     0  5.33G resilvered
>
> Unfortunately, although it says the resilvering completed, I can't detach
> ad16p2 (the first faulted disk) from the system:

The zpool status you posted shows that ad16p2 is still in 'replacing' mode. If this is still the case, then that could be the reason the original disk can't yet be removed.

> To be honest, I don't know how to proceed now. It feels like my system is in
> a very unstable state right now, with a replacement not yet finished and
> errors on two drives in one RAID.Z1.

If it is still in 'replacing' mode then it seems that the best policy is to just wait. If there is no drive activity on ad4p2 then there may be something more wrong.

Cold booting a system can be one of the scariest things to do, so it should be a means of last resort. Maybe the system would not come back.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Freddie Cash
2010-Mar-04 02:57 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
On Wed, Mar 3, 2010 at 5:57 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Thu, 4 Mar 2010, Christian Heßmann wrote:
>
>> To be honest, I don't know how to proceed now. It feels like my system is
>> in a very unstable state right now, with a replacement not yet finished and
>> errors on two drives in one RAID.Z1.
>
> If it is still in 'replacing' mode then it seems that the best policy is to
> just wait. If there is no drive activity on ad4p2 then there may be
> something more wrong.
>
> Cold booting a system can be one of the scariest things to do so it should
> be a means of last resort. Maybe the system would not come back.

We've had this happen a couple of times on our FreeBSD-based storage servers. Rebooting and manually running a scrub has fixed the issue each time.

24x 500 GB SATA drives in 3x raidz2 vdevs of 8 drives each

--
Freddie Cash
fjwcash at gmail.com
Christian Heßmann
2010-Mar-04 07:39 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
On 04.03.2010, at 02:57, Bob Friesenhahn wrote:

> I see lots and lots of zfs traffic on the discussion list
> "freebsd-fs@freebsd.org". This is where the FreeBSD filesystem
> developers hang out.

Thanks - I'll have a look there. As usual, the cool kids are in mailing lists... ;-)

> The zpool status you posted shows that ad16p2 is still in
> 'replacing' mode. If this is still the case, then it could be a
> reason that the original disk can't yet be removed.
[...]
> If it is still in 'replacing' mode then it seems that the best
> policy is to just wait. If there is no drive activity on ad4p2 then
> there may be something more wrong.

It bothers me as well that it says "replacing" instead of replaced or whatever else it should say. Since the resilvering completed I don't have any activity on the drives anymore, so I presume it somehow thinks it's done.

> Cold booting a system can be one of the scariest things to do so it
> should be a means of last resort. Maybe the system would not come
> back.

That's my fear. From what I can gather from the feedback so far, though, the FreeBSD users seem somewhat familiar with an error like that and recommend rebooting. I might take the majority advice, make a backup of the important parts of the pool and just go for a reboot.

I might repost this to the freebsd-fs list first, though, so please bear with me if you have to read it again...

Thanks.

Christian
Victor Latushkin
2010-Mar-05 10:59 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
Mark J Musante wrote:

> It looks like you're running into a DTL issue. ZFS believes that ad16p2 has
> some data on it that hasn't been copied off yet, and it's not considering the
> fact that it's part of a raidz group and ad4p2.
>
> There is a CR on this,
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724 but what's
> viewable in the bug database is pretty minimal.
>
> If you haven't made a backup yet (or at least done a complete snapshot and
> generated a send stream from it), my advice would be to do that now. Then
> reboot and see if that clears the DTL enough to let you do the detach.

Actually, besides the bug mentioned above, resilvering will not clear DTLs upon completion due to

6887372 DTLs not cleared after resilver if permanent errors present

as there are permanent errors present. Btw, they affect some files referenced by snapshots, as 'zpool status -v' suggests:

>> tank/DVD:<0x9cd>
>> tank/DVD@20100222225100:/Memento.m4v
>> tank/DVD@20100222225100:/Payback.m4v
>> tank/DVD@20100222225100:/TheManWhoWasntThere.m4v

In the case of OpenSolaris it is not that difficult to work around this bug without getting rid of the files with errors (or the snapshots referencing them), but I'm not sure how to do the same on FreeBSD.

But you always have the option of destroying the snapshot indicated above (and maybe more).

regards,
victor
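Victor's suggestion to destroy the snapshots named in the error list can be sketched as another dry run in Python. The helpers `snapshots_in_errors` and `destroy_commands` are hypothetical and only print `zfs destroy` command strings; nothing is executed, and the entry format is assumed to be the `dataset@snapshot:/path` form seen in the status output quoted in this thread.

```python
import shlex

# Error entries copied from the 'zpool status -v' output quoted in this thread.
ERROR_ENTRIES = [
    "tank/DVD:<0x9cd>",
    "tank/DVD@20100222225100:/Memento.m4v",
    "tank/DVD@20100222225100:/Payback.m4v",
    "tank/DVD@20100222225100:/TheManWhoWasntThere.m4v",
]


def snapshots_in_errors(entries):
    """Return the distinct snapshots referenced by 'dataset@snap:/path' entries."""
    snapshots = []
    for entry in entries:
        dataset, separator, _path = entry.strip().partition(":/")
        # Only entries with a resolved path and an '@' name a snapshot.
        if separator and "@" in dataset and dataset not in snapshots:
            snapshots.append(dataset)
    return snapshots


def destroy_commands(snapshots):
    """Build (but do not run) one 'zfs destroy' command per snapshot."""
    return [f"zfs destroy {shlex.quote(name)}" for name in snapshots]


if __name__ == "__main__":
    for command in destroy_commands(snapshots_in_errors(ERROR_ENTRIES)):
        print(command)
```

The entry `tank/DVD:<0x9cd>` is skipped on purpose: it points at the live filesystem rather than a snapshot. As Victor notes, more snapshots than the one listed may reference the damaged files.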
Christian Hessmann
2010-Mar-05 11:28 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
Victor,

> Btw, they affect some files referenced by snapshots as
> 'zpool status -v' suggests:
>
> >> tank/DVD:<0x9cd>
> >> tank/DVD@20100222225100:/Memento.m4v
> >> tank/DVD@20100222225100:/Payback.m4v
> >> tank/DVD@20100222225100:/TheManWhoWasntThere.m4v
>
> In case of OpenSolaris it is not that difficult to work around this bug
> without getting rid of files (snapshots referencing them) with errors,
> but in I'm not sure how to do the same on FreeBSD.
> But you always have option of destroying snapshot indicated above (and may
> be more).

I'm still reluctant to reboot the machine, so what I did now was, as you suggested, destroy these snapshots (after deleting the files from the current filesystem, of course).
I'm not so sure the result is good, though:

==============
[root@camelot /tank/DVD]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED   137     0     0
          raidz1       ONLINE       0     0     0
            ad17p2     ONLINE       0     0     0
            ad18p2     ONLINE       0     0     0
            ad20p2     ONLINE       0     0     0
          raidz1       DEGRADED   326     0     0
            replacing  DEGRADED     0     0     0
              ad16p2   OFFLINE      2  241K     6
              ad4p2    ONLINE       0     0     0  839G resilvered
            ad14p2     ONLINE       0     0     0  5.33G resilvered
            ad15p2     ONLINE     418     0     0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

        tank/DVD:<0x9cd>
        <0x2064>:<0x25a4>
        <0x20ae>:<0x503>
        <0x20ae>:<0x9cd>
==============

Is any further information available on these hex messages?

Regards
Christian
Victor Latushkin
2010-Mar-10 07:31 UTC
[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk
Christian Hessmann wrote:

> I'm still reluctant to reboot the machine, so what I did now was as you
> suggested destroy these snapshots (after deleting the files from the
> current filesystem, of course).
> I'm not so sure the result is good, though:
>
> errors: Permanent errors have been detected in the following files:
>
>         tank/DVD:<0x9cd>
>         <0x2064>:<0x25a4>
>         <0x20ae>:<0x503>
>         <0x20ae>:<0x9cd>
>
> Any further information available on this hex messages?

This tells you that ZFS can no longer map the object numbers from the error log into meaningful names, and this is expected, as you have destroyed the snapshots they belonged to. Now you need to rerun a scrub.

regards,
victor
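To make the three shapes of error entry seen in this thread easier to tell apart, here is a small Python sketch that classifies them. The function name `classify_entry` and the returned dictionary layout are hypothetical; the reading of `<0x...>:<0x...>` as a dataset number followed by an object number is taken from Victor's explanation above, not from a formal specification.

```python
import re

HEX = r"<0x([0-9a-fA-F]+)>"


def classify_entry(entry):
    """Split one 'zpool status -v' error entry into its parts.

    Returns a dict with keys 'dataset', 'dataset_obj', 'object' and 'path';
    parts that the entry does not contain are None.
    """
    entry = entry.strip()

    # Neither the dataset nor the object could be mapped to a name.
    match = re.fullmatch(HEX + ":" + HEX, entry)
    if match:
        return {"dataset": None, "dataset_obj": int(match.group(1), 16),
                "object": int(match.group(2), 16), "path": None}

    # The dataset is known, but the object no longer has a path.
    match = re.fullmatch(r"(.+):" + HEX, entry)
    if match:
        return {"dataset": match.group(1), "dataset_obj": None,
                "object": int(match.group(2), 16), "path": None}

    # Fully resolved: dataset (or snapshot) name plus file path.
    dataset, separator, path = entry.partition(":/")
    if separator:
        return {"dataset": dataset, "dataset_obj": None,
                "object": None, "path": "/" + path}

    raise ValueError(f"unrecognised entry: {entry!r}")


if __name__ == "__main__":
    for text in ("tank/DVD:<0x9cd>",
                 "<0x20ae>:<0x9cd>",
                 "tank/DVD@20100222225100:/Memento.m4v"):
        print(classify_entry(text))
```

Entries of the first kind are the ones left behind after the snapshots were destroyed. According to Victor's reply they are expected at this point, and the next step is to rerun the scrub.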