thr3ads.net - zfs discuss - [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk [Mar 2010]

Christian Heßmann

2010-Mar-03 23:46 UTC

[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

Hello guys,


I've already written this on the FreeBSD forums, but so far the
feedback has not been great - it seems the FreeBSD guys aren't that keen on ZFS.
I hope you'll be more experienced with this kind of error:

I have a ZFS pool comprised of two 3-disk RAIDs which I've recently
moved from OS X to FreeBSD (8 stable).

One hard disk failed last weekend with lots of shouting, SMART messages
and even a kernel panic.
I attached a new disk and started the replacement.
Unfortunately, about 20% into the replacement, a second disk in the  
same RAID showed signs of misbehaviour by giving me read errors. The  
resilvering did finish, though, and it left me with only three broken  
files according to zpool status:

[root@camelot /]# zpool status -v tank
   pool: tank
  state: DEGRADED
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

         NAME           STATE     READ WRITE CKSUM
         tank           DEGRADED   137     0     0
           raidz1       ONLINE       0     0     0
             ad17p2     ONLINE       0     0     0
             ad18p2     ONLINE       0     0     0
             ad20p2     ONLINE       0     0     0
           raidz1       DEGRADED   326     0     0
             replacing  DEGRADED     0     0     0
               ad16p2   OFFLINE      2  169K     6
               ad4p2    ONLINE       0     0     0  839G resilvered
             ad14p2     ONLINE       0     0     0  5.33G resilvered
             ad15p2     ONLINE     418     0     0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

         tank/DVD:<0x9cd>
         tank/DVD@20100222225100:/Memento.m4v
         tank/DVD@20100222225100:/Payback.m4v
         tank/DVD@20100222225100:/TheManWhoWasntThere.m4v

I have the feeling the problems on ad15p2 are related to a cable
issue, since it doesn't have any SMART errors, is quite a new drive (3
months old) and was IMHO sufficiently "burned in" by repeatedly
filling it to the brim and checking the contents (via ZFS). So I'd
like to switch off the server, replace the cable and do a scrub
afterwards to make sure it doesn't produce additional errors.
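
For readers following along, the per-device error counters in the table above can also be pulled out programmatically. The following is a minimal Python sketch and not part of the original post: the names `to_count` and `devices_with_errors` are made up for illustration, the sample text is copied from the status output above, and treating the `K` suffix as a decimal thousand is an assumption.

```python
import re

# Sample copied from the `zpool status -v tank` output in this post.
SAMPLE = """\
        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED   137     0     0
          raidz1       ONLINE       0     0     0
            ad17p2     ONLINE       0     0     0
            ad18p2     ONLINE       0     0     0
            ad20p2     ONLINE       0     0     0
          raidz1       DEGRADED   326     0     0
            replacing  DEGRADED     0     0     0
              ad16p2   OFFLINE      2  169K     6
              ad4p2    ONLINE       0     0     0  839G resilvered
            ad14p2     ONLINE       0     0     0  5.33G resilvered
            ad15p2     ONLINE     418     0     0  5.33G resilvered
"""

# Rows look like: NAME STATE READ WRITE CKSUM [optional note].
# The header row is skipped because "STATE" is not a known state word.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|OFFLINE|FAULTED|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)

# Assumption: suffixes are treated as decimal multipliers (approximate).
SUFFIX = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def to_count(text):
    """Convert a counter such as '418' or '169K' to an approximate integer."""
    if text[-1] in SUFFIX:
        return int(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def devices_with_errors(config_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter.

    Rows that share a name (the two 'raidz1' lines) overwrite each other,
    which is acceptable for a quick look but not for real tooling.
    """
    result = {}
    for line in config_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = (to_count(read), to_count(write), to_count(cksum))
        if any(counts):
            result[name] = counts
    return result


if __name__ == "__main__":
    for name, counts in devices_with_errors(SAMPLE).items():
        print(name, counts)
```

On the sample this singles out ad16p2 (the disk being replaced) and ad15p2 (the suspected cable problem), plus the parent rows that aggregate their read errors.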

Unfortunately, although it says the resilvering completed, I can't
detach ad16p2 (the first faulted disk) from the system:

[root@camelot /]# zpool detach tank ad16p2
cannot detach ad16p2: no valid replicas

To be honest, I don't know how to proceed now. It feels like my system
is in a very unstable state right now, with a replacement not yet
finished and errors on two drives in one RAID-Z1.

I deleted the files affected, but have about 20 snapshots of this
filesystem and think these files are in most of them since they're
quite old.

So, what should I do now? Delete all snapshots? Move all other files  
from this filesystem to a new filesystem and destroy the old  
filesystem? Try to export and import the pool? Is it even safe to  
reboot the machine right now?

I got one response in the FreeBSD Forum telling me I should reboot the
machine and do a scrub afterwards; it should then detect that it
doesn't need the old disk anymore - I am a bit reluctant to do that,
to be honest...

Any help would be appreciated.

Thank you.

Christian

Mark J Musante

2010-Mar-04 00:01 UTC

It looks like you're running into a DTL (dirty time log) issue.  ZFS believes that
ad16p2 has some data on it that hasn't been copied off yet, and
it's not considering the fact that it's part of a raidz group
and ad4p2.

There is a CR on this,
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724 but
what's viewable in the bug database is pretty minimal.

If you haven't made a backup yet (or at least done a complete snapshot
and generated a send stream from it), my advice would be to do that now.  Then
reboot and see if that clears the DTL enough to let you do the detach.
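
The order of operations suggested here can be written down as a small dry-run plan. This is only a sketch under stated assumptions, not part of the original message: the helper names `backup_plan` and `post_reboot_plan`, the snapshot name `pre-reboot` and the path `/backup/tank.zfs` are hypothetical, the stream file is assumed to live outside the affected pool, and whether `zfs send -R` is available on the installed ZFS version should be checked first. The functions only build command strings and never run them.

```python
import shlex


def backup_plan(pool, snap_name, stream_path):
    """Build (not run) a recursive snapshot and a send stream of that snapshot."""
    snapshot = shlex.quote(f"{pool}@{snap_name}")
    return [
        # Recursive snapshot of every dataset in the pool.
        f"zfs snapshot -r {snapshot}",
        # Assumption: -R (replication stream) is supported by this ZFS version.
        f"zfs send -R {snapshot} > {shlex.quote(stream_path)}",
    ]


def post_reboot_plan(pool, old_device):
    """Build (not run) the checks to try once the machine is back up."""
    return [
        f"zpool status -v {shlex.quote(pool)}",
        f"zpool detach {shlex.quote(pool)} {shlex.quote(old_device)}",
    ]


if __name__ == "__main__":
    for line in backup_plan("tank", "pre-reboot", "/backup/tank.zfs"):
        print(line)
    print("# reboot here, only after the stream has been written")
    for line in post_reboot_plan("tank", "ad16p2"):
        print(line)
```

Printing the plan before touching the pool makes it easy to review each step, which matters when a reboot is the risky part.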


Bob Friesenhahn

2010-Mar-04 01:57 UTC

On Thu, 4 Mar 2010, Christian Heßmann wrote:

> I've already written this on the FreeBSD forums, but so far, the feedback is
> not so great - seems FreeBSD guys aren't that keen on ZFS. I have some hopes

I see lots and lots of zfs traffic on the discussion list
"freebsd-fs at freebsd.org".  This is where the FreeBSD filesystem
developers hang out.

>         raidz1       DEGRADED   326     0     0
>           replacing  DEGRADED     0     0     0
>             ad16p2   OFFLINE      2  169K     6
>             ad4p2    ONLINE       0     0     0  839G resilvered
>           ad14p2     ONLINE       0     0     0  5.33G resilvered
>           ad15p2     ONLINE     418     0     0  5.33G resilvered
>
> Unfortunately, although it says the resilvering completed, I can't detach
> ad16p2 (the first faulted disk) from the system:

The zpool status you posted shows that ad16p2 is still in 'replacing'
mode.  If this is still the case, then it could be a reason that the
original disk can't yet be removed.

> To be honest, I don't know how to proceed now. It feels like my system is in
> a very unstable state right now, with a replacement not yet finished and
> errors on two drives in one RAID.Z1.

If it is still in 'replacing' mode then it seems that the best policy
is to just wait.  If there is no drive activity on ad4p2 then there
may be something more wrong.

Cold booting a system can be one of the scariest things to do so it 
should be a means of last resort.  Maybe the system would not come 
back.
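
Whether a pool is still mid-replacement can be read off the status text before deciding between waiting and rebooting. The sketch below is an illustration only and not part of the original message: the function name `replacement_state` and its four labels are invented here, and the phrase "resilver in progress" is assumed to be what zpool prints while a resilver is still running (only "resilver completed" appears in this thread).

```python
import re

# Excerpt copied from the status output quoted in this thread.
SAMPLE = """\
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

          raidz1       DEGRADED   326     0     0
            replacing  DEGRADED     0     0     0
              ad16p2   OFFLINE      2  169K     6
              ad4p2    ONLINE       0     0     0  839G resilvered
"""


def replacement_state(status_text):
    """Summarise a `zpool status` text with one of four hypothetical labels."""
    # A row whose first word is "replacing" marks a replacement vdev.
    has_replacing = (
        re.search(r"^\s*replacing\s", status_text, re.MULTILINE) is not None
    )
    # Assumption: this phrase is printed while the resilver is still running.
    if "resilver in progress" in status_text:
        return "resilvering"
    if has_replacing:
        if "resilver completed" in status_text:
            return "replacing-after-resilver"
        return "replacing"
    return "no-replacement-pending"


if __name__ == "__main__":
    print(replacement_state(SAMPLE))
```

For the output posted in this thread the result is the awkward middle case: the resilver reports completion, yet the 'replacing' vdev is still present.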

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Freddie Cash

2010-Mar-04 02:57 UTC

On Wed, Mar 3, 2010 at 5:57 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Thu, 4 Mar 2010, Christian Heßmann wrote:
>
>> To be honest, I don't know how to proceed now. It feels like my system is
>> in a very unstable state right now, with a replacement not yet finished and
>> errors on two drives in one RAID.Z1.
>
> If it is still in 'replacing' mode then it seems that the best policy is to
> just wait.  If there is no drive activity on ad4p2 then there may be
> something more wrong.
>
> Cold booting a system can be one of the scariest things to do so it should
> be a means of last resort.  Maybe the system would not come back.

We've had this happen a couple of times on our FreeBSD-based storage
servers.  Rebooting and manually running a scrub has fixed the issue each
time.

Our setup: 24x 500 GB SATA drives in 3x raidz2 vdevs of 8 drives each

-- 
Freddie Cash
fjwcash at gmail.com

Christian Heßmann

2010-Mar-04 07:39 UTC

On 04.03.2010, at 02:57, Bob Friesenhahn wrote:

> I see lots and lots of zfs traffic on the discussion list
> "freebsd-fs at freebsd.org".  This is where the FreeBSD filesystem
> developers hang out.

Thanks - I'll have a look there. As usual, the cool kids are in
mailing lists... ;-)

> The zpool status you posted shows that ad16p2 is still in
> 'replacing' mode.  If this is still the case, then it could be a
> reason that the original disk can't yet be removed.
[...]
> If it is still in 'replacing' mode then it seems that the best
> policy is to just wait.  If there is no drive activity on ad4p2 then
> there may be something more wrong.

It bothers me as well that it says "replacing" instead of "replaced" or
whatever else it should say. Since the resilvering completed I don't
have any activity on the drives anymore, so I presume it somehow
thinks it's done.

> Cold booting a system can be one of the scariest things to do so it
> should be a means of last resort.  Maybe the system would not come
> back.

That's my fear. Although from what I can gather from the feedback so
far, the FreeBSD users seem somewhat familiar with an error like that
and recommend rebooting. I might take the majority advice, make a
backup of the important parts of the pool and just go for a reboot.

I might repost this to the freebsd-fs list first, though,
so please bear with me if you have to read this again...

Thanks.

Christian

Victor Latushkin

2010-Mar-05 10:59 UTC

Mark J Musante wrote:

> It looks like you're running into a DTL issue.  ZFS believes that ad16p2 has
> some data on it that hasn't been copied off yet, and it's not considering the
> fact that it's part of a raidz group and ad4p2.
>
> There is a CR on this,
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724 but what's
> viewable in the bug database is pretty minimal.
>
> If you haven't made a backup yet (or at least done a complete snapshot and
> generated a send stream from it), my advice would be to do that now.  Then
> reboot and see if that clears the DTL enough to let you do the detach.

Actually, besides the bug mentioned above, resilvering will not clear DTLs upon
completion due to

6887372 DTLs not cleared after resilver if permanent errors present

as there are permanent errors present. By the way, they affect some files
referenced by snapshots, as 'zpool status -v' suggests:

 >> tank/DVD:<0x9cd>
 >> tank/DVD@20100222225100:/Memento.m4v
 >> tank/DVD@20100222225100:/Payback.m4v
 >> tank/DVD@20100222225100:/TheManWhoWasntThere.m4v

On OpenSolaris it is not that difficult to work around this bug without
getting rid of the files with errors (or the snapshots referencing them), but
I'm not sure how to do the same on FreeBSD.

But you always have the option of destroying the snapshot indicated above (and
maybe more).
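
To see which snapshots the error list points at, the entries can be split on their `dataset@snapshot:/path` shape. The following Python sketch is an illustration and not part of the original message: the function names `snapshots_with_errors` and `destroy_commands` are made up, the sample lines are the entries quoted above, and the destroy commands are only printed, never executed.

```python
# Entries copied from the "Permanent errors" list quoted above.
ERROR_LINES = [
    "tank/DVD:<0x9cd>",
    "tank/DVD@20100222225100:/Memento.m4v",
    "tank/DVD@20100222225100:/Payback.m4v",
    "tank/DVD@20100222225100:/TheManWhoWasntThere.m4v",
]


def snapshots_with_errors(error_lines):
    """Return the sorted, de-duplicated snapshot names found in the list."""
    snapshots = set()
    for line in error_lines:
        entry = line.strip()
        # Snapshot entries have the form dataset@snapshot:/path/to/file.
        dataset_part, sep, _path = entry.partition(":/")
        if sep and "@" in dataset_part:
            snapshots.add(dataset_part)
    return sorted(snapshots)


def destroy_commands(snapshots):
    """Build (not run) one `zfs destroy` line per snapshot, for review."""
    return [f"zfs destroy {name}" for name in snapshots]


if __name__ == "__main__":
    for command in destroy_commands(snapshots_with_errors(ERROR_LINES)):
        print(command)
```

Only snapshots named in the error list are covered. Older snapshots that share the same damaged blocks would not be found this way, which is presumably what "and maybe more" refers to.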

regards,
victor

Christian Hessmann

2010-Mar-05 11:28 UTC

Victor,

> Btw, they affect some files referenced by snapshots as
> 'zpool status -v' suggests:
>
>  >> tank/DVD:<0x9cd>
>  >> tank/DVD@20100222225100:/Memento.m4v
>  >> tank/DVD@20100222225100:/Payback.m4v
>  >> tank/DVD@20100222225100:/TheManWhoWasntThere.m4v
>
> In case of OpenSolaris it is not that difficult to work around this bug
> without getting rid of files (snapshots referencing them) with errors,
> but I'm not sure how to do the same on FreeBSD.
> But you always have option of destroying snapshot indicated above (and may
> be more).

I'm still reluctant to reboot the machine, so what I did now was, as you
suggested, destroy these snapshots (after deleting the files from the
current filesystem, of course).
I'm not so sure the result is good, though:

==============
[root@camelot /tank/DVD]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED   137     0     0
          raidz1       ONLINE       0     0     0
            ad17p2     ONLINE       0     0     0
            ad18p2     ONLINE       0     0     0
            ad20p2     ONLINE       0     0     0
          raidz1       DEGRADED   326     0     0
            replacing  DEGRADED     0     0     0
              ad16p2   OFFLINE      2  241K     6
              ad4p2    ONLINE       0     0     0  839G resilvered
            ad14p2     ONLINE       0     0     0  5.33G resilvered
            ad15p2     ONLINE     418     0     0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

        tank/DVD:<0x9cd>
        <0x2064>:<0x25a4>
        <0x20ae>:<0x503>
        <0x20ae>:<0x9cd>
==============
Is there any further information available on these hex messages?


Regards
Christian

Victor Latushkin

2010-Mar-10 07:31 UTC

Christian Hessmann wrote:

> Victor,
>
>> Btw, they affect some files referenced by snapshots as
>> 'zpool status -v' suggests:
>>
>>  >> tank/DVD:<0x9cd>
>>  >> tank/DVD@20100222225100:/Memento.m4v
>>  >> tank/DVD@20100222225100:/Payback.m4v
>>  >> tank/DVD@20100222225100:/TheManWhoWasntThere.m4v
>>
>> In case of OpenSolaris it is not that difficult to work around this bug
>> without getting rid of files (snapshots referencing them) with errors,
>> but I'm not sure how to do the same on FreeBSD.
>> But you always have option of destroying snapshot indicated above (and may
>> be more).
>
> I'm still reluctant to reboot the machine, so what I did now was as you
> suggested destroy these snapshots (after deleting the files from the
> current filesystem, of course).
> I'm not so sure the result is good, though:
>
> ==============
> [root@camelot /tank/DVD]# zpool status -v tank
>   pool: tank
>  state: DEGRADED
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
> config:
> 
>         NAME           STATE     READ WRITE CKSUM
>         tank           DEGRADED   137     0     0
>           raidz1       ONLINE       0     0     0
>             ad17p2     ONLINE       0     0     0
>             ad18p2     ONLINE       0     0     0
>             ad20p2     ONLINE       0     0     0
>           raidz1       DEGRADED   326     0     0
>             replacing  DEGRADED     0     0     0
>               ad16p2   OFFLINE      2  241K     6
>               ad4p2    ONLINE       0     0     0  839G resilvered
>             ad14p2     ONLINE       0     0     0  5.33G resilvered
>             ad15p2     ONLINE     418     0     0  5.33G resilvered
> 
> errors: Permanent errors have been detected in the following files:
> 
>         tank/DVD:<0x9cd>
>         <0x2064>:<0x25a4>
>         <0x20ae>:<0x503>
>         <0x20ae>:<0x9cd>
> ==============
>
> Any further information available on this hex messages?

This tells you that ZFS can no longer map the object numbers from the error log to
meaningful names, and this is expected, as you have destroyed them.

Now you need to rerun a scrub.
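
To make the shape of these entries concrete, here is a small Python sketch that sorts error-list lines into three kinds and decodes the hex numbers. It is an illustration under assumptions, not part of the original message: the function name `classify` and the labels are invented, and reading `<0x...>:<0x...>` as "dataset no longer resolvable" and `name:<0x...>` as "dataset known, object path not resolvable" is an interpretation of the explanation above rather than something stated by the tool.

```python
import re

# Both halves are hex object numbers: <dataset id>:<object id>.
HEX_PAIR = re.compile(r"^<0x([0-9a-fA-F]+)>:<0x([0-9a-fA-F]+)>$")
# A readable dataset name followed by a hex object number.
NAME_AND_HEX = re.compile(r"^(.+):<0x([0-9a-fA-F]+)>$")


def classify(entry):
    """Classify one line of the 'Permanent errors' list.

    Returns a tuple whose first element is a hypothetical label:
    'unknown_dataset', 'unknown_object' or 'path'.
    """
    entry = entry.strip()
    # Check the all-hex form first, since it would also fit NAME_AND_HEX.
    pair = HEX_PAIR.match(entry)
    if pair:
        return ("unknown_dataset", int(pair.group(1), 16), int(pair.group(2), 16))
    named = NAME_AND_HEX.match(entry)
    if named:
        return ("unknown_object", named.group(1), int(named.group(2), 16))
    return ("path", entry)


if __name__ == "__main__":
    for line in ["tank/DVD:<0x9cd>", "<0x2064>:<0x25a4>",
                 "<0x20ae>:<0x503>", "<0x20ae>:<0x9cd>"]:
        print(classify(line))
```

Entries of the all-hex kind refer to things that no longer have a name, so there is nothing left to delete for them. They stay in the list until a later scrub refreshes the error log.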

regards,
victor
