thr3ads.net - zfs discuss - [zfs-discuss] Problems while resilvering [Dec 2009]

Matthias Appel

2009-Dec-06 22:29 UTC

[zfs-discuss] Problems while resilvering

Hi,

I have a problem with a ZFS pool that won't resilver correctly.

The pool consists of two two-way mirrors.

One of the disks reported checksum errors and fell out of the pool.

I replaced the faulty hard disk and made the new one available via cfgadm.

The resilver started immediately but reported checksum errors upon
completing, like this:

  pool: performance
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 3h4m with 1 errors on Sun Dec  6 18:27:37 2009
config:

        NAME              STATE     READ WRITE CKSUM
        performance       DEGRADED     0     0     3
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
          mirror          DEGRADED     0     0     6
            c1t2d0        ONLINE       0     0     8  256K resilvered
            replacing     DEGRADED     6     0     0
              c1t3d0s0/o  FAULTED      0     0     0  corrupted data
              c1t3d0      ONLINE       0     0     6  445G resilvered

errors: Permanent errors have been detected in the following files:

        /performance/VIRUSWALL/Ubuntu-000001.vmdk


I deleted the file in question (I have a backup of the file) and ran
zpool clear performance.

Resilvering started again and after completion I got this:

root@storage:/performance/VIRUSWALL# zpool status -v
  pool: performance
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 0h11m, 7.35% done, 2h27m to go
config:

        NAME              STATE     READ WRITE CKSUM
        performance       DEGRADED     0     0     0
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
          mirror          DEGRADED     0     0     0
            c1t2d0        ONLINE       0     0     1  128K resilvered
            replacing     DEGRADED     0     0     0
              c1t3d0s0/o  FAULTED      0     0     0  corrupted data
              c1t3d0      ONLINE       0     0     0  33.7G resilvered

errors: Permanent errors have been detected in the following files:

        performance/VIRUSWALL:<0x88>

Then I tried rolling back to a snapshot of the affected file system and ran
zpool clear again, hoping the resilver would complete
successfully, but it did not.

I tried to destroy performance/VIRUSWALL, but it said the dataset is
busy.

I don't know why the dataset is busy, because I unmounted all NFS mounts
and left the directory in the ssh shell.

I tried zpool clear again, resilvering started again, and now I get
this:


root@storage:/performance/VIRUSWALL# zpool status -v
  pool: performance
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 0h39m, 12.21% done, 4h43m to go
config:

        NAME              STATE     READ WRITE CKSUM
        performance       DEGRADED     0     0     0
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
          mirror          DEGRADED     0     0     0
            c1t2d0        ONLINE       0     0     3  384K resilvered
            replacing     DEGRADED     0     0     0
              c1t3d0s0/o  FAULTED      0     0     0  corrupted data
              c1t3d0      ONLINE       0     0     0  56.5G resilvered

errors: Permanent errors have been detected in the following files:

        <0x24e>:<0x88>


Can anybody tell me how to resilver this pool and get rid
of c1t3d0s0/o, which is the old, defective hard disk?


I don't mind deleting the files with errors; I only want to get a
consistent zpool again.


Please tell me what I did wrong.


I had a problem with a defective hard disk before. I replaced the
hard disk as usual (configured via cfgadm) and resilvering
started. Like this time, ZFS told me that there were defective files, but
after resilvering the errors disappeared (although the defective hard disk
still showed up, like this time). I ran zpool clear again and resilvering
completed that time. Now, however, resilvering does not complete successfully
and ZFS keeps telling me that there are files with errors.

Why do I have to resilver twice to get a consistent zpool? Is it me who is
wrong, or ZFS? (I hope it is me.)


I hope you guys can tell me!

Cindy Swearingen

2009-Dec-07 19:57 UTC

Hi Matthias,

I'm not sure I understand all the issues that are going on
in this configuration, but I don't see that you used the
zpool replace command to complete the physical replacement
of the failed disk, which would look like this:

# zpool replace performance c1t3d0

Then run zpool clear to clear the pool errors.
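
As a rough sketch of that sequence, assuming the pool and disk names from the
status output above: the script below only prints the commands by default, so
it can be reviewed before anything touches the pool. The run helper and the
DRY_RUN variable are made up for illustration, and whether a cfgadm step is
needed first depends on the hardware.

```shell
#!/bin/sh
# Sketch: replace a failed disk in the same slot, then clear error counters.
# POOL and DISK are taken from the zpool status output in this thread.
POOL=performance
DISK=c1t3d0

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

replace_failed_disk() {
    # Tell ZFS that the disk in this slot has been physically replaced.
    run zpool replace "$POOL" "$DISK"
    # Show resilver progress.
    run zpool status -v "$POOL"
    # Reset the error counters. This sketch does not wait for the resilver;
    # in a real run, issue this step only after status reports completion.
    run zpool clear "$POOL"
}

replace_failed_disk
```

Running it with DRY_RUN=0 would execute the commands instead of printing them.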

Thanks,

Cindy

Matthias Appel

2009-Dec-08 04:05 UTC

I accidentally replied only to Cindy, but I wanted to reply to the list.
I don't want to take up too much of Cindy's time; maybe one of the list members
can help me as well.

> -----Original Message-----
> From: Matthias Appel
> Sent: Tuesday, 08 December 2009 03:34
> To: 'Cindy.Swearingen at Sun.COM'
> Subject: RE: [zfs-discuss] Problems while resilvering
> 
> Hi Cindy,
> 
> 
> Thanks for your reply.
> 
> 
> > # zpool replace performance c1t3d0
> >
> > Then run zpool clear to clear the pool errors.
> 
> 
> Does this mean I don't have to run cfgadm -c configure sata0/3 (which
> starts a resilver in my case), and that it
> is sufficient to run zpool replace (I do not have a hot spare)?
> 
> In the meantime, my pool resilvered correctly after 3 or 4 resilvering
> runs (each initiated by a zpool clear).
> I don't understand why issuing a zpool clear starts a resilver.
> 
> 
> 
> And I stumbled upon another issue:
> 
> I've worked with software RAID in a Linux environment, and I am used
> to a disk being found and automatically attached
> back to the existing RAID set when I relocate it to another controller.
> 
> I just added another controller to my system and the device changed
> from c1t0d0 to c2d0.
> 
> zpool status gives me this:
> 
>         NAME         STATE     READ WRITE CKSUM
>         performance  DEGRADED     0     0     0
>           mirror     DEGRADED     0     0     0
>             c1t0d0   UNAVAIL      0     0     0  cannot open
>             c1t1d0   ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c1t2d0   ONLINE       0     0     0
>             c1t3d0   ONLINE       0     0     0
> 
> How can I replace the unavailable drive with c2d0?
> I read the zpool man page, and it only gives me the option of replacing
> a disk with an empty one.
> 
> If I run zpool replace c1t0d0 c2d0, it says that I have to use the -f
> option because the disk was in a zpool before.
> 
> Should I use the -f option? Will only the changed blocks be updated
> on the existing data, or will a complete resilver
> kick in?
> 
> If not, how can I relocate disks from one controller to another?

Cindy Swearingen

2009-Dec-08 16:16 UTC

Hi Matthias,

The process of replacing a disk, and whether you need to unconfigure the
disk, depends on the hardware. With some hardware, like our x4500 series,
you must unconfigure the disk first by using cfgadm. This process is
described in this section of the ZFS Admin Guide:

http://docs.sun.com/app/docs/doc/817-2271/gazgd?a=view

I'm not sure why zpool clear restarts the resilvering, but it is the
zpool replace command that signals to ZFS that the disk has been replaced, not
the zpool clear command.

Try zpool replace c1t0d0 c2d0 with the -f option. Another option is
to detach c1t0d0 and then attach c2d0.
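
The two options might look roughly like the sketch below, assuming the pool
name performance and the mirror layout from the status output quoted in this
thread (c1t1d0 is the surviving half of the affected mirror). The run helper
and DRY_RUN variable are invented for illustration, and the script only prints
the commands by default. It is possible that the attach also needs -f, since
the disk still carries a label from the pool.

```shell
#!/bin/sh
# Sketch: bring the relocated disk back into its mirror under its new name.
POOL=performance
OLD=c1t0d0     # device name ZFS still expects (UNAVAIL)
PEER=c1t1d0    # healthy disk in the same mirror
NEW=c2d0       # the same physical disk behind the new controller

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Option 1: forced replace of the old name with the new name.
option_replace() {
    run zpool replace -f "$POOL" "$OLD" "$NEW"
}

# Option 2: drop the old name from the mirror, then attach the new
# name next to the surviving disk.
option_detach_attach() {
    run zpool detach "$POOL" "$OLD"
    run zpool attach "$POOL" "$PEER" "$NEW"
}

option_detach_attach
```

Either way, expect a resilver of the re-added disk; zpool status shows its progress.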

ZFS can generally detect when a device has changed and can recover, but
I believe this is hardware dependent. If you must change devices that
are part of a ZFS storage pool, then export the pool before the device
is changed.
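
A minimal sketch of the export-first approach, assuming the pool name
performance from this thread. The run helper and DRY_RUN variable are made up
for illustration; by default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: move disks between controllers with the pool exported.
POOL=performance

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: run before shutting down and recabling the disks.
before_move() {
    run zpool export "$POOL"
}

# Step 2: run after the disks are on the new controller.
after_move() {
    # With no arguments, zpool import lists pools available for import.
    run zpool import
    # Import the pool; ZFS locates the member disks under their new names.
    run zpool import "$POOL"
    run zpool status "$POOL"
}

before_move
after_move
```

In practice the two steps are separated by the hardware change, so they would be run as two separate invocations.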


Thanks,

Cindy
