Carsten Aulbert
2009-Nov-26 10:35 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi all,

on an x4500 with a relatively well patched Sol10u8:

# uname -a
SunOS s13 5.10 Generic_141445-09 i86pc i386 i86pc

I've started a scrub after about 2 weeks of operation and now see a lot of
checksum errors:

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 1h17m, 8.96% done, 13h5m to go
config:

        NAME          STATE     READ WRITE CKSUM
        atlashome     DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c5t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     0
            c6t1d0    ONLINE       0     0     6
            c7t1d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     2
            c6t2d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c8t2d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c5t3d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     0
            c0t4d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     1
            c1t5d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     1
            c0t6d0    ONLINE       0     0     0
          raidz1      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              c1t6d0  DEGRADED     6     0    17  too many errors
              c8t7d0  ONLINE       0     0     0  11.8G resilvered
            c5t6d0    ONLINE       0     0     0
            c6t6d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     1
            c8t6d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     1
            c5t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
        logs
          c6t4d0      ONLINE       0     0     0
        spares
          c8t7d0      INUSE     currently in use

So far it seems that the pool has survived this, but I'm a bit worried about
how to trace down the source of these errors.

Any suggestion how to proceed?

Cheers

Carsten
Richard Elling
2009-Nov-26 16:28 UTC
[zfs-discuss] Help needed to find out where the problem is
On Nov 26, 2009, at 2:35 AM, Carsten Aulbert wrote:

> Hi all,
>
> on an x4500 with a relatively well patched Sol10u8:
>
> # uname -a
> SunOS s13 5.10 Generic_141445-09 i86pc i386 i86pc
>
> I've started a scrub after about 2 weeks of operation and now see a lot
> of checksum errors:
>
> s13:~# zpool status
>   pool: atlashome
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.

Have you run 'zpool clear' yet?
 -- richard

> [...]
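For anyone following along, 'zpool clear' takes the pool name and,
optionally, a single device; a minimal sketch using the pool and device
names from this thread:

    # reset the error counters for the whole pool
    zpool clear atlashome

    # or only for the suspect disk
    zpool clear atlashome c1t6d0

    # then watch whether new errors accumulate
    zpool status -v atlashome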
Cindy Swearingen
2009-Nov-26 16:38 UTC
[zfs-discuss] Help needed to find out where the problem is
> Hi all,
>
> on an x4500 with a relatively well patched Sol10u8
>
> I've started a scrub after about 2 weeks of operation and now see a lot
> of checksum errors:
>
> s13:~# zpool status
> [...]
>
> So far it seems that the pool has survived this, but I'm a bit worried
> about how to trace down the source of these errors.
>
> Any suggestion how to proceed?

Hi Carsten,

Did anything about this configuration change before the checksum errors
occurred?

The errors on c1t6d0 are severe enough that your spare kicked in.

You can use the fmdump -eV command to review the disk errors that FMA has
detected. This command can generate a lot of output, but you can see whether
the checksum errors on the disks are transient or whether they occur
repeatedly.

At the very least, I would consider physically replacing c1t6d0.

Cindy
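A rough sketch of the commands Cindy mentions, using names from the thread;
the single-argument form of 'zpool replace' assumes the new disk goes into
the same slot as the old one:

    # start with the one-line-per-event summary before wading into -eV output
    fmdump -e | tail -50

    # full detail only when needed
    fmdump -eV | less

    # after physically swapping the disk in the same slot
    zpool replace atlashome c1t6d0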
Carsten Aulbert
2009-Nov-27 08:44 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi all,

On Thursday 26 November 2009 17:38:42 Cindy Swearingen wrote:
> Did anything about this configuration change before the checksum errors
> occurred?

No, this machine has been running in this configuration for a couple of
weeks now.

> The errors on c1t6d0 are severe enough that your spare kicked in.

Yes, and overnight more spares would have kicked in, had any been available:

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 5h46m with 0 errors on Thu Nov 26 15:55:22 2009
config:

        NAME            STATE     READ WRITE CKSUM
        atlashome       DEGRADED     0     0     0
          raidz1        ONLINE       0     0     0
            c0t0d0      ONLINE       0     0     0
            c1t0d0      ONLINE       0     0     0
            c5t0d0      ONLINE       0     0     0
            c7t0d0      ONLINE       0     0     0
            c8t0d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c0t1d0      ONLINE       0     0     0
            c1t1d0      ONLINE       0     0     0
            c5t1d0      ONLINE       0     0     1
            c6t1d0      ONLINE       0     0     6
            c7t1d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c8t1d0      ONLINE       0     0     0
            c0t2d0      ONLINE       0     0     0
            c1t2d0      ONLINE       0     0     0
            c5t2d0      ONLINE       0     0     3
            c6t2d0      ONLINE       0     0     1
          raidz1        ONLINE       0     0     0
            c7t2d0      ONLINE       0     0     0
            c8t2d0      ONLINE       0     0     1
            c0t3d0      ONLINE       0     0     0
            c1t3d0      ONLINE       0     0     0
            c5t3d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c6t3d0      ONLINE       0     0     0
            c7t3d0      ONLINE       0     0     0
            c8t3d0      ONLINE       0     0     0
            c0t4d0      ONLINE       0     0     0
            c1t4d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c5t4d0      ONLINE       0     0     0
            c7t4d0      ONLINE       0     0     0
            c8t4d0      ONLINE       0     0     0
            c0t5d0      ONLINE       0     0     1
            c1t5d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c5t5d0      ONLINE       0     0     0
            c6t5d0      ONLINE       0     0     0
            c7t5d0      ONLINE       0     0     0
            c8t5d0      ONLINE       0     0     1
            c0t6d0      ONLINE       0     0     0
          raidz1        DEGRADED     0     0     0
            spare       DEGRADED     0     0     0
              c1t6d0    DEGRADED     6     0    17  too many errors
              c8t7d0    ONLINE       0     0     0  130G resilvered
            c5t6d0      ONLINE       0     0     0
            c6t6d0      DEGRADED     0     0    41  too many errors
            c7t6d0      DEGRADED     1     0    14  too many errors
            c8t6d0      ONLINE       0     0     1
          raidz1        ONLINE       0     0     0
            c0t7d0      ONLINE       0     0     0
            c1t7d0      ONLINE       0     0     1
            c5t7d0      ONLINE       0     0     0
            c6t7d0      ONLINE       0     0     0
            c7t7d0      ONLINE       0     0     0
        logs
          c6t4d0        ONLINE       0     0     0
        spares
          c8t7d0        INUSE     currently in use

errors: No known data errors

> You can use the fmdump -eV command to review the disk errors that FMA has
> detected. This command can generate a lot of output but you can see if
> the checksum errors on the disks are transient or if they occur repeatedly.

Hmm, the output does not seem to stop. After about 1.3 GB of output I
stopped it.
There seem to be a few different types here:

Nov 04 2009 15:54:08.039456458 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x403c56a7d4a00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xea7c0de1586275c7
                vdev = 0xfca535aa8bbc70d1
        (end detector)
        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xfca535aa8bbc70d1
        vdev_type = spare
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x9706d7600
        zio_size = 0x8000
        zio_objset = 0x46
        zio_object = 0xfbcc
        zio_level = 0
        zio_blkid = 0x23
        __ttl = 0x1
        __tod = 0x4af19590 0x25a0eca

or

Nov 02 2009 16:55:37.076615439 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xa351756c27900c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xea7c0de1586275c7
                vdev = 0x55c360b6c3e946ea
        (end detector)
        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x55c360b6c3e946ea
        vdev_type = disk
        vdev_path = /dev/dsk/c8t0d0s0
        vdev_devid = id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBH9EY9H/a
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x1632eee00
        zio_size = 0x400
        zio_objset = 0x28
        zio_object = 0x797549
        zio_level = 0
        zio_blkid = 0x0
        __ttl = 0x1
        __tod = 0x4aef00f9 0x4910f0f

or

Oct 26 2009 15:43:43.973655977 ereport.fs.zfs.zpool
nvlist version: 0
        class = ereport.fs.zfs.zpool
        ena = 0x37f6ca58e400801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x8f607617c7160c92
        (end detector)
        pool = atlashome
        pool_guid = 0x8f607617c7160c92
        pool_context = 2
        pool_failmode = wait
        __ttl = 0x1
        __tod = 0x4ae5b59f 0x3a08cfa9

> At the very least, I would consider physically replacing c1t6d0.

That's an option; then I can see whether the system repairs more of the
errors. Regarding ereports that name a disk, only one disk has been named
in the output so far.

Richard, I'll try zpool clear as well, but wanted to wait for some feedback,
as this is the first time we have hit such a large number of errors.

What I find strange is why a single vdev is producing so many errors. I
think it should not be possible for this to be a controller fault, as these
vdevs span controllers; I've not seen any memory errors (yet), nor any
faulty CPU messages...

Thanks a lot for the input!

Carsten
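Given the size of the -eV dump, one rough way to condense it is to tally the
ereports per device and per class; the field names are the ones visible in
the records above (disk-level ereports carry vdev_path, pool/vdev-level ones
do not):

    # count checksum ereports per affected disk
    fmdump -eV | grep 'vdev_path' | sort | uniq -c | sort -rn

    # count ereports per class (checksum vs. io vs. zpool, ...)
    fmdump -e | awk '{print $NF}' | sort | uniq -c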
Bob Friesenhahn
2009-Nov-27 16:19 UTC
[zfs-discuss] Help needed to find out where the problem is
On Fri, 27 Nov 2009, Carsten Aulbert wrote:
>
>> At the very least, I would consider physically replacing c1t6d0.
>
> That's an option; then I can see whether the system repairs more of the
> errors. Regarding ereports that name a disk, only one disk has been named
> in the output so far.

Definitely replace c1t6d0 once the resilver is complete.

> Richard, I'll try zpool clear as well, but wanted to wait for some
> feedback, as this is the first time we have hit such a large number of
> errors.

It does not seem wise to do a 'clear' until the resilver is complete and
everything is stable.  From what others have posted here, the reported
results sometimes change after any ongoing scrubs/resilvers have completed.

> What I find strange is why a single vdev is producing so many errors. I
> think it should not be possible for this to be a controller fault, as
> these vdevs span controllers; I've not seen any memory errors (yet), nor
> any faulty CPU messages...

It is interesting that, in addition to being in the same vdev, the disks
encountering serious problems are all target 6.  Besides something at the
zfs level, there could be some issue at the device driver or underlying
hardware level.  Or maybe just bad luck.

As I recall, Albert Chin-A-Young posted about a pool failure where many
devices in the same raidz2 vdev spontaneously failed somehow (in his case
the whole pool was lost).  He is using different hardware, but this looks
somewhat similar.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
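One quick way to look below ZFS, at the driver level Bob mentions, is the
per-device error counters; a small sketch using the standard Solaris iostat
error summary:

    # soft/hard/transport error counters per disk, as seen by the disk stack
    iostat -En | grep 'Errors:'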
Carsten Aulbert
2009-Nov-27 17:45 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi Bob,

On Friday 27 November 2009 17:19:22 Bob Friesenhahn wrote:
>
> It is interesting that, in addition to being in the same vdev, the disks
> encountering serious problems are all target 6.  Besides something at the
> zfs level, there could be some issue at the device driver or underlying
> hardware level.  Or maybe just bad luck.
>
> As I recall, Albert Chin-A-Young posted about a pool failure where many
> devices in the same raidz2 vdev spontaneously failed somehow (in his case
> the whole pool was lost).  He is using different hardware, but this looks
> somewhat similar.

It looks quite similar to this one:

http://www.mail-archive.com/storage-discuss@opensolaris.org/msg06125.html

We swapped the drive, resilvering is almost through, and the vdev is showing
a large number of errors:

          raidz1               DEGRADED     0     0     1
            spare              DEGRADED     0     0 8.81M
              replacing        DEGRADED     0     0     0
                c1t6d0s0/o     FAULTED      6     0    17  corrupted data
                c1t6d0         ONLINE       0     0     0  120G resilvered
              c8t7d0           ONLINE       0     0     0  120G resilvered
            c5t6d0             ONLINE       0     0     0
            c6t6d0             DEGRADED     0     0    41  too many errors
            c7t6d0             DEGRADED     1     0    14  too many errors
            c8t6d0             ONLINE       0     0     1

If having all sixes is a problem, maybe we should try a diagonal layout next
time (or solve the n-queens problem on a rectangular Thumper layout)...

I guess after resilvering the next step will be zpool clear and a new scrub,
but I fear that will show errors again.

Cheers

Carsten
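While waiting for the resilver to finish, a trivial sketch for keeping an
eye on its progress from a shell:

    # poll the scrub/resilver status line every five minutes
    while true; do zpool status atlashome | grep 'scrub:'; sleep 300; done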
Carsten Aulbert
2009-Nov-27 17:55 UTC
[zfs-discuss] Help needed to find out where the problem is
On Friday 27 November 2009 18:45:36 Carsten Aulbert wrote:

I was too fast, now it looks completely different:

 scrub: resilver completed after 4h3m with 0 errors on Fri Nov 27 18:46:33 2009
[...]

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 4h3m with 0 errors on Fri Nov 27 18:46:33 2009
config:

        NAME        STATE     READ WRITE CKSUM
        atlashome   DEGRADED     0     0     0
          raidz1    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     1
            c5t1d0  ONLINE       0     0     2
            c6t1d0  ONLINE       0     0     6
            c7t1d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     3
            c6t2d0  ONLINE       0     0     1
          raidz1    ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     1
            c8t2d0  ONLINE       0     0     1
            c0t3d0  ONLINE       0     0     1
            c1t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     1
            c8t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     1
            c1t4d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     1
            c0t5d0  ONLINE       0     0     1
            c1t5d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     1
            c0t6d0  ONLINE       0     0     0
          raidz1    DEGRADED     0     0     1
            c1t6d0  ONLINE       0     0     0  124G resilvered
            c5t6d0  ONLINE       0     0     0
            c6t6d0  DEGRADED     0     0    41  too many errors
            c7t6d0  DEGRADED     1     0    14  too many errors
            c8t6d0  ONLINE       0     0     1
          raidz1    ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     1
            c5t7d0  ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
        logs
          c6t4d0    ONLINE       0     0     0
        spares
          c8t7d0    AVAIL

Now the big question:

(1) zpool clear or
(2) bring in the spare again (or exchange two more disks)?

Opinions?

Cheers

Carsten
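For reference, the two options roughly correspond to the following commands;
a sketch only, with the device names taken from the status above:

    # option 1: reset the counters and re-check with a fresh scrub
    zpool clear atlashome
    zpool scrub atlashome

    # option 2: pull the available spare in for one of the degraded disks,
    # then detach the bad disk once the resilver finishes
    zpool replace atlashome c6t6d0 c8t7d0
    zpool detach atlashome c6t6d0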
Bob Friesenhahn
2009-Nov-27 19:07 UTC
[zfs-discuss] Help needed to find out where the problem is
On Fri, 27 Nov 2009, Carsten Aulbert wrote:
>
> Now the big question:
>
> (1) zpool clear or
> (2) bring in the spare again (or exchange two more disks)?
>
> Opinions?

Since "applications are unaffected" (good sign!), I would save all notes
regarding the current status, do 'zpool clear' and 'zpool scrub', and then
make a decision based on what things look like once the scrub has completed.

If significant degradation continues on similar disks, then replace those
disks and repeat the process until things stabilize.  If things don't
stabilize, then suspect something like a motherboard or midplane problem, or
a bad batch of disks.

Since you are using only raidz1, it is wise to scrub periodically in order
to uncover any failing data before it might be needed to support a resilver.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
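Bob's suggestion of periodic scrubs is easy to automate; a sketch of a
weekly entry in root's crontab (the schedule itself is an arbitrary example,
not something from the thread):

    # scrub every Sunday at 03:00
    0 3 * * 0 /usr/sbin/zpool scrub atlashome

    # quick health check that only prints unhealthy pools
    /usr/sbin/zpool status -x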
Ross Walker
2009-Nov-27 20:31 UTC
[zfs-discuss] Help needed to find out where the problem is
On Nov 27, 2009, at 12:55 PM, Carsten Aulbert <carsten.aulbert at aei.mpg.de>
wrote:

> I was too fast, now it looks completely different:
>
> scrub: resilver completed after 4h3m with 0 errors on Fri Nov 27 18:46:33 2009
> [...]
>
> Now the big question:
>
> (1) zpool clear or
> (2) bring in the spare again (or exchange two more disks)?
>
> Opinions?

I would plan downtime to physically inspect the cabling.

-Ross
Carsten Aulbert
2009-Nov-27 20:53 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi Ross,

On Friday 27 November 2009 21:31:52 Ross Walker wrote:
> I would plan downtime to physically inspect the cabling.

There is not much cabling, as the disks are directly connected to a large
backplane (Sun Fire X4500)...

Cheers

Carsten
Carsten Aulbert
2009-Nov-30 16:46 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi all,

after the disk was exchanged, I ran 'zpool clear' and afterwards another
zpool scrub... and guess what, now another vdev shows similar problems:

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 3h36m with 0 errors on Mon Nov 30 01:29:38 2009
config:

        NAME          STATE     READ WRITE CKSUM
        atlashome     DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     1
            c1t0d0    ONLINE       0     0     2
            c5t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     0
            c6t1d0    ONLINE       0     0     0
            c7t1d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     9
            c6t2d0    ONLINE       0     0     0
          raidz1      DEGRADED     0     0     0
            c7t2d0    DEGRADED    14     0    73  too many errors
            spare     DEGRADED     0     0    80
              c8t2d0  DEGRADED     1     0    21  too many errors
              c8t7d0  ONLINE       0     0     0  154G resilvered
            c0t3d0    ONLINE       0     0     0
            c1t3d0    DEGRADED     0     0    16  too many errors
            c5t3d0    DEGRADED     2     0    84  too many errors
          raidz1      ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     1
            c0t4d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     1
            c7t4d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     1
            c7t5d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     0
            c0t6d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c5t6d0    ONLINE       0     0     0
            c6t6d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0
            c8t6d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     1
            c1t7d0    ONLINE       0     0     0
            c5t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
        logs
          c6t4d0      ONLINE       0     0     0
        spares
          c8t7d0      INUSE     currently in use

errors: No known data errors

Now, the big question is: what could be faulty?  fmadm only shows vdev
checksum problems.  Right now I don't have a spare system available, but
I'll try to set one up.

So far on my list: faulty CPU, SSD, RAM, motherboard, controller, ...

Any suggestions?

Cheers

Carsten
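A few non-ZFS places to look when suspecting CPU, RAM, motherboard or
controller trouble; fmadm and fmstat are the standard Solaris fault-manager
tools, and how much detail prtdiag can show varies by platform:

    # anything FMA has actually diagnosed as a fault (not just raw ereports)
    fmadm faulty

    # per-module fault-manager statistics (which diagnosis engines saw events)
    fmstat

    # platform/hardware summary
    prtdiag -v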
Bob Friesenhahn
2009-Nov-30 18:09 UTC
[zfs-discuss] Help needed to find out where the problem is
On Mon, 30 Nov 2009, Carsten Aulbert wrote:
>
> after the disk was exchanged, I ran 'zpool clear' and afterwards another
> zpool scrub...
>
> and guess what, now another vdev shows similar problems:

Ugh!

> Now, the big question is: what could be faulty?  fmadm only shows vdev
> checksum problems.  Right now I don't have a spare system available, but
> I'll try to set one up.
>
> So far on my list: faulty CPU, SSD, RAM, motherboard, controller, ...

If this is a different vdev than before, then it seems like there is either
a software (driver/kernel) bug or the midplane/motherboard is faulty.  Most
of the problems are reported as CKSUM, which implies that after
(successfully) reading data from the disks in the vdev and concatenating
them to form a zfs block, the resulting zfs block had a checksum error.

> Any suggestions?

Check whether there are fixes available for the kernel you are using.  You
could be encountering a bug which has already been fixed.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
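To see what is actually running before hunting for patches, something along
these lines may help; treating marvell88sx as the X4500 disk-controller
driver is an assumption worth confirming on the box itself:

    # kernel build (already posted at the top of the thread)
    uname -a

    # loaded SATA framework and controller driver modules with version strings
    modinfo | egrep -i 'marvell|sata'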