Carsten Aulbert
2009-Nov-26 10:35 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi all,

on an x4500 with a relatively well patched Sol10u8:

# uname -a
SunOS s13 5.10 Generic_141445-09 i86pc i386 i86pc

I've started a scrub after about 2 weeks of operation and now see a lot of
checksum errors:

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 1h17m, 8.96% done, 13h5m to go
config:

        NAME          STATE     READ WRITE CKSUM
        atlashome     DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c5t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     0
            c6t1d0    ONLINE       0     0     6
            c7t1d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     2
            c6t2d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c8t2d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c5t3d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     0
            c0t4d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     1
            c1t5d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     1
            c0t6d0    ONLINE       0     0     0
          raidz1      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              c1t6d0  DEGRADED     6     0    17  too many errors
              c8t7d0  ONLINE       0     0     0  11.8G resilvered
            c5t6d0    ONLINE       0     0     0
            c6t6d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     1
            c8t6d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     1
            c5t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
        logs
          c6t4d0      ONLINE       0     0     0
        spares
          c8t7d0      INUSE     currently in use

So far it seems that the pool has survived this, but I'm a bit worried about
how to trace down the source of these errors.

Any suggestion how to proceed?

Cheers

Carsten
Richard Elling
2009-Nov-26 16:28 UTC
[zfs-discuss] Help needed to find out where the problem is
On Nov 26, 2009, at 2:35 AM, Carsten Aulbert wrote:

> Hi all,
>
> on an x4500 with a relatively well patched Sol10u8:
>
> # uname -a
> SunOS s13 5.10 Generic_141445-09 i86pc i386 i86pc
>
> I've started a scrub after about 2 weeks of operation and now see a lot
> of checksum errors:
>
> s13:~# zpool status
>   pool: atlashome
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.

Have you run 'zpool clear' yet?
 -- richard

> [...]
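For anyone following along, 'zpool clear' takes the pool name and,
optionally, a single device; a minimal sketch using the pool and device
names from this thread:

    # reset the error counters for the whole pool
    zpool clear atlashome

    # or only for the suspect disk
    zpool clear atlashome c1t6d0

    # then watch whether new errors accumulate
    zpool status -v atlashome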
Cindy Swearingen
2009-Nov-26 16:38 UTC
[zfs-discuss] Help needed to find out where the problem is
> Hi all,
>
> on an x4500 with a relatively well patched Sol10u8
>
> I've started a scrub after about 2 weeks of operation and now see a lot
> of checksum errors:
>
> s13:~# zpool status
> [...]
>
> So far it seems that the pool has survived this, but I'm a bit worried
> about how to trace down the source of these errors.
>
> Any suggestion how to proceed?

Hi Carsten,

Did anything about this configuration change before the checksum errors
occurred?

The errors on c1t6d0 are severe enough that your spare kicked in.

You can use the fmdump -eV command to review the disk errors that FMA has
detected. This command can generate a lot of output, but you can see whether
the checksum errors on the disks are transient or whether they occur
repeatedly.

At the very least, I would consider physically replacing c1t6d0.

Cindy
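A rough sketch of the commands Cindy mentions, using names from the thread;
the single-argument form of 'zpool replace' assumes the new disk goes into
the same slot as the old one:

    # start with the one-line-per-event summary before wading into -eV output
    fmdump -e | tail -50

    # full detail only when needed
    fmdump -eV | less

    # after physically swapping the disk in the same slot
    zpool replace atlashome c1t6d0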
Carsten Aulbert
2009-Nov-27 08:44 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi all,

On Thursday 26 November 2009 17:38:42 Cindy Swearingen wrote:
> Did anything about this configuration change before the checksum errors
> occurred?

No, this machine has been running in this configuration for a couple of
weeks now.

> The errors on c1t6d0 are severe enough that your spare kicked in.

Yes, and overnight more spares would have kicked in, had any been available:

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 5h46m with 0 errors on Thu Nov 26 15:55:22 2009
config:

        NAME            STATE     READ WRITE CKSUM
        atlashome       DEGRADED     0     0     0
          raidz1        ONLINE       0     0     0
            c0t0d0      ONLINE       0     0     0
            c1t0d0      ONLINE       0     0     0
            c5t0d0      ONLINE       0     0     0
            c7t0d0      ONLINE       0     0     0
            c8t0d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c0t1d0      ONLINE       0     0     0
            c1t1d0      ONLINE       0     0     0
            c5t1d0      ONLINE       0     0     1
            c6t1d0      ONLINE       0     0     6
            c7t1d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c8t1d0      ONLINE       0     0     0
            c0t2d0      ONLINE       0     0     0
            c1t2d0      ONLINE       0     0     0
            c5t2d0      ONLINE       0     0     3
            c6t2d0      ONLINE       0     0     1
          raidz1        ONLINE       0     0     0
            c7t2d0      ONLINE       0     0     0
            c8t2d0      ONLINE       0     0     1
            c0t3d0      ONLINE       0     0     0
            c1t3d0      ONLINE       0     0     0
            c5t3d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c6t3d0      ONLINE       0     0     0
            c7t3d0      ONLINE       0     0     0
            c8t3d0      ONLINE       0     0     0
            c0t4d0      ONLINE       0     0     0
            c1t4d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c5t4d0      ONLINE       0     0     0
            c7t4d0      ONLINE       0     0     0
            c8t4d0      ONLINE       0     0     0
            c0t5d0      ONLINE       0     0     1
            c1t5d0      ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            c5t5d0      ONLINE       0     0     0
            c6t5d0      ONLINE       0     0     0
            c7t5d0      ONLINE       0     0     0
            c8t5d0      ONLINE       0     0     1
            c0t6d0      ONLINE       0     0     0
          raidz1        DEGRADED     0     0     0
            spare       DEGRADED     0     0     0
              c1t6d0    DEGRADED     6     0    17  too many errors
              c8t7d0    ONLINE       0     0     0  130G resilvered
            c5t6d0      ONLINE       0     0     0
            c6t6d0      DEGRADED     0     0    41  too many errors
            c7t6d0      DEGRADED     1     0    14  too many errors
            c8t6d0      ONLINE       0     0     1
          raidz1        ONLINE       0     0     0
            c0t7d0      ONLINE       0     0     0
            c1t7d0      ONLINE       0     0     1
            c5t7d0      ONLINE       0     0     0
            c6t7d0      ONLINE       0     0     0
            c7t7d0      ONLINE       0     0     0
        logs
          c6t4d0        ONLINE       0     0     0
        spares
          c8t7d0        INUSE     currently in use

errors: No known data errors

> You can use the fmdump -eV command to review the disk errors that FMA has
> detected. This command can generate a lot of output but you can see if
> the checksum errors on the disks are transient or if they occur repeatedly.

Hmm, the output does not seem to stop. After about 1.3 GB of output I
stopped it.
There seem to be a few different types here:

Nov 04 2009 15:54:08.039456458 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x403c56a7d4a00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xea7c0de1586275c7
                vdev = 0xfca535aa8bbc70d1
        (end detector)
        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xfca535aa8bbc70d1
        vdev_type = spare
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x9706d7600
        zio_size = 0x8000
        zio_objset = 0x46
        zio_object = 0xfbcc
        zio_level = 0
        zio_blkid = 0x23
        __ttl = 0x1
        __tod = 0x4af19590 0x25a0eca

or

Nov 02 2009 16:55:37.076615439 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xa351756c27900c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xea7c0de1586275c7
                vdev = 0x55c360b6c3e946ea
        (end detector)
        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x55c360b6c3e946ea
        vdev_type = disk
        vdev_path = /dev/dsk/c8t0d0s0
        vdev_devid = id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBH9EY9H/a
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x1632eee00
        zio_size = 0x400
        zio_objset = 0x28
        zio_object = 0x797549
        zio_level = 0
        zio_blkid = 0x0
        __ttl = 0x1
        __tod = 0x4aef00f9 0x4910f0f

or

Oct 26 2009 15:43:43.973655977 ereport.fs.zfs.zpool
nvlist version: 0
        class = ereport.fs.zfs.zpool
        ena = 0x37f6ca58e400801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x8f607617c7160c92
        (end detector)
        pool = atlashome
        pool_guid = 0x8f607617c7160c92
        pool_context = 2
        pool_failmode = wait
        __ttl = 0x1
        __tod = 0x4ae5b59f 0x3a08cfa9

> At the very least, I would consider physically replacing c1t6d0.

That's an option; then I can see whether the system repairs more of the
errors. Regarding ereports that name a disk, only one disk has been named
in the output so far.

Richard, I'll try zpool clear as well, but wanted to wait for some feedback,
as this is the first time we have hit such a large number of errors.

What I find strange is why a single vdev is producing so many errors. I
think it should not be possible for this to be a controller fault, as these
vdevs span controllers; I've not seen any memory errors (yet), nor any
faulty CPU messages...

Thanks a lot for the input!

Carsten
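Given the size of the -eV dump, one rough way to condense it is to tally the
ereports per device and per class; the field names are the ones visible in
the records above (disk-level ereports carry vdev_path, pool/vdev-level ones
do not):

    # count checksum ereports per affected disk
    fmdump -eV | grep 'vdev_path' | sort | uniq -c | sort -rn

    # count ereports per class (checksum vs. io vs. zpool, ...)
    fmdump -e | awk '{print $NF}' | sort | uniq -c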
Bob Friesenhahn
2009-Nov-27 16:19 UTC
[zfs-discuss] Help needed to find out where the problem is
On Fri, 27 Nov 2009, Carsten Aulbert wrote:
>
>> At the very least, I would consider physically replacing c1t6d0.
>
> That's an option; then I can see whether the system repairs more of the
> errors. Regarding ereports that name a disk, only one disk has been named
> in the output so far.

Definitely replace c1t6d0 once the resilver is complete.

> Richard, I'll try zpool clear as well, but wanted to wait for some
> feedback, as this is the first time we have hit such a large number of
> errors.

It does not seem wise to do a 'clear' until the resilver is complete and
everything is stable.  From what others have posted here, the reported
results sometimes change after any ongoing scrubs/resilvers have completed.

> What I find strange is why a single vdev is producing so many errors. I
> think it should not be possible for this to be a controller fault, as
> these vdevs span controllers; I've not seen any memory errors (yet), nor
> any faulty CPU messages...

It is interesting that, in addition to being in the same vdev, the disks
encountering serious problems are all target 6.  Besides something at the
zfs level, there could be some issue at the device driver or underlying
hardware level.  Or maybe just bad luck.

As I recall, Albert Chin-A-Young posted about a pool failure where many
devices in the same raidz2 vdev spontaneously failed somehow (in his case
the whole pool was lost).  He is using different hardware, but this looks
somewhat similar.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
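One quick way to look below ZFS, at the driver level Bob mentions, is the
per-device error counters; a small sketch using the standard Solaris iostat
error summary:

    # soft/hard/transport error counters per disk, as seen by the disk stack
    iostat -En | grep 'Errors:'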
Carsten Aulbert
2009-Nov-27 17:45 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi Bob,

On Friday 27 November 2009 17:19:22 Bob Friesenhahn wrote:
>
> It is interesting that, in addition to being in the same vdev, the disks
> encountering serious problems are all target 6.  Besides something at the
> zfs level, there could be some issue at the device driver or underlying
> hardware level.  Or maybe just bad luck.
>
> As I recall, Albert Chin-A-Young posted about a pool failure where many
> devices in the same raidz2 vdev spontaneously failed somehow (in his case
> the whole pool was lost).  He is using different hardware, but this looks
> somewhat similar.

It looks quite similar to this one:

http://www.mail-archive.com/storage-discuss@opensolaris.org/msg06125.html

We swapped the drive, resilvering is almost through, and the vdev is showing
a large number of errors:

          raidz1               DEGRADED     0     0     1
            spare              DEGRADED     0     0 8.81M
              replacing        DEGRADED     0     0     0
                c1t6d0s0/o     FAULTED      6     0    17  corrupted data
                c1t6d0         ONLINE       0     0     0  120G resilvered
              c8t7d0           ONLINE       0     0     0  120G resilvered
            c5t6d0             ONLINE       0     0     0
            c6t6d0             DEGRADED     0     0    41  too many errors
            c7t6d0             DEGRADED     1     0    14  too many errors
            c8t6d0             ONLINE       0     0     1

If having all sixes is a problem, maybe we should try a diagonal layout next
time (or solve the n-queens problem on a rectangular Thumper layout)...

I guess after resilvering the next step will be zpool clear and a new scrub,
but I fear that will show errors again.

Cheers

Carsten
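While waiting for the resilver to finish, a trivial sketch for keeping an
eye on its progress from a shell:

    # poll the scrub/resilver status line every five minutes
    while true; do zpool status atlashome | grep 'scrub:'; sleep 300; done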
Carsten Aulbert
2009-Nov-27 17:55 UTC
[zfs-discuss] Help needed to find out where the problem is
On Friday 27 November 2009 18:45:36 Carsten Aulbert wrote:

I was too fast, now it looks completely different:

 scrub: resilver completed after 4h3m with 0 errors on Fri Nov 27 18:46:33 2009
[...]

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 4h3m with 0 errors on Fri Nov 27 18:46:33 2009
config:

        NAME        STATE     READ WRITE CKSUM
        atlashome   DEGRADED     0     0     0
          raidz1    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     1
            c5t1d0  ONLINE       0     0     2
            c6t1d0  ONLINE       0     0     6
            c7t1d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     3
            c6t2d0  ONLINE       0     0     1
          raidz1    ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     1
            c8t2d0  ONLINE       0     0     1
            c0t3d0  ONLINE       0     0     1
            c1t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     1
            c8t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     1
            c1t4d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     1
            c0t5d0  ONLINE       0     0     1
            c1t5d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     1
            c0t6d0  ONLINE       0     0     0
          raidz1    DEGRADED     0     0     1
            c1t6d0  ONLINE       0     0     0  124G resilvered
            c5t6d0  ONLINE       0     0     0
            c6t6d0  DEGRADED     0     0    41  too many errors
            c7t6d0  DEGRADED     1     0    14  too many errors
            c8t6d0  ONLINE       0     0     1
          raidz1    ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     1
            c5t7d0  ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
        logs
          c6t4d0    ONLINE       0     0     0
        spares
          c8t7d0    AVAIL

Now the big question:

(1) zpool clear or
(2) bring in the spare again (or exchange two more disks)?

Opinions?

Cheers

Carsten
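For reference, the two options roughly correspond to the following commands;
a sketch only, with the device names taken from the status above:

    # option 1: reset the counters and re-check with a fresh scrub
    zpool clear atlashome
    zpool scrub atlashome

    # option 2: pull the available spare in for one of the degraded disks,
    # then detach the bad disk once the resilver finishes
    zpool replace atlashome c6t6d0 c8t7d0
    zpool detach atlashome c6t6d0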
Bob Friesenhahn
2009-Nov-27 19:07 UTC
[zfs-discuss] Help needed to find out where the problem is
On Fri, 27 Nov 2009, Carsten Aulbert wrote:
>
> Now the big question:
>
> (1) zpool clear or
> (2) bring in the spare again (or exchange two more disks)?
>
> Opinions?

Since "applications are unaffected" (good sign!), I would save all notes
regarding the current status, do 'zpool clear' and 'zpool scrub', and then
make a decision based on what things look like once the scrub has completed.

If significant degradation continues on similar disks, then replace those
disks and repeat the process until things stabilize.  If things don't
stabilize, then suspect something like a motherboard or midplane problem, or
a bad batch of disks.

Since you are using only raidz1, it is wise to scrub periodically in order
to uncover any failing data before it might be needed to support a resilver.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
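Bob's suggestion of periodic scrubs is easy to automate; a sketch of a
weekly entry in root's crontab (the schedule itself is an arbitrary example,
not something from the thread):

    # scrub every Sunday at 03:00
    0 3 * * 0 /usr/sbin/zpool scrub atlashome

    # quick health check that only prints unhealthy pools
    /usr/sbin/zpool status -x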
Ross Walker
2009-Nov-27 20:31 UTC
[zfs-discuss] Help needed to find out where the problem is
On Nov 27, 2009, at 12:55 PM, Carsten Aulbert <carsten.aulbert at aei.mpg.de>
wrote:

> I was too fast, now it looks completely different:
>
> scrub: resilver completed after 4h3m with 0 errors on Fri Nov 27 18:46:33 2009
> [...]
>
> Now the big question:
>
> (1) zpool clear or
> (2) bring in the spare again (or exchange two more disks)?
>
> Opinions?

I would plan downtime to physically inspect the cabling.

-Ross
Carsten Aulbert
2009-Nov-27 20:53 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi Ross,

On Friday 27 November 2009 21:31:52 Ross Walker wrote:
> I would plan downtime to physically inspect the cabling.

There is not much cabling, as the disks are directly connected to a large
backplane (Sun Fire X4500)...

Cheers

Carsten
Carsten Aulbert
2009-Nov-30 16:46 UTC
[zfs-discuss] Help needed to find out where the problem is
Hi all,

after the disk was exchanged, I ran 'zpool clear' and afterwards another
zpool scrub... and guess what, now another vdev shows similar problems:

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 3h36m with 0 errors on Mon Nov 30 01:29:38 2009
config:

        NAME          STATE     READ WRITE CKSUM
        atlashome     DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     1
            c1t0d0    ONLINE       0     0     2
            c5t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     0
            c6t1d0    ONLINE       0     0     0
            c7t1d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     9
            c6t2d0    ONLINE       0     0     0
          raidz1      DEGRADED     0     0     0
            c7t2d0    DEGRADED    14     0    73  too many errors
            spare     DEGRADED     0     0    80
              c8t2d0  DEGRADED     1     0    21  too many errors
              c8t7d0  ONLINE       0     0     0  154G resilvered
            c0t3d0    ONLINE       0     0     0
            c1t3d0    DEGRADED     0     0    16  too many errors
            c5t3d0    DEGRADED     2     0    84  too many errors
          raidz1      ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     1
            c0t4d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     1
            c7t4d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     1
            c7t5d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     0
            c0t6d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c5t6d0    ONLINE       0     0     0
            c6t6d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0
            c8t6d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     1
            c1t7d0    ONLINE       0     0     0
            c5t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
        logs
          c6t4d0      ONLINE       0     0     0
        spares
          c8t7d0      INUSE     currently in use

errors: No known data errors

Now, the big question is: what could be faulty?  fmadm only shows vdev
checksum problems.  Right now I don't have a spare system available, but
I'll try to set one up.

So far on my list: faulty CPU, SSD, RAM, motherboard, controller, ...

Any suggestions?

Cheers

Carsten
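A few non-ZFS places to look when suspecting CPU, RAM, motherboard or
controller trouble; fmadm and fmstat are the standard Solaris fault-manager
tools, and how much detail prtdiag can show varies by platform:

    # anything FMA has actually diagnosed as a fault (not just raw ereports)
    fmadm faulty

    # per-module fault-manager statistics (which diagnosis engines saw events)
    fmstat

    # platform/hardware summary
    prtdiag -v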
Bob Friesenhahn
2009-Nov-30 18:09 UTC
[zfs-discuss] Help needed to find out where the problem is
On Mon, 30 Nov 2009, Carsten Aulbert wrote:
>
> after the disk was exchanged, I ran 'zpool clear' and afterwards another
> zpool scrub...
>
> and guess what, now another vdev shows similar problems:

Ugh!

> Now, the big question is: what could be faulty?  fmadm only shows vdev
> checksum problems.  Right now I don't have a spare system available, but
> I'll try to set one up.
>
> So far on my list: faulty CPU, SSD, RAM, motherboard, controller, ...

If this is a different vdev than before, then it seems like there is either
a software (driver/kernel) bug or the midplane/motherboard is faulty.  Most
of the problems are reported as CKSUM, which implies that after
(successfully) reading data from the disks in the vdev and concatenating
them to form a zfs block, the resulting zfs block had a checksum error.

> Any suggestions?

Check whether there are fixes available for the kernel you are using.  You
could be encountering a bug which has already been fixed.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
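To see what is actually running before hunting for patches, something along
these lines may help; treating marvell88sx as the X4500 disk-controller
driver is an assumption worth confirming on the box itself:

    # kernel build (already posted at the top of the thread)
    uname -a

    # loaded SATA framework and controller driver modules with version strings
    modinfo | egrep -i 'marvell|sata'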