Dear All,

I am receiving DEGRADED for zpool status -v. 3 out of 14 disks are reported as degraded with "too many errors". This is Build 99 running on an x4240 with an STK SAS RAID controller. The version of the AAC driver is 2.2.5. I am not sure even where to start. Any advice is very much appreciated. I am trying to convince management that ZFS is the way to go, and then I get this problem. The RAID controller does not report any problems with the drives. This is a RAIDZ (RAID5) zpool.

Thank you everybody.

Regards,
Leonid
--
This message posted from opensolaris.org
Leonid Roodnitsky wrote:
> Dear All,
>
> I am receiving DEGRADED for zpool status -v. 3 out of 14 disks are reported as degraded with "too many errors". This is Build 99 running on x4240 with STK SAS RAID controller. Version of AAC driver is 2.2.5. I am not sure even where to start. Any advice is very much appreciated. Trying to convince management that ZFS is the way to go and then getting this problem. RAID controller does not report any problems with drives. This is RAIDZ (RAID5) zpool. Thank you everybody.
>

The zpool man page says:

    The health of the top-level vdev, such as mirror or raidz device, is potentially impacted by the state of its associated vdevs, or component devices. A top-level vdev or component device is in one of the following states:

    DEGRADED
        One or more top-level vdevs is in the degraded state because one or more component devices are offline. Sufficient replicas exist to continue functioning.

        One or more component devices is in the degraded or faulted state, but sufficient replicas exist to continue functioning. The underlying conditions are as follows:

        o The number of checksum errors exceeds acceptable levels and the device is degraded as an indication that something may be wrong. ZFS continues to use the device as necessary.

        o The number of I/O errors exceeds acceptable levels. The device could not be marked as faulted because there are insufficient replicas to continue functioning.

You should take this into consideration as you decide whether to replace disks or not.
 -- richard
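The two conditions in the man page excerpt (checksum errors versus I/O errors) show up as separate counters in the READ, WRITE and CKSUM columns of zpool status. A minimal sketch for picking out the affected devices follows. The function name nonzero_vdevs and the pool name "tank" are made up for illustration, and the sketch assumes the usual "NAME STATE READ WRITE CKSUM" column layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): reads `zpool status` output on
# stdin and prints every vdev row whose READ, WRITE or CKSUM counter is
# non-zero. On Solaris, nawk may be needed in place of awk.
nonzero_vdevs() {
    awk '
        # Config rows look like: NAME STATE READ WRITE CKSUM [note...]
        # The "state: DEGRADED" summary line has only two fields, so the
        # NF test keeps it out.
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            if ($3 != "0" || $4 != "0" || $5 != "0")
                printf "%s state=%s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
        }
    '
}

# Typical use on a live system ("tank" is a placeholder pool name):
#   zpool status -v tank | nonzero_vdevs
```

A device that only accumulates CKSUM errors while READ and WRITE stay at zero is returning data that does not match its checksum without the transport reporting a failure, which fits the first condition quoted above. Once a suspected cause has been addressed, zpool clear resets the counters and zpool scrub re-reads and verifies the pool's data, so a recurrence is easy to spot.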
Dear All,

Is there any way to figure out which piece is at fault? The Sun SAS RAID (Adaptec/Intel) controller is reporting that the drives are good, but ZFS is not happy about checksum errors. Is there any way to figure out which component introduced the error?

Leonid
Leonid,

You could use the fmdump -eV command to look for problems with these disks. This command might generate a lot of output, but it should be clear if the root cause is a problem accessing these devices.

I would also check /var/adm/messages for any driver-related messages.

Cindy

Leonid Roodnitsky wrote:
> Dear All,
>
> Is there any way to figure out which piece is at fault? Sun SAS RAID (Adaptec/Intel) controller is reporting that drives are good, but ZFS is not happy about checksum errors. Is there any way to figure out which component introduced the error?
>
> Leonid
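Because fmdump -eV can produce a lot of output, it may help to reduce it to a count of error reports per class and device. The following is only a sketch: the function name summarize_ereports is invented here, and it assumes each report carries a "class = ..." line plus, where available, a "device-path" line (driver reports) or a "vdev_path" line (ZFS reports).

```shell
#!/bin/sh
# Hypothetical helper: reads `fmdump -eV` output on stdin and counts
# error reports per (class, device) pair, most frequent first.
# On Solaris, nawk may be needed in place of awk.
summarize_ereports() {
    awk '
        # Record the report collected so far, then reset for the next one.
        function emit() {
            if (cls != "") count[cls " " (path == "" ? "-" : path)]++
            cls = ""; path = ""
        }
        # A "class = ..." line starts a new report.
        $1 == "class" && $2 == "=" { emit(); cls = $3 }
        # Assumed field names for the device; the path is the last field.
        $1 == "device-path" || $1 == "vdev_path" { path = $NF }
        END { emit(); for (k in count) print count[k], k }
    ' | sort -rn
}

# Typical use on a live system:
#   fmdump -eV | summarize_ereports
#   grep -i aac /var/adm/messages    # driver-related messages
```

If most reports cluster on one device path, that points at a disk or its slot. Reports spread evenly across every disk behind the same controller would make the controller, driver or cabling a more likely suspect.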
Could this be relevant? Notice the sd_cache_control mismatch message. Thank you everybody for any ideas or help. I really appreciate it.

    Feb 06 2009 23:14:07.704531935 ereport.io.scsi.cmd.disk.dev.uderr
    nvlist version: 0
            class = ereport.io.scsi.cmd.disk.dev.uderr
            ena = 0x2487a4cf2e00c01
            detector = (embedded nvlist)
            nvlist version: 0
                    version = 0x0
                    scheme = dev
                    device-path = /pci@0,0/pci10de,375@f/pci108e,286@0/disk@1,0
                    devid = id1,sd@TSun_____STK_RAID_INT____6DB80B08
            (end detector)

            driver-assessment = fail
            op-code = 0x1a
            cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
            pkt-reason = 0x0
            pkt-state = 0x1f
            pkt-stats = 0x0
            stat-code = 0x0
            un-decode-info = sd_cache_control: Mode Sense caching page code mismatch 0
            un-decode-value =
            __ttl = 0x1
            __tod = 0x498d189f 0x29fe4ddf

Leonid

-----Original Message-----
From: Cindy.Swearingen at Sun.COM [mailto:Cindy.Swearingen at Sun.COM]
Sent: Tuesday, February 10, 2009 3:42 PM
To: Roodnitsky, Leonid
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] ZFS corruption

Leonid,

You could use the fmdump -eV command to look for problems with these disks. This command might generate a lot of output, but it should be clear if the root cause is a problem accessing these devices.

I would also check /var/adm/messages for any driver-related messages.

Cindy

Leonid Roodnitsky wrote:
> Dear All,
>
> Is there any way to figure out which piece is at fault? Sun SAS RAID (Adaptec/Intel) controller is reporting that drives are good, but ZFS is not happy about checksum errors. Is there any way to figure out which component introduced the error?
>
> Leonid
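Two fields of the error report above can be decoded by hand, which may help when judging whether it is relevant. The __tod pair is seconds and nanoseconds since the Unix epoch in hex, and the cdb bytes are the SCSI command that failed: opcode 0x1a is MODE SENSE(6), whose third byte carries the requested page code (0x08 is the caching mode page). The function names below are invented for this sketch.

```shell
#!/bin/sh
# Hypothetical helpers for decoding fields of an fmdump -eV error report.

# Convert the two hex words of __tod into "seconds.nanoseconds".
tod_to_epoch() {
    printf '%d.%09d\n' "$1" "$2"
}

# Decode a 6-byte MODE SENSE(6) CDB given as six hex bytes.
decode_mode_sense6() {
    opcode=$(( $1 ))
    page=$(( $3 & 0x3f ))      # low 6 bits of byte 2: page code
    alloc=$(( $5 ))            # byte 4: allocation length
    if [ "$opcode" -ne 26 ]; then   # 26 == 0x1a
        echo "not MODE SENSE(6)"
        return 1
    fi
    printf 'MODE SENSE(6) page_code=0x%02x alloc_len=%d\n' "$page" "$alloc"
}

tod_to_epoch 0x498d189f 0x29fe4ddf
# prints 1233983647.704531935

decode_mode_sense6 0x1a 0x0 0x8 0x0 0x18 0x0
# prints MODE SENSE(6) page_code=0x08 alloc_len=24
```

The nanosecond part matches the .704531935 in the report's header line, so both refer to the same event. The decoded command suggests that the sd driver asked the RAID volume for its caching mode page and did not get page 0x08 back. On its own that reads as a cache-settings query the controller answered unexpectedly, not as evidence of data corruption, so it is worth checking whether any reports of class ereport.fs.zfs.checksum appear alongside it.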