Hi,

I recently noticed that iostat is reporting a lot of Hard Errors on multiple drives. Also, dmesg reports various messages from the mpt driver.

My config is:
MB: SUPERMICRO X8SIL-F
HBA: AOC-USAS-L8i (LSI 1068)
RAM: 4GB ECC
SunOS SAN 5.11 snv_134 i86pc i386 i86pc Solaris

My configuration is a striped mirrored vdev of 13 drives (one mirror had an error on a drive, which I cleared. But just to be safe I added another drive to the mirror):

NAME           STATE   READ WRITE CKSUM
zpool          ONLINE     0     0     0
  mirror-0     ONLINE     0     0     0
    c4t13d0    ONLINE     0     0     0
    c4t19d0    ONLINE     0     0     0
  mirror-1     ONLINE     0     0     0
    c4t25d0    ONLINE     0     0     0
    c4t31d0    ONLINE     0     0     0
  mirror-2     ONLINE     0     0     0
    c4t12d0    ONLINE     0     0     0
    c4t18d0    ONLINE     0     0     0
  mirror-3     ONLINE     0     0     0
    c4t24d0    ONLINE     0     0     0
    c4t30d0    ONLINE     0     0     0
  mirror-4     ONLINE     0     0     0
    c4t11d0    ONLINE     0     0     0
    c4t17d0    ONLINE     0     0     0
    c4t10d0    ONLINE     0     0     0
  mirror-5     ONLINE     0     0     0
    c4t23d0    ONLINE     0     0     0
    c4t29d0    ONLINE     0     0     0

Here's the output from iostat -En:

c6d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: WDC WD3200BEKT- Revision: Serial No: WD-WXR1A30 Size: 320.07GB <320070352896 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c7d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: WDC WD3200BEKT- Revision: Serial No: WD-WXR1A30 Size: 320.07GB <320070352896 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c4t12d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0003 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t13d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0002 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t18d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0003 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t19d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0002 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t24d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0003 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t25d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0002 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t30d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0003 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t31d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0002 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t17d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD20EADS-32S Revision: 0A01 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t11d0 Soft Errors: 0 Hard Errors: 17 Transport Errors: 116
Vendor: ATA Product: WDC WD20EADS-32S Revision: 5G04 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 17 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t23d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t29d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t10d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD204UI Revision: 0001 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

And a sample from dmesg:

Jan 1 10:26:28 SAN Log info 0x31123000 received for target 11.
Jan 1 10:26:28 SAN scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Jan 1 10:26:28 SAN scsi: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci15d9,a580@0 (mpt0):
Jan 1 10:26:28 SAN Log info 0x31123000 received for target 11.
Jan 1 10:26:28 SAN scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Jan 1 10:26:28 SAN scsi: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci15d9,a580@0 (mpt0):
Jan 1 10:26:28 SAN Log info 0x31123000 received for target 11.
Jan 1 10:26:28 SAN scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

What do they mean? It can't be that most of my SAMSUNG drives are failing, can it? They almost all have the same number of errors, which is weird. Could this be caused by the fact that these SAMSUNG drives have 4K sectors? 'zpool status' reports no errors, although it did report a checksum error a while back on a drive, which I cleared.

Any help greatly appreciated!

Thanks
-- This message posted from opensolaris.org
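For anyone who wants to compare the counters across drives without reading the whole listing, here is a minimal sketch that groups the iostat -En records by model. It assumes the record layout shown in the post above (device name, the three error counters, then either "Vendor: ... Product: ..." or "Model: ..."); the function name and the embedded sample are illustrative only, and other Solaris releases may print the fields differently.

```python
import re
from collections import defaultdict

# One record per device: name, three error counters, then the model string.
# Assumption: SCSI-style disks print "Vendor: X Product: <model>", while
# IDE-style disks print "Model: <model>", as in the listing above.
DEVICE_RE = re.compile(
    r"(?P<dev>c\d+(?:t\d+)?d\d+)\s+"
    r"Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+"
    r"Transport Errors:\s*(?P<transport>\d+)\s+"
    r"(?:Vendor:\s*\S+\s+Product|Model):\s*(?P<model>.+?)\s+Revision:"
)


def hard_errors_by_model(iostat_text):
    """Return {model: [(device, hard_errors, transport_errors), ...]}."""
    grouped = defaultdict(list)
    for m in DEVICE_RE.finditer(iostat_text):
        grouped[m.group("model")].append(
            (m.group("dev"), int(m.group("hard")), int(m.group("transport")))
        )
    return dict(grouped)


# A few records copied from the post, enough to exercise both layouts.
SAMPLE = """\
c6d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: WDC WD3200BEKT- Revision: Serial No: WD-WXR1A30 Size: 320.07GB <320070352896 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c4t12d0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0003 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t19d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD203WI Revision: 0002 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t11d0 Soft Errors: 0 Hard Errors: 17 Transport Errors: 116
Vendor: ATA Product: WDC WD20EADS-32S Revision: 5G04 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 17 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

if __name__ == "__main__":
    for model, rows in sorted(hard_errors_by_model(SAMPLE).items()):
        for dev, hard, transport in rows:
            print(f"{model:20s} {dev:8s} hard={hard:4d} transport={transport:4d}")
```

In practice the text would come from saving the real iostat -En output to a file and reading it in. Identical counts on many drives (252 here) point toward a shared cause rather than independent media failures, but the script only tabulates; it does not diagnose.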
Maybe a cable is loose? Reinsert all the cables into all drives? And the controller card? Yes, ZFS detects such problems.
Thanks for the input! I am using an iPass-to-iPass cable that connects my HBA to my backplane. It was firmly locked into both connectors. I offlined 2 supposedly faulty SAMSUNG drives and scanned their whole surface using estools, which did not report any errors. I'm starting to think that it may be an issue with the mpt driver and the HBA card. Is anyone else using an LSI 1068E-based HBA card and having issues? Thanks
For anyone who is interested, here's a progress report.

I created a new pool with only one mirror vdev of 2 disks, both of them the new SAMSUNG HD204UI. These drives, along with the older HD203WI, use Advanced Format Technology (i.e. 4K sectors). Only these drives had hard errors in my pool, as opposed to the old Seagates and WDs. To create the new pool, I recompiled the zpool command to set ashift to 12, so that the new pool has an alignment of 4K instead of 512 bytes (see here: http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html).

I then filled this new 4K-aligned pool with 1.5TB of data and scrubbed it: no errors. I checked the log and there were no hard errors either. Usually after a scrub I get some hard errors. Maybe the pool needs to have more vdevs in it to really stress the HBA and produce hard errors, but it's a strange coincidence nonetheless that only the 4K drives had errors and that, when used in a 4K-aligned pool, they produced no more errors.

I'll probably re-create my original pool with only 4K drives in a 4K-aligned pool and see what happens.
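To make the ashift arithmetic concrete, here is a small sketch. It only assumes that ashift is the base-2 logarithm of the pool's minimum allocation size (so 9 means 512 bytes and 12 means 4096 bytes) and that the drive's physical sector is 4096 bytes; the function names are illustrative and not part of ZFS.

```python
PHYSICAL_SECTOR = 4096  # assumed physical sector size of a 4K drive, in bytes


def allocation_size(ashift):
    """Smallest allocation unit implied by an ashift value, in bytes."""
    return 1 << ashift


def is_aligned(offset_bytes, ashift):
    """True if a byte offset falls on a boundary of the given ashift."""
    return offset_bytes % allocation_size(ashift) == 0


def physical_sectors_touched(offset_bytes, length_bytes, sector=PHYSICAL_SECTOR):
    """Count the physical sectors covered by a write of length_bytes at offset_bytes."""
    first = offset_bytes // sector
    last = (offset_bytes + length_bytes - 1) // sector
    return last - first + 1


if __name__ == "__main__":
    print(allocation_size(9), allocation_size(12))   # 512 4096
    # Offset 1536 is a legal start for ashift=9 but not for ashift=12.
    print(is_aligned(1536, 9), is_aligned(1536, 12))  # True False
    # A 4096-byte write at that offset straddles two physical sectors,
    # so the drive has to read-modify-write both of them.
    print(physical_sectors_touched(1536, 4096))       # 2
    print(physical_sectors_touched(4096, 4096))       # 1
```

This shows why a 512-byte-aligned pool can cost extra work on a 4K drive. Whether that misalignment is what produced the hard errors is still only a hypothesis at this point.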
"Hard errors" are a generic classification. fmdump -eV shows the sense/asc/ascq, which is generally more useful for diagnosis. More below...

On Jan 1, 2011, at 7:50 AM, Benji wrote:
> Hi,
>
> I recently noticed that there are a lot of Hard Errors on multiple drives that's being reported by iostat. Also, dmesg reports various messages from the mpt driver.
>
> [configuration, zpool status and iostat -En output snipped]
>
> And a sample from dmesg:
>
> Jan 1 10:26:28 SAN Log info 0x31123000 received for target 11.
> Jan 1 10:26:28 SAN scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
> Jan 1 10:26:28 SAN scsi: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci15d9,a580@0 (mpt0):
> Jan 1 10:26:28 SAN Log info 0x31123000 received for target 11.
> Jan 1 10:26:28 SAN scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
> Jan 1 10:26:28 SAN scsi: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci15d9,a580@0 (mpt0):
> Jan 1 10:26:28 SAN Log info 0x31123000 received for target 11.
> Jan 1 10:26:28 SAN scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

This is the unit explaining that it aborted a command. This can be due to a bus reset, which is, by default, part of the recovery process. The default bus reset can be changed, as documented in the sd man page.

> What do they mean? It can't be that most of my SAMSUNG drives are failing? They almost all have the same number of errors, which is weird. Could this be caused by the fact that these SAMSUNG drives have 4K sectors? 'zpool status' reports no errors, although it did report a checksum error a while back on a drive, which I cleared.

In my experience, this looks like a set of devices sitting behind an expander. I have seen one bad disk take out all disks sitting behind an expander. I have also seen bad disk firmware take out all disks behind an expander. I once saw a bad cable take out everything.
 -- richard
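For readers who want to pick the mpt codes apart themselves, here is a hedged sketch that splits the two hex values from the dmesg sample into bit fields. The layout is an assumption based on my recollection of the LSI Fusion-MPT (MPI) headers used by open-source mpt drivers: bit 15 of ioc_status flags that log info is available, and a SAS log info word is packed as bus type (4 bits), originator (4 bits), code (8 bits), subcode (16 bits). The function names are made up for illustration; check the field meanings against the MPI headers before relying on them.

```python
# Assumed MPI constants (not taken from the thread; verify against the headers).
IOCSTATUS_FLAG_LOG_INFO_AVAILABLE = 0x8000
IOCSTATUS_MASK = 0x7FFF


def split_ioc_status(value):
    """Separate the 'log info available' flag from the status code."""
    return {
        "log_info_available": bool(value & IOCSTATUS_FLAG_LOG_INFO_AVAILABLE),
        "status": value & IOCSTATUS_MASK,
    }


def split_sas_loginfo(value):
    """Split a 32-bit SAS log info word into its assumed bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # assumed: 0x3 means SAS
        "originator": (value >> 24) & 0xF,  # which firmware layer raised it
        "code": (value >> 16) & 0xFF,
        "subcode": value & 0xFFFF,
    }


if __name__ == "__main__":
    status = split_ioc_status(0x804B)
    info = split_sas_loginfo(0x31123000)
    print("ioc_status:", {k: (hex(v) if k == "status" else v) for k, v in status.items()})
    print("log info:  ", {k: hex(v) for k, v in info.items()})
```

If I remember the headers correctly, status 0x004b is the "SCSI IOC terminated" value, which would be consistent with the reading above that the controller aborted the command. Treat that mapping as unverified.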
Richard Elling <richard.elling at gmail.com> writes:
> In my experience, this looks like a set of devices sitting behind an
> expander. I have seen one bad disk take out all disks sitting behind
> an expander. I have also seen bad disk firmware take out all disks
> behind an expander. I once saw a bad cable take out everything.
> -- richard

In my experience, I've also seen the same problems: a lot of SATA disks (Seagate Barracuda ES.2 and others), all behind expanders (Supermicro SC847 chassis). The issues were solved after we removed all SATA disks from behind our expander and replaced them with enterprise SAS disks. After that, we only faced these problems when a connected SATA SSD died, so we also moved our SATA SSDs away from this backplane and connected them directly to the 1068-based controller.

The problem arose after we moved an identical server to an expander backplane (to get more drives connected). Before this, the disks had been running for months without any problems while *directly* attached.

Regards,
daniel