After massive problems with a Supermicro X7DBE box using AOC-SAT2-MV8 Marvell controllers and OpenSolaris snv79 (the same as described here: http://sunsolve.sun.com/search/document.do?assetkey=1-66-233341-1), we started over with new hardware and OpenSolaris 2008.05 upgraded to snv94. We again used a Supermicro X7DBE, but now with two LSI SAS3081E SAS controllers. And guess what? Now we get these error messages in /var/adm/messages:

Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2690@1c/pci1000,3140@0/sd@5,0 (sd11):
Aug 11 18:20:52 thumper2 Error for Command: read(10) Error Level: Retryable
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice] Requested Block: 1423173120 Error Block: 1423173120
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: WD-WCAP
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice] Sense Key: Unit_Attention
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Along with these messages there are a lot of messages like this:

Aug 11 18:20:51 thumper2 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2690@1c/pci1000,3140@0 (mpt1):
Aug 11 18:20:51 thumper2 Log info 0x31123000 received for target 5.
Aug 11 18:20:51 thumper2 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

I could believe in one faulty disk, but not two:

Aug 11 17:47:47 thumper2 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2690@1c/pci1000,3140@0 (mpt1):
Aug 11 17:47:47 thumper2 Log info 0x31123000 received for target 4.
Aug 11 17:47:47 thumper2 scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2690@1c/pci1000,3140@0/sd@4,0 (sd10):
Aug 11 17:47:48 thumper2 Error for Command: read(10) Error Level: Retryable
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice] Requested Block: 252165120 Error Block: 252165120
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice] Sense Key: Unit_Attention
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 11 17:48:34 thumper2 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,25f9@6/pci1000,3140@0 (mpt0):

Does somebody know what is going on here?
I have checked the disks with iostat -En:

-bash-3.2# iostat -En
...
c4t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MBA3073RC Revision: 0103 Serial No:
Size: 73.54GB <73543163904 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t5d0 Soft Errors: 4 Hard Errors: 24 Transport Errors: 179
Vendor: ATA Product: ST3750330NS Revision: SN04 Serial No:
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 22 Recoverable: 4
Illegal Request: 0 Predictive Failure Analysis: 0
c4t6d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD7500AYYS-0 Revision: 4G30 Serial No:
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c6t4d0 Soft Errors: 6 Hard Errors: 17 Transport Errors: 466
Vendor: ATA Product: ST3750640NS Revision: G Serial No:
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 17 Recoverable: 6
Illegal Request: 0 Predictive Failure Analysis: 0
c6t5d0 Soft Errors: 2 Hard Errors: 23 Transport Errors: 539
Vendor: ATA Product: WDC WD7500AYYS-0 Revision: 4G30 Serial No:
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 23 Recoverable: 2
Illegal Request: 0 Predictive Failure Analysis: 0

I have checked the drives with smartctl:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f 115 075 006 Pre-fail Always - 94384069
  3 Spin_Up_Time 0x0003 093 093 000 Pre-fail Always - 0
  4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 15
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
  7 Seek_Error_Rate 0x000f 084 060 030 Pre-fail Always - 263091894
  9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 4050
 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 22
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 062 045 Old_age Always - 32 (Lifetime Min/Max 30/34)
194 Temperature_Celsius 0x0022 032 040 000 Old_age Always - 32 (0 25 0 0)
195 Hardware_ECC_Recovered 0x001a 065 056 000 Old_age Always - 173161329
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0

But with no UDMA_CRC_Errors I believe the disks are fine.
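To see at a glance how the errors in the iostat -En listing above are distributed, the per-device header lines can be tallied by disk and by controller. The following is only a rough Python sketch: the function names (parse_iostat_en, by_controller) are made up for illustration, and the sample text is just the "Soft/Hard/Transport Errors" lines copied from the listing. It assumes the device name, the three counters and their labels all sit on one line, as they do in the output above.

```python
import re

# Matches the first line of each device record in `iostat -En` output,
# e.g. "c4t5d0 Soft Errors: 4 Hard Errors: 24 Transport Errors: 179".
HEADER = re.compile(
    r"^(?P<dev>c\d+t\d+d\d+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def parse_iostat_en(text):
    """Return one dict per device header line found in the text."""
    devices = []
    for line in text.splitlines():
        m = HEADER.match(line.strip())
        if m:
            devices.append({
                "dev": m.group("dev"),
                "soft": int(m.group("soft")),
                "hard": int(m.group("hard")),
                "transport": int(m.group("transport")),
            })
    return devices


def by_controller(devices):
    """Sum transport errors per controller (the cN prefix of the device name)."""
    totals = {}
    for d in devices:
        ctrl = d["dev"].split("t")[0]
        totals[ctrl] = totals.get(ctrl, 0) + d["transport"]
    return totals


# Header lines copied from the iostat -En output in the post.
SAMPLE = """\
c4t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c4t5d0 Soft Errors: 4 Hard Errors: 24 Transport Errors: 179
c4t6d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t4d0 Soft Errors: 6 Hard Errors: 17 Transport Errors: 466
c6t5d0 Soft Errors: 2 Hard Errors: 23 Transport Errors: 539
"""

devices = parse_iostat_en(SAMPLE)

# Disks with at least one transport error, worst first.
affected = sorted(
    (d for d in devices if d["transport"] > 0),
    key=lambda d: d["transport"],
    reverse=True,
)
for d in affected:
    print(d["dev"], d["transport"])

print(by_controller(devices))
```

On the numbers posted, the affected disks sit on both controllers (c4 and c6) and cover three different drive models, while a second WD7500AYYS at c4t6d0 shows no errors at all.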
Frank Fischer wrote:
> [...]
> Does somebody know what is going on here?
> [...]
> But with no UDMA_CRC_Errors I believe the disks are fine.

Could it be that you have faulty cables? I'm using an LSI SAS controller (4 port variant) on SPARC, and it works like a charm. The only problem I'm observing is during boot time: the mpt driver is resetting/initializing all buses twice.
This takes quite some time, but finally the machine comes up without a problem. The messages appearing in syslog are of the following form:

Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin initiator SCSI ID now 7
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin Rev. 1 LSI, Inc. 1064 found.
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin mpt2 supports power management.
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin mpt2 Firmware version v0.3.1e.0 (IR)
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin mpt2: IOC Operational.
Aug 12 11:47:43 azalin scsi: [ID 243001 kern.info] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:43 azalin mpt2: Initiator WWNs: 0x500062b000005e88-0x500062b000005e8b

But as I said - once the system is up and running it works perfectly.

- Thomas
ff> I have checked the drives with smartctl:

ff> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ff>   1 Raw_Read_Error_Rate 0x000f 115 075 006 Pre-fail Always - 94384069
ff>   5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
ff> 195 Hardware_ECC_Recovered 0x001a 065 056 000 Old_age Always - 173161329
ff> 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

ff> But with no UDMA_CRC_Errors I believe the disks are fine.

No. UDMA_CRC_Errors counts checksum errors on PATA cables. I cannot confirm or deny whether it also counts CRC errors on SATA cables (and even if it did, things are complicated by weird proprietary SCSI-emulation drivers, port multipliers, and so on). So if you are having problems and that parameter is increasing, it is probably a cabling problem, not a drive problem. Conversely, a count of zero tells you nothing about the drives themselves.

The other three values I quoted are the ones that matter. The VALUE is scaled by constants defined by the manufacturer and used for the "overall health assessment", but the constants they use are always way too forgiving, so it's worthless. The RAW_VALUE looks bigger than I'm used to, but this may also be meaningless. The only way I know to get information out of the report is to compare the RAW_VALUEs of the three parameters I quoted with those of other drives of the same model, or with this drive's own values before it started failing.

There is another section of the smartctl -a report that logs the last 5 or so errors the drive has reported to the host. IIRC you will see errors called 'ICRC' or 'UNC' on failing drives.

This experience is all PATA/SATA-specific.
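The comparison suggested above (raw values of one drive against other drives of the same model) could be scripted roughly as follows. This is a minimal Python sketch, not a tested tool: parse_raw_values and ratio_to_peers are names invented here, it assumes the ten-column attribute table layout shown in the quoted report, and it only handles rows whose RAW_VALUE starts with a plain integer. The sample text is the four rows quoted above; data for peer drives is not in this thread and has to be collected separately.

```python
import statistics

# The three attributes singled out in the reply above.
WATCHED = (
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Hardware_ECC_Recovered",
)


def parse_raw_values(report):
    """Return {attribute_name: raw_value} from a smartctl attribute table."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; column 10 is RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def ratio_to_peers(drive, peers, names=WATCHED):
    """For each watched attribute, divide the drive's raw value by the
    median raw value of the peer drives. The result is None when there
    are no peer values or the peer median is zero."""
    ratios = {}
    for name in names:
        if name not in drive:
            continue
        peer_vals = [p[name] for p in peers if name in p]
        median = statistics.median(peer_vals) if peer_vals else 0
        ratios[name] = drive[name] / median if median else None
    return ratios


# Rows quoted from the posted smartctl report.
POSTED = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f 115 075 006 Pre-fail Always - 94384069
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
195 Hardware_ECC_Recovered 0x001a 065 056 000 Old_age Always - 173161329
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

posted = parse_raw_values(POSTED)
print({name: posted[name] for name in WATCHED})
```

With tables saved from healthy drives of the same model, ratio_to_peers(posted, [parse_raw_values(t) for t in peer_tables]) would give one ratio per watched attribute; peer_tables here is a hypothetical list of such saved reports.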