thr3ads.net - zfs discuss - [zfs-discuss] ZFS, SATA, LSI and stability [Aug 2008]

Frank Fischer

2008-Aug-12 14:17 UTC

[zfs-discuss] ZFS, SATA, LSI and stability

After having massive problems with a Supermicro X7DBE box using AOC-SAT2-MV8
Marvell controllers and OpenSolaris snv79 (the same problems as described here:
http://sunsolve.sun.com/search/document.do?assetkey=1-66-233341-1), we started
over with new hardware and OpenSolaris 2008.05 upgraded to snv94. We again used
a Supermicro X7DBE, but now with two LSI SAS3081E SAS controllers. And guess
what? Now we get these error messages in /var/adm/messages:

Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2690@1c/pci1000,3140@0/sd@5,0 (sd11):
Aug 11 18:20:52 thumper2        Error for Command: read(10)                Error Level: Retryable
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice]  Requested Block: 1423173120                Error Block: 1423173120
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice]  Vendor: ATA                                Serial Number:      WD-WCAP
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice]  Sense Key: Unit_Attention
Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.notice]  ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Along with these messages there are a lot of messages like this:

Aug 11 18:20:51 thumper2 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2690@1c/pci1000,3140@0 (mpt1):
Aug 11 18:20:51 thumper2        Log info 0x31123000 received for target 5.
Aug 11 18:20:51 thumper2        scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc


I could believe that one disk is faulty, but not two:

Aug 11 17:47:47 thumper2 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,2690@1c/pci1000,3140@0 (mpt1):
Aug 11 17:47:47 thumper2        Log info 0x31123000 received for target 4.
Aug 11 17:47:47 thumper2        scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2690@1c/pci1000,3140@0/sd@4,0 (sd10):
Aug 11 17:47:48 thumper2        Error for Command: read(10)                Error Level: Retryable
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice]  Requested Block: 252165120                 Error Block: 252165120
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice]  Vendor: ATA                                Serial Number:
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice]  Sense Key: Unit_Attention
Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.notice]  ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 11 17:48:34 thumper2 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,25f9@6/pci1000,3140@0 (mpt0):
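
To check whether the affected disks share a controller, the sd WARNING lines can be grouped by their parent device path. The following is only a minimal Python sketch, not something from the original post: it assumes unwrapped syslog lines with "@" in the device paths, as they appear in /var/adm/messages, and the names SD_WARNING and disks_per_controller are made up for illustration.

```python
import re
from collections import defaultdict

# sd driver warnings end in "<controller path>/sd@<target>,<lun> (sdN):".
SD_WARNING = re.compile(r"WARNING: (/\S+)/sd@\S+ \((sd\d+)\):")


def disks_per_controller(lines):
    """Map controller device path -> set of sd instances that logged a WARNING."""
    result = defaultdict(set)
    for line in lines:
        match = SD_WARNING.search(line)
        if match:
            controller, disk = match.groups()
            result[controller].add(disk)
    return dict(result)


# Sample lines taken from the log excerpts in this post.
SAMPLE = [
    "Aug 11 18:20:52 thumper2 scsi: [ID 107833 kern.warning] WARNING: "
    "/pci@0,0/pci8086,2690@1c/pci1000,3140@0/sd@5,0 (sd11):",
    "Aug 11 18:20:52 thumper2        Error for Command: read(10)"
    "                Error Level: Retryable",
    "Aug 11 17:47:48 thumper2 scsi: [ID 107833 kern.warning] WARNING: "
    "/pci@0,0/pci8086,2690@1c/pci1000,3140@0/sd@4,0 (sd10):",
    # Controller-level warning (no sd node), ignored by the pattern.
    "Aug 11 17:48:34 thumper2 scsi: [ID 243001 kern.warning] WARNING: "
    "/pci@0,0/pci8086,25f9@6/pci1000,3140@0 (mpt0):",
]

if __name__ == "__main__":
    for controller, disks in disks_per_controller(SAMPLE).items():
        print(controller, sorted(disks))
```

If several disks behind the same controller path show up, that would point toward something they share (controller, cabling, backplane, power) rather than toward the individual drives, though the sketch itself cannot tell these apart.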


Does somebody know what is going on here?
I have checked the disks with iostat -En :

-bash-3.2# iostat -En
...
c4t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: FUJITSU  Product: MBA3073RC        Revision: 0103 Serial No:  
Size: 73.54GB <73543163904 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
c4t5d0           Soft Errors: 4 Hard Errors: 24 Transport Errors: 179 
Vendor: ATA      Product: ST3750330NS      Revision: SN04 Serial No:  
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 22 Recoverable: 4 
Illegal Request: 0 Predictive Failure Analysis: 0 
c4t6d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA      Product: WDC WD7500AYYS-0 Revision: 4G30 Serial No:  
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
c6t4d0           Soft Errors: 6 Hard Errors: 17 Transport Errors: 466 
Vendor: ATA      Product: ST3750640NS      Revision: G    Serial No:  
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 17 Recoverable: 6 
Illegal Request: 0 Predictive Failure Analysis: 0 
c6t5d0           Soft Errors: 2 Hard Errors: 23 Transport Errors: 539 
Vendor: ATA      Product: WDC WD7500AYYS-0 Revision: 4G30 Serial No:  
Size: 750.16GB <750156374016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 23 Recoverable: 2 
Illegal Request: 0 Predictive Failure Analysis: 0 
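
With many disks it can help to rank the iostat -En summary lines by transport error count. This is a rough Python sketch under the assumption that each device summary line looks exactly like the ones above; DEVICE_LINE and rank_by_transport_errors are hypothetical names, not part of any Solaris tool.

```python
import re

# Matches the first line of each device block in `iostat -En` output.
DEVICE_LINE = re.compile(
    r"^(\S+)\s+Soft Errors: (\d+) Hard Errors: (\d+) Transport Errors: (\d+)"
)


def rank_by_transport_errors(iostat_output):
    """Return (device, soft, hard, transport) tuples, highest transport count first."""
    rows = []
    for line in iostat_output.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            soft, hard, transport = (int(n) for n in match.groups()[1:])
            rows.append((match.group(1), soft, hard, transport))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Device summary lines copied from the output above.
SAMPLE = """\
c4t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c4t5d0           Soft Errors: 4 Hard Errors: 24 Transport Errors: 179
c4t6d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c6t4d0           Soft Errors: 6 Hard Errors: 17 Transport Errors: 466
c6t5d0           Soft Errors: 2 Hard Errors: 23 Transport Errors: 539
"""

if __name__ == "__main__":
    for device, soft, hard, transport in rank_by_transport_errors(SAMPLE):
        print(f"{device:8} soft={soft:<3} hard={hard:<3} transport={transport}")
```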

I have checked the drives with smartctl:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   075   006    Pre-fail  Always       -       94384069
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       263091894
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4050
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       22
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   062   045    Old_age   Always       -       32 (Lifetime Min/Max 30/34)
194 Temperature_Celsius     0x0022   032   040   000    Old_age   Always       -       32 (0 25 0 0)
195 Hardware_ECC_Recovered  0x001a   065   056   000    Old_age   Always       -       173161329
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

But with no UDMA_CRC_Errors I believe the disks are fine.

Thomas Maier-Komor

2008-Aug-12 14:29 UTC

Frank Fischer wrote:
> [...] We again used a Supermicro X7DBE, but now with two LSI SAS3081E
> SAS controllers. And guess what? Now we get these error messages in
> /var/adm/messages:
> [...]
> But with no UDMA_CRC_Errors I believe the disks are fine.

Could it be that you have faulty cables? I'm using an LSI SAS controller
(4 port variant) on SPARC, and it works like a charm.

The only problem I'm observing is at boot time: the mpt driver
resets/initializes all buses twice. This takes quite some time, but
the machine eventually comes up without a problem. The messages appearing
in syslog are of the following form:
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin  initiator SCSI ID now 7
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin  Rev. 1 LSI, Inc. 1064 found.
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin  mpt2 supports power management.
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin  mpt2 Firmware version v0.3.1e.0 (IR)
Aug 12 11:47:28 azalin scsi: [ID 365881 kern.notice] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:28 azalin  mpt2: IOC Operational.
Aug 12 11:47:43 azalin scsi: [ID 243001 kern.info] /pci@1d,700000/scsi@2 (mpt2):
Aug 12 11:47:43 azalin  mpt2: Initiator WWNs: 0x500062b000005e88-0x500062b000005e8b

But as I said - once the system is up and running it works perfectly.

- Thomas

Miles Nordin

2008-Aug-12 20:44 UTC

    ff> I have check the drives with smartctl:

    ff> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    ff>   1 Raw_Read_Error_Rate     0x000f   115   075   006    Pre-fail  Always       -       94384069
    ff>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
    ff> 195 Hardware_ECC_Recovered  0x001a   065   056   000    Old_age   Always       -       173161329
    ff> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

    ff> But with no UDMA_CRC_Errors I believe the disks are fine.

No, UDMA_CRC_Error_Count counts checksum errors on PATA cables. I cannot
confirm or deny whether it also counts CRC errors on SATA cables (and even if
it does, this is complicated by weird proprietary SCSI-emulation drivers,
port multipliers, and so on). So if you are having problems and that
parameter is increasing, then it's probably a cabling problem, not a drive
problem. A count of zero does not show that the disks are fine.

The other three values I quoted are the ones that matter. The VALUE
is scaled by constants defined by the manufacturer and used for the
"overall health assessment", but the constants they use are always
way too forgiving, so it's worthless. The RAW_VALUE looks bigger than
I'm used to, but this may also be meaningless. The only way I know to
get information out of the report is: how do the RAW_VALUEs of the
three parameters I quoted compare with other drives of the same model,
or with this drive before it started failing?
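
The comparison described above can be sketched roughly in Python. This is only an illustration under assumptions: it expects attribute rows in the ten-column layout shown in the quoted smartctl output, parse_smart_attributes and compare_raw_values are hypothetical helper names, and the second drive's numbers are invented placeholders, not measurements from this thread.

```python
def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value as int) for smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) == 10 and fields[0].isdigit():
            raw = fields[9].split()[0]
            if raw.isdigit():
                attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def compare_raw_values(drives, attribute_ids=(1, 5, 195)):
    """Return rows of (id, name, {drive label: raw value}) for the chosen attributes."""
    rows = []
    for attr_id in attribute_ids:
        name = None
        values = {}
        for label, attrs in drives.items():
            if attr_id in attrs:
                name, values[label] = attrs[attr_id]
        rows.append((attr_id, name, values))
    return rows


# Rows quoted from the original post.
SUSPECT = """\
  1 Raw_Read_Error_Rate     0x000f   115   075   006    Pre-fail  Always       -       94384069
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   065   056   000    Old_age   Always       -       173161329
"""

# Hypothetical reference drive of the same model: placeholder numbers only.
REFERENCE = """\
  1 Raw_Read_Error_Rate     0x000f   115   075   006    Pre-fail  Always       -       90000000
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   065   056   000    Old_age   Always       -       170000000
"""

if __name__ == "__main__":
    drives = {
        "suspect": parse_smart_attributes(SUSPECT),
        "reference": parse_smart_attributes(REFERENCE),
    }
    for attr_id, name, values in compare_raw_values(drives):
        print(attr_id, name, values)
```

The same parser could be run on saved reports from one drive over time, to compare it with itself before it started misbehaving.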

There is another section of the smartctl -a report that logs the last
5 or so errors the drive has reported to the host. IIRC you will see
errors called 'ICRC' or 'UNC' on failing drives.

This experience is all PATA/SATA-specific.