Hi all,

Recently I upgraded from snv_118 to snv_125, and suddenly I started to see these messages in /var/adm/messages:

Oct 22 12:54:37 SAN02 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Oct 22 12:54:37 SAN02     mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3112011a
Oct 22 12:56:47 SAN02 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Oct 22 12:56:47 SAN02     mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3112011a
Oct 22 12:56:47 SAN02 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Oct 22 12:56:47 SAN02     mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3112011a
Oct 22 12:56:50 SAN02 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Oct 22 12:56:50 SAN02     mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3112011a
Oct 22 12:56:50 SAN02 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@a/pci1000,30a0@0 (mpt0):
Oct 22 12:56:50 SAN02     mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3112011a

Is this a symptom of a disk error, or was some change made in the driver so that I now see information that didn't appear in the past?

Thanks,
Bruno

I'm using an LSI Logic SAS1068E B3, and within lsiutil I see this behaviour:

1 MPT Port found

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  mpt0              LSI Logic SAS1068E B3     105      011a0000     0

Select a device:  [1-1 or 0 to quit] 1

  1.  Identify firmware, BIOS, and/or FCode
  2.  Download firmware (update the FLASH)
  4.  Download/erase BIOS and/or FCode (update the FLASH)
  8.  Scan for devices
 10.  Change IOC settings (interrupt coalescing)
 13.  Change SAS IO Unit settings
 16.  Display attached devices
 20.  Diagnostics
 21.  RAID actions
 22.  Reset bus
 23.  Reset target
 42.  Display operating system names for devices
 45.  Concatenate SAS firmware and NVDATA files
 59.  Dump PCI config space
 60.  Show non-default settings
 61.  Restore default settings
 66.  Show SAS discovery errors
 69.  Show board manufacturing information
 97.  Reset SAS link, HARD RESET
 98.  Reset SAS link
 99.  Reset port
  e   Enable expert mode in menus
  p   Enable paged mode
  w   Enable logging

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 20

  1.  Inquiry Test
  2.  WriteBuffer/ReadBuffer/Compare Test
  3.  Read Test
  4.  Write/Read/Compare Test
  8.  Read Capacity / Read Block Limits Test
 12.  Display phy counters
 13.  Clear phy counters
 14.  SATA SMART Read Test
 15.  SEP (SCSI Enclosure Processor) Test
 18.  Report LUNs Test
 19.  Drive firmware download
 20.  Expander firmware download
 21.  Read Logical Blocks
 99.  Reset port
  e   Enable expert mode in menus
  p   Enable paged mode
  w   Enable logging

Diagnostics menu, select an option:  [1-99 or e/p/w or 0 to quit] 12
Adapter Phy 0:  Link Down, No Errors
Adapter Phy 1:  Link Down, No Errors
Adapter Phy 2:  Link Down, No Errors
Adapter Phy 3:  Link Down, No Errors
Adapter Phy 4:  Link Up, No Errors
Adapter Phy 5:  Link Up, No Errors
Adapter Phy 6:  Link Up, No Errors
Adapter Phy 7:  Link Up, No Errors

Expander (Handle 0009) Phy 0:  Link Up
  Invalid DWord Count              79,967,229
  Running Disparity Error Count    63,036,893
  Loss of DWord Synch Count               113
  Phy Reset Problem Count                   0

Expander (Handle 0009) Phy 1:  Link Up
  Invalid DWord Count              79,967,207
  Running Disparity Error Count    78,339,626
  Loss of DWord Synch Count               113
  Phy Reset Problem Count                   0

Expander (Handle 0009) Phy 2:  Link Up
  Invalid DWord Count              76,717,646
  Running Disparity Error Count    73,334,563
  Loss of DWord Synch Count               113
  Phy Reset Problem Count                   0

Expander (Handle 0009) Phy 3:  Link Up
  Invalid DWord Count              79,896,409
  Running Disparity Error Count    76,199,329
  Loss of DWord Synch Count               113
  Phy Reset Problem Count                   0

Expander (Handle 0009) Phy 4:  Link Up, No Errors
Expander (Handle 0009) Phy 5:  Link Up, No Errors
Expander (Handle 0009) Phy 6:  Link Up, No Errors
Expander (Handle 0009) Phy 7:  Link Up, No Errors
Expander (Handle 0009) Phy 8:  Link Up, No Errors
Expander (Handle 0009) Phy 9:  Link Up, No Errors
Expander (Handle 0009) Phy 10:  Link Up, No Errors
Expander (Handle 0009) Phy 11:  Link Up, No Errors
Expander (Handle 0009) Phy 12:  Link Up, No Errors
Expander (Handle 0009) Phy 13:  Link Up, No Errors
Expander (Handle 0009) Phy 14:  Link Up, No Errors
Expander (Handle 0009) Phy 15:  Link Up, No Errors
Expander (Handle 0009) Phy 16:  Link Up, No Errors
Expander (Handle 0009) Phy 17:  Link Up, No Errors
Expander (Handle 0009) Phy 18:  Link Up, No Errors
Expander (Handle 0009) Phy 19:  Link Up, No Errors
Expander (Handle 0009) Phy 20:  Link Down, No Errors
Expander (Handle 0009) Phy 21:  Link Down, No Errors

Expander (Handle 0009) Phy 22:  Link Up
  Invalid DWord Count                 743,980
  Running Disparity Error Count        38,796
  Loss of DWord Synch Count                 1
  Phy Reset Problem Count                   0

Expander (Handle 0009) Phy 23:  Link Down, No Errors
Expander (Handle 0009) Phy 24:  Link Down, No Errors

Expander (Handle 0009) Phy 25:  Link Down
  Invalid DWord Count                   1,755
  Running Disparity Error Count           408
  Loss of DWord Synch Count                 0
  Phy Reset Problem Count                   0

Expander (Handle 0009) Phy 26:  Link Down
  Invalid DWord Count                   1,127
  Running Disparity Error Count         1,022
  Loss of DWord Synch Count                 0
  Phy Reset Problem Count                   0

Expander (Handle 0009) Phy 27:  Link Down, No Errors
Expander (Handle 0009) Phy 28:  Link Down, No Errors
Expander (Handle 0009) Phy 29:  Link Down, No Errors
Expander (Handle 0009) Phy 30:  Link Down, No Errors
Expander (Handle 0009) Phy 31:  Link Down, No Errors
Expander (Handle 0009) Phy 32:  Link Down, No Errors
Expander (Handle 0009) Phy 33:  Link Down, No Errors
Expander (Handle 0009) Phy 34:  Link Down, No Errors
Expander (Handle 0009) Phy 35:  Link Down, No Errors
Expander (Handle 0009) Phy 36:  Link Up, No Errors
Expander (Handle 0009) Phy 37:  Link Down, No Errors
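A note on the counters above: the dump is a point-in-time snapshot, so the more useful question is whether the counts are still climbing. A minimal sketch for checking that, assuming this build of lsiutil will accept its menu selections on stdin and that the menu numbers match the session shown above (1 = port, 20 = Diagnostics, 12 = Display phy counters); verify interactively first:

    # take two snapshots ten minutes apart and diff them; any counter
    # that differs is still accumulating errors on a live link
    printf '1\n20\n12\n0\n0\n' | lsiutil > /tmp/phy.1
    sleep 600
    printf '1\n20\n12\n0\n0\n' | lsiutil > /tmp/phy.2
    diff /tmp/phy.1 /tmp/phy.2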
Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 42

mpt0 is /dev/cfg/c5

 B___T___L  Type        Operating System Device Name
ScsiIo to Bus 0 Target 8 failed, IOCStatus = 004b (IOC Terminated)
 0  10   0  Disk        /dev/rdsk/c5t10d0s2
 0  11   0  Disk        /dev/rdsk/c5t11d0s2
 0  12   0  Disk        /dev/rdsk/c5t12d0s2
 0  13   0  Disk        /dev/rdsk/c5t13d0s2
 0  14   0  Disk        /dev/rdsk/c5t14d0s2
 0  15   0  Disk        /dev/rdsk/c5t15d0s2
 0  16   0  Disk        /dev/rdsk/c5t16d0s2
 0  17   0  Disk        /dev/rdsk/c5t17d0s2
 0  18   0  Disk        /dev/rdsk/c5t18d0s2
 0  19   0  Disk        /dev/rdsk/c5t19d0s2
 0  20   0  Disk        /dev/rdsk/c5t20d0s2
 0  21   0  Disk        /dev/rdsk/c5t21d0s2
 0  22   0  Disk        /dev/rdsk/c5t22d0s2
 0  23   0  Disk        /dev/rdsk/c5t23d0s2
 0  24   0  Disk        /dev/rdsk/c5t24d0s2
 0  25   0  Disk        /dev/rdsk/c5t25d0s2
 0  26   0  Disk        /dev/rdsk/c5t26d0s2

iostat -En gives:

c4t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SEAGATE ST32500N Revision: 3AZQ Serial No:
Size: 250.06GB <250056000000 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 281 Predictive Failure Analysis: 0
c4t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SEAGATE ST32502N Revision: SU0D Serial No:
Size: 250.06GB <250056000000 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 285 Predictive Failure Analysis: 0
c3t0d0 Soft Errors: 0 Hard Errors: 9 Transport Errors: 0
Vendor: TEAC Product: DV-28E-V Revision: 1.AC Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 9 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c5t10d0 Soft Errors: 18 Hard Errors: 1 Transport Errors: 9
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 18 Illegal Request: 8 Predictive Failure Analysis: 0
c5t11d0 Soft Errors: 18 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 18 Illegal Request: 8 Predictive Failure Analysis: 0
c5t12d0 Soft Errors: 18 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 18 Illegal Request: 8 Predictive Failure Analysis: 0
c5t13d0 Soft Errors: 18 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 18 Illegal Request: 8 Predictive Failure Analysis: 0
c5t14d0 Soft Errors: 16 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 16 Illegal Request: 8 Predictive Failure Analysis: 0
c5t15d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t16d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t17d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t18d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t19d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t20d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t21d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t22d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t23d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t24d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t25d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0
c5t26d0 Soft Errors: 12 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: CC1H Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 12 Illegal Request: 6 Predictive Failure Analysis: 0

Thank you,
Bruno
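With this many disks, the handful of interesting entries in iostat -En output like the above are easy to miss. A small filter, assuming the standard -En layout where the first line of each entry carries the soft/hard/transport counts, prints only the devices reporting hard or transport errors:

    # $7 = hard error count, $10 = transport error count
    iostat -En | awk '/Soft Errors:/ { if ($7 > 0 || $10 > 0) print $1, "hard=" $7, "transport=" $10 }'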
Hi Bruno,

I see some bugs associated with these messages (6694909) that point to an LSI firmware upgrade that causes these harmless errors to display.

According to the 6694909 comments, this issue is documented in the release notes.

As they are harmless, I wouldn't worry about them.

Maybe someone from the driver group can comment further.

Cindy

On 10/22/09 05:40, Bruno Sousa wrote:
> Hi all,
>
> Recently I upgraded from snv_118 to snv_125, and suddenly I started to
> see these messages in /var/adm/messages:
> ...
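Even if the messages are harmless, their frequency is worth tracking: a code that fires thousands of times under load tells a different story than an occasional one-off. A quick tally per IOCLogInfo code, using only standard tools against messages like those shown above:

    grep mpt_handle_event /var/adm/messages |
        sed -n 's/.*IOCLogInfo=\(0x[0-9a-f]*\).*/\1/p' |
        sort | uniq -c | sort -rn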
Cindy: How can I view the bug report you referenced? Standard methods show me the bug number is valid (6694909) but no content or notes. We are having similar messages appear with snv_118 with a busy LSI controller, especially during scrubbing, and I'd be interested to see what they mentioned in that report. Also, the LSI firmware updates for the LSISAS3081E (the controller we use) don't usually come with release notes indicating what has changed in each firmware revision, so I'm not sure where they got that idea from.
Adam Cheal wrote:
> Cindy: How can I view the bug report you referenced? Standard methods
> show me the bug number is valid (6694909) but no content or notes.
> ...

Hi Adam,
unfortunately, you can't see that bug from outside. The evaluation from LSI is very clear that this is a firmware issue rather than a driver issue, and it is claimed to be fixed in LSI BIOS v6.26.00 / FW 1.27.02 (aka "Phase 15").

cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
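To check whether a given box is already on the fixed firmware: lsiutil prints its port listing, including the Firmware Rev column, before the first prompt, so feeding it a single "0" (quit) is enough to read the running version without changing anything:

    echo 0 | lsiutil
    # Bruno's "Firmware Rev 011a0000" reads as 1.26.00.00 if the usual
    # major.minor.unit.dev byte encoding applies; Phase 15 is 1.27.02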
James: We are running Phase 16 on our LSISAS3801E's, and have also tried the recently released Phase 17, but it didn't help. All firmware NVRAM settings are default. Basically, when we put the disks behind this controller under load (e.g. scrubbing, or a recursive ls on a large ZFS filesystem) we get this series of log entries at random intervals:

scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0/sd@34,0 (sd49):
    incomplete read- retrying
scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110b00
scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31110b00
scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000
scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    Log info 0x31110b00 received for target 40.
    scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    Log info 0x31110b00 received for target 40.
    scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    Log info 0x31110b00 received for target 40.
    scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    Log info 0x31110b00 received for target 40.
    scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0/sd@2d,0 (sd42):
    incomplete read- retrying
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    Rev. 8 LSI, Inc. 1068E found.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    mpt0 supports power management.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
    mpt0: IOC Operational.

It seems to be timing out accessing a disk, retrying, giving up, and then doing a bus reset?

This is happening with random disks behind the controller, and on multiple systems with the same hardware config. We are running snv_118 right now and were hoping this was some magic mpt-related "bug" that was going to be fixed in snv_125, but it doesn't look like it. The LSISAS3801E is driving 2 x 23-disk JBODs which, albeit a dense solution, it should be able to handle. We are also using wide raidz2 vdevs (22 disks each, one per JBOD), which is admittedly slower performance-wise, but the goal here is density, not performance. I would have hoped that the system would just slow down if there was IO contention, not experience things like bus resets.

Your thoughts?
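One place to look for corroborating detail: the FMA error telemetry usually records more about these retries and resets than the console messages do, and listing the recent ereports shows which class (transport, device, or driver) the failures are being charged to:

    fmdump -e                 # one line per error event: time and class
    fmdump -eV | tail -100    # full payload of the most recent events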
Adam Cheal wrote:
> James: We are running Phase 16 on our LSISAS3801E's, and have also tried
> the recently released Phase 17, but it didn't help. All firmware NVRAM
> settings are default. Basically, when we put the disks behind this
> controller under load (e.g. scrubbing, or a recursive ls on a large ZFS
> filesystem) we get this series of log entries at random intervals:
> ...
> It seems to be timing out accessing a disk, retrying, giving up, and then
> doing a bus reset?
> ...
> Your thoughts?

ugh. New bug time - bugs.opensolaris.org, please select Solaris / kernel / driver-mpt. In addition to the error messages and description of when you see it, please provide output from

cfgadm -lav
prtconf -v

I'll see that it gets moved to the correct group asap.

Cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
I've filed the bug, but was unable to include the "prtconf -v" output, as the comments field only accepted 15000 chars total. Let me know if there is anything else I can provide or do to help figure this problem out, as it is essentially preventing us from doing any kind of heavy IO to these pools, including scrubbing.
On 10/22/09 4:07 PM, James C. McPherson wrote:
> Adam Cheal wrote:
>> It seems to be timing out accessing a disk, retrying, giving up, and then
>> doing a bus reset?
> ...
> ugh. New bug time - bugs.opensolaris.org, please select
> Solaris / kernel / driver-mpt. In addition to the error
> messages and description of when you see it, please provide
> output from
>
> cfgadm -lav
> prtconf -v
>
> I'll see that it gets moved to the correct group asap.

FYI this is very similar to the behaviour I was seeing with my directly attached SATA disks on snv_118 (see the list archives for my original messages). I have not yet seen the error since I replaced my Hitachi 500 GB disks with Seagate 1.5TB disks, so it could very well have been some unfortunate LSI firmware / Hitachi drive firmware interaction.

carson:gandalf 0 $ gzcat /var/adm/messages.2.gz | ggrep -4 mpt | tail -9
Oct  8 00:44:17 gandalf.taltos.org scsi: [ID 365881 kern.notice] /pci@0,0/pci8086,27d0@1c/pci1000,3140@0 (mpt0):
Oct  8 00:44:17 gandalf.taltos.org      Log info 0x31130000 received for target 1.
Oct  8 00:44:17 gandalf.taltos.org      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Oct  8 00:44:17 gandalf.taltos.org scsi: [ID 365881 kern.notice] /pci@0,0/pci8086,27d0@1c/pci1000,3140@0 (mpt0):
Oct  8 00:44:17 gandalf.taltos.org      Log info 0x31130000 received for target 1.
Oct  8 00:44:17 gandalf.taltos.org      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Oct  8 00:44:17 gandalf.taltos.org scsi: [ID 365881 kern.notice] /pci@0,0/pci8086,27d0@1c/pci1000,3140@0 (mpt0):
Oct  8 00:44:17 gandalf.taltos.org      Log info 0x31130000 received for target 1.
Oct  8 00:44:17 gandalf.taltos.org      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc

carson:gandalf 1 $ gzcat /var/adm/messages.2.gz | sed -ne 's,^.*\(Log info\),\1,p' | sort -u
Log info 0x31110b00 received for target 7.
Log info 0x31130000 received for target 0.
Log info 0x31130000 received for target 1.
Log info 0x31130000 received for target 2.
Log info 0x31130000 received for target 3.
Log info 0x31130000 received for target 4.
Log info 0x31130000 received for target 6.
Log info 0x31130000 received for target 7.
Log info 0x31140000 received for target 0.
Log info 0x31140000 received for target 1.
Log info 0x31140000 received for target 2.
Log info 0x31140000 received for target 3.
Log info 0x31140000 received for target 4.
Log info 0x31140000 received for target 6.
Log info 0x31140000 received for target 7.

carson:gandalf 0 $ gzcat /var/adm/messages.2.gz | sed -ne 's,^.*\(scsi_status\),\1,p' | sort -u
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

-- Carson
Hi Cindy,

I have a couple of questions about this issue:

1. I have exactly the same LSI controller in another server running OpenSolaris snv_101b, and so far no errors like these have been seen on that system.
2. Up to snv_118 I hadn't seen any problems; they only appear now with snv_125.
3. Isn't the Sun StorageTek SAS HBA an LSI OEM? If so, is it possible to know what firmware version that HBA uses?

Thank you,
Bruno

Cindy Swearingen wrote:
> Hi Bruno,
>
> I see some bugs associated with these messages (6694909) that point to
> an LSI firmware upgrade that causes these harmless errors to display.
>
> According to the 6694909 comments, this issue is documented in the
> release notes.
>
> As they are harmless, I wouldn't worry about them.
>
> Maybe someone from the driver group can comment further.
>
> Cindy
> ...
Hi Adam,

How many disks and zpools/zfs's do you have behind that LSI? I have a system with 22 disks and 4 zpools with around 30 zfs's, and so far it works like a charm, even under heavy load. The OpenSolaris release is snv_101b.

Bruno

Adam Cheal wrote:
> Cindy: How can I view the bug report you referenced? Standard methods
> show me the bug number is valid (6694909) but no content or notes.
> ...
Our config is:

OpenSolaris snv_118 x64
1 x LSISAS3801E controller
2 x 23-disk JBOD (fully populated, 1TB 7.2k SATA drives)

Each of the two external ports on the LSI connects to a 23-disk JBOD. ZFS-wise we use 1 zpool with 2 x 22-disk raidz2 vdevs (1 vdev per JBOD). Each zpool has one ZFS filesystem containing millions of files/directories. This data is served up via CIFS (kernel), which is why we went with snv_118 (the first release post-2009.06 that had a stable CIFS server). Like I mentioned to James, we know that the server won't be a star performance-wise, especially because of the wide vdevs, but it shouldn't hiccup under load either. A guaranteed way for us to cause these IO errors is to load up the zpool with about 30 TB of data (90% full) and then scrub it. Within 30 minutes we start to see the errors, which usually evolves into "failing" disks (because of excessive retry errors), which just makes things worse.
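One workaround worth trying on a configuration like this, sketched on the assumption that zfs_vdev_max_pending (default 35 on builds of this vintage) is the live kernel tunable involved, and not offered as a confirmed fix: lower the number of I/Os ZFS keeps queued per vdev, so a scrub presents the controller and expanders with less concurrency:

    echo zfs_vdev_max_pending/D | mdb -k        # read the current value
    echo zfs_vdev_max_pending/W0t10 | mdb -kw   # drop it to 10 until reboot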
What bug# is this under? I'm having what I believe is the same problem. Is it possible to just take the mpt driver from a prior build in the meantime?

The below is from the load the zpool scrub creates. This is on a Dell T7400 workstation with a 1068E OEMed from LSI. I updated the firmware to the newest available from Dell. The errors follow whichever of the 4 drives has the highest load. Streaming doesn't seem to trigger it, as I can push 60 MiB a second to a mirrored rpool all day; it's only when there are a lot of metadata operations.

Oct 23 06:25:44 systurbo5 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:25:44 systurbo5      Disconnected command timeout for Target 1
Oct 23 06:27:15 systurbo5 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:27:15 systurbo5      Disconnected command timeout for Target 1
Oct 23 06:28:26 systurbo5 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:28:26 systurbo5      Disconnected command timeout for Target 1
Oct 23 06:29:47 systurbo5 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:29:47 systurbo5      Disconnected command timeout for Target 1
Oct 23 06:30:58 systurbo5 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:30:58 systurbo5      Disconnected command timeout for Target 1
Oct 23 06:31:28 systurbo5 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:31:28 systurbo5      mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000
Oct 23 06:31:28 systurbo5 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:31:28 systurbo5      mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000
Oct 23 06:31:29 systurbo5 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:31:29 systurbo5      Log info 0x31123000 received for target 1.
Oct 23 06:31:29 systurbo5      scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Oct 23 06:31:29 systurbo5 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:31:29 systurbo5      Log info 0x31123000 received for target 1.
Oct 23 06:31:29 systurbo5      scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Oct 23 06:31:29 systurbo5 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:31:29 systurbo5      Log info 0x31123000 received for target 1.
Oct 23 06:31:29 systurbo5      scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Oct 23 06:31:29 systurbo5 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,4029@9/pci8086,3500@0/pci8086,3510@0/pci1028,21d@0 (mpt0):
Oct 23 06:31:29 systurbo5      Log info 0x31123000 received for target 1.
Oct 23 06:31:29 systurbo5      scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

On Fri, Oct 23, 2009 at 7:13 AM, Adam Cheal <acheal at pnimedia.com> wrote:
> Our config is:
> OpenSolaris snv_118 x64
> 1 x LSISAS3801E controller
> 2 x 23-disk JBOD (fully populated, 1TB 7.2k SATA drives)
> ...
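A quick way to see whether the timeouts really do follow the load around, as described above, is to tally them per target out of the log; a fixed target points at one disk or bay, while a moving one points at the path or controller:

    grep "Disconnected command timeout" /var/adm/messages |
        awk '{ print "target", $NF }' | sort | uniq -c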
Sorry, running snv_123, Indiana

On Fri, Oct 23, 2009 at 11:16 AM, Jeremy f <ryshask at gmail.com> wrote:
> What bug# is this under? I'm having what I believe is the same problem.
> Is it possible to just take the mpt driver from a prior build in the
> meantime?
> ...
> Oct 23 06:31:29 systurbo5   scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Just submitted the bug yesterday, on the advice of James, so I don't have a number you can refer to yet... the "change request" number is 6894775, if that helps or is directly related to the future bugid.

From what I've seen/read, this problem has been around for a while but only rears its ugly head under heavy IO with large filesets, probably related to large metadata sets as you spoke of. We are using snv_118 x64, but it seems to appear in snv_123 and snv_125 as well from what I read here.

We've tried installing SSDs to act as a read cache for the pool to reduce the metadata hits on the physical disks, and as a last-ditch effort we even tried switching to the "latest" LSI-supplied itmpt driver from 2007 (from reading http://enginesmith.wordpress.com/2009/08/28/ssd-faults-finally-resolved/) and disabling the mpt driver, but we ended up with the same timeout issues. In our case, the drives in the JBODs are all WD (model WD1002FBYS-18A6B0) 1TB 7.2k SATA drives.

In revisiting our architecture, we compared it to Sun's x4540 Thumper offering, which uses the same controller with similar (though apparently customized) firmware and 48 disks. The difference is that they use 6 x LSI1068E controllers, each of which has to deal with only 8 disks... obviously better for performance, but this architecture could be "hiding" the real IO issue by distributing the IO across so many controllers.
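[As an aside, adding SSDs as an L2ARC read cache like this is a single zpool operation. A minimal sketch, assuming a pool named pool002 and two SSDs at c8t2d0 and c8t3d0 (the names are illustrative, not necessarily the devices used here):

  # attach two SSDs as cache (L2ARC) devices to an existing pool
  zpool add pool002 cache c8t2d0 c8t3d0
  # confirm they appear under the "cache" heading
  zpool status pool002
]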
Adam Cheal wrote:
> Just submitted the bug yesterday, on the advice of James, so I don't have
> a number you can refer to yet... the "change request" number is 6894775,
> if that helps or is directly related to the future bugid.
> [snip]

Hi Adam,
I was watching the incoming queues all day yesterday for the bug, but missed seeing it; I'm not sure why. I've now moved the bug to the appropriate category, so it will get attention from the right people.

Thanks,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
Could the reason Sun's x4540 Thumper has 6 LSI controllers be some sort of "hidden" problem found by Sun where the HBA resets, and, due to time-to-market pressure, the "quick and dirty" solution was to spread the load over multiple HBAs instead of a software fix?

Just my 2 cents..

Bruno

Adam Cheal wrote:
> [snip]
> In revisiting our architecture, we compared it to Sun's x4540 Thumper
> offering, which uses the same controller with similar (though apparently
> customized) firmware and 48 disks. The difference is that they use 6 x
> LSI1068E controllers, each of which has to deal with only 8 disks...
> obviously better for performance, but this architecture could be "hiding"
> the real IO issue by distributing the IO across so many controllers.
Hi Cindy,

Thank you for the update, but it seems I can't see any information specific to that bug. I can only see bugs 6702538 and 6615564, and according to their history they were fixed quite some time ago. Can you by any chance share the information about bug 6694909?

Thank you,
Bruno

Cindy Swearingen wrote:
> Hi Bruno,
>
> I see some bugs associated with these messages (6694909) that point to
> an LSI firmware upgrade that causes these harmless errors to be displayed.
>
> According to the 6694909 comments, this issue is documented in the
> release notes.
>
> As they are harmless, I wouldn't worry about them.
>
> Maybe someone from the driver group can comment further.
>
> Cindy
>
> On 10/22/09 05:40, Bruno Sousa wrote:
> [snip]
On Oct 23, 2009, at 1:48 PM, Bruno Sousa wrote:
> Could the reason Sun's x4540 Thumper has 6 LSI controllers be some sort of
> "hidden" problem found by Sun where the HBA resets, and, due to
> time-to-market pressure, the "quick and dirty" solution was to spread the
> load over multiple HBAs instead of a software fix?

I don't think so. The X4540 has 48 disks -- 6 controllers at 8 disks/controller. This is the same configuration as the X4500, which used a Marvell controller. This decision leverages parts from the previous design.
 -- richard
On Fri, Oct 23, 2009 at 3:48 PM, Bruno Sousa <bsousa at epinfante.com> wrote:
> Could the reason Sun's x4540 Thumper has 6 LSI controllers be some sort of
> "hidden" problem found by Sun where the HBA resets, and, due to
> time-to-market pressure, the "quick and dirty" solution was to spread the
> load over multiple HBAs instead of a software fix?
>
> Just my 2 cents..
>
> Bruno

What else were you expecting them to do? According to LSI's website, the 1068E in an x8 configuration is an 8-port card:
http://www.lsi.com/DistributionSystem/AssetDocument/files/docs/marketing_docs/storage_stand_prod/SCG_LSISAS1068E_PB_040407.pdf

While they could've used expanders, that just creates one more component that can fail/have issues. Looking at the diagram, they've taken the absolute shortest I/O path possible, which is what I would hope to see/expect:
http://www.sun.com/servers/x64/x4540/server_architecture.pdf

One drive per channel, 6 channels total. I also wouldn't be surprised to find out that they found this to be the optimal configuration from a performance/throughput/IOPS perspective as well. Can't seem to find those numbers published by LSI.

--Tim
I don't think there was any intention on Sun's part to ignore the problem... obviously their target market wants a performance-oriented box, and the x4540 delivers that. Each 1068E controller chip supports 8 SAS PHY channels = 1 channel per drive = no contention for channels. The x4540 is a monster and performs like a dream with snv_118 (we have a few ourselves).

My issue is that implementing an archival-type solution demands a dense, simple storage platform that performs at a reasonable level, nothing more. Our design has the same controller chip (8 SAS PHY channels) driving 46 disks, so there is bound to be contention there, especially in high-load situations. I just need it to work and handle load gracefully, not time out and cause disk "failures"; at this point I can't even scrub the zpools to verify that the data we have on there is valid. From a hardware perspective, the 3801E card is spec'ed to handle our architecture; the OS just seems to fall over somewhere and not be able to throttle itself in certain intensive IO situations.

That said, I don't know whether to point the finger at LSI's firmware or the mpt driver/ZFS. Sun obviously has a good relationship with LSI, as their 1068E is the recommended SAS controller chip and is used in their own products. At least we've got a bug filed now, and we can hopefully follow this through to find out where the system breaks down.
On Fri, Oct 23, 2009 at 6:32 PM, Adam Cheal <acheal at pnimedia.com> wrote:
> [snip]
> My issue is that implementing an archival-type solution demands a dense,
> simple storage platform that performs at a reasonable level, nothing more.
> Our design has the same controller chip (8 SAS PHY channels) driving 46
> disks, so there is bound to be contention there, especially in high-load
> situations. I just need it to work and handle load gracefully, not time
> out and cause disk "failures"; at this point I can't even scrub the zpools
> to verify that the data we have on there is valid. From a hardware
> perspective, the 3801E card is spec'ed to handle our architecture; the OS
> just seems to fall over somewhere and not be able to throttle itself in
> certain intensive IO situations.
> [snip]

Have you checked in with LSI to verify the IOPS ability of the chip? Just because it supports having 46 drives attached to one ASIC doesn't mean it can actually service all 46 at once. You're talking (VERY conservatively) 2800 IOPS. Even ignoring that, I know for a fact that the chip can't handle the raw throughput numbers of 46 disks unless you've got some very severe raid overhead. That chip is good for roughly 2 GB/sec in each direction. 46 7200RPM drives can fairly easily push 4x that amount in streaming IO loads.

Long story short, it appears you've got a 5 lb bag and a 50 lb load...

--Tim
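[For context, a rough back-of-envelope check of those figures, assuming ~60 random IOPS and ~100 MB/s streaming per 7,200 RPM SATA drive -- typical estimates for drives of that era, not measurements from this system:

  46 drives x ~60 IOPS  = ~2,760 random IOPS  (the "2800 IOPS, very conservatively")
  46 drives x ~100 MB/s = ~4,600 MB/s streaming, i.e. well over the chip's
                          roughly 2 GB/s per direction
]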
LSI's sales literature on that card specs "128 devices", which I take with a few hearty grains of salt. I agree that with all 46 drives pumping out streamed data the controller would be overworked, BUT the drives will only deliver data as fast as the OS tells them to. Just because the speedometer says 200 mph max doesn't mean we should (or even can!) go that fast.

The IO-intensive operations that trigger our timeout issues are a small percentage of the actual normal IO we do to the box. Most of the time the solution happily serves up archived data, but when it comes time to scrub or do mass operations on the entire dataset, bad things happen. It seems a waste to architect a more expensive performance-oriented solution when you aren't going to use that performance the majority of the time. There is a balance between performance and functionality, but I still feel that we should be able to make this situation work.

Ideally, the OS could dynamically adapt to slower storage and throttle its IO requests accordingly. At the least, it could allow the user to specify some IO thresholds so we can "cage the beast" if need be. We've tried some manual tuning via kernel parameters to restrict the maximum queued operations per vdev, and also a "scrub"-related one (the specifics escape me), but it still manages to overload itself.
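[For the record, the per-vdev queue tunable being alluded to is presumably zfs_vdev_max_pending -- it comes up by name later in this thread -- and the scrub-related one may have been zfs_scrub_limit. A minimal /etc/system sketch with illustrative values (takes effect at the next reboot):

  * limit ZFS to 10 outstanding I/Os per vdev (the default at the time was 35)
  set zfs:zfs_vdev_max_pending = 10
  * cap concurrent scrub I/Os per vdev (assumption: this tunable exists in your build)
  set zfs:zfs_scrub_limit = 10
]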
On Oct 23, 2009, at 4:46 PM, Tim Cook wrote:
> On Fri, Oct 23, 2009 at 6:32 PM, Adam Cheal <acheal at pnimedia.com> wrote:
> [snip]
>
> Have you checked in with LSI to verify the IOPS ability of the chip? Just
> because it supports having 46 drives attached to one ASIC doesn't mean it
> can actually service all 46 at once. You're talking (VERY conservatively)
> 2800 IOPS.

Tim has a valid point. By default, ZFS will queue 35 commands per disk. For 46 disks that is 1,610 concurrent I/Os. Historically, it has proven to be relatively easy to crater performance or cause problems with very, very, very expensive arrays that are easily overrun by Solaris. As a result, it is not uncommon to see references to setting throttles, especially in older docs.

Fortunately, this is simple to test by reducing the number of I/Os ZFS will queue. See the Evil Tuning Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29

The mpt source is not open, so the mpt driver's reaction to 1,610 concurrent I/Os can only be guessed at from afar -- public LSI docs mention a figure of 511 concurrent I/Os for the SAS1068, but it is not clear to me that that is an explicit limit. If you have success with zfs_vdev_max_pending set to 10, then the mystery might be solved. Use iostat to observe the wait and actv columns, which show the number of transactions in the queues. JMCP?

NB: sometimes a driver will let the limit be configured. For example, to get high performance out of a high-end array attached to a qlc card, I've set the execution-throttle in /kernel/drv/qlc.conf to be more than two orders of magnitude greater than its default of 32. /kernel/drv/mpt*.conf does not seem to have a similar throttle.
 -- richard

> Even ignoring that, I know for a fact that the chip can't handle the raw
> throughput numbers of 46 disks unless you've got some very severe raid
> overhead. That chip is good for roughly 2 GB/sec in each direction. 46
> 7200RPM drives can fairly easily push 4x that amount in streaming IO loads.
> Long story short, it appears you've got a 5 lb bag and a 50 lb load...
>
> --Tim
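[As a footnote, the Evil Tuning Guide setting Richard references can also be tested on a live system with mdb, avoiding a reboot; a minimal sketch (the value 10 is illustrative):

  # set zfs_vdev_max_pending on the running kernel
  echo zfs_vdev_max_pending/W0t10 | mdb -kw
  # make the change persistent across reboots
  echo 'set zfs:zfs_vdev_max_pending = 10' >> /etc/system

Then re-run the workload and watch the actv column in iostat to confirm the per-disk queue now tops out near the new value.]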
On Fri, Oct 23, 2009 at 7:17 PM, Adam Cheal <acheal at pnimedia.com> wrote:
> [snip]
> Ideally, the OS could dynamically adapt to slower storage and throttle its
> IO requests accordingly. At the least, it could allow the user to specify
> some IO thresholds so we can "cage the beast" if need be. We've tried some
> manual tuning via kernel parameters to restrict the maximum queued
> operations per vdev, and also a "scrub"-related one (the specifics escape
> me), but it still manages to overload itself.

Where are you planning on queueing up those requests? The scrub I can understand wanting to throttle, but what about your user workload? Unless you're talking about EXTREMELY short bursts of I/O, what do you suggest the OS do? If you're sending 3,000 IOPS at the box from a workstation, where is that workload going to sit if you're only dumping 500 IOPS to disk? The only thing that will change is that your client will time out instead of your disks. I don't recall seeing what generates the I/O, but I do recall that it's backup. My assumption would be that it's something coming in over the network, in which case I'd say you're far, far better off throttling at the network stack.

--Tim
On Fri, Oct 23, 2009 at 7:17 PM, Richard Elling <richard.elling at gmail.com> wrote:
> Tim has a valid point. By default, ZFS will queue 35 commands per disk.
> For 46 disks that is 1,610 concurrent I/Os. Historically, it has proven
> to be relatively easy to crater performance or cause problems with very,
> very, very expensive arrays that are easily overrun by Solaris. As a
> result, it is not uncommon to see references to setting throttles,
> especially in older docs.
>
> Fortunately, this is simple to test by reducing the number of I/Os ZFS
> will queue. See the Evil Tuning Guide:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
>
> The mpt source is not open, so the mpt driver's reaction to 1,610
> concurrent I/Os can only be guessed at from afar -- public LSI docs
> mention a figure of 511 concurrent I/Os for the SAS1068, but it is not
> clear to me that that is an explicit limit. If you have success with
> zfs_vdev_max_pending set to 10, then the mystery might be solved. Use
> iostat to observe the wait and actv columns, which show the number of
> transactions in the queues. JMCP?
> [snip]

I believe there's a caveat here, though. That really only helps if the total I/O load is actually something the controller can handle. If the sustained I/O workload is still 1,600 concurrent I/Os, lowering the batch won't actually make any difference in the timeouts, will it? It would obviously eliminate burstiness (yes, I made that word up), but if the total sustained I/O load is greater than the ASIC can handle, it's still going to fall over and die with a queue of 10, correct?

--Tim
And therein lies the issue. The excessive load that causes the IO issues is almost always generated locally from a scrub or a local recursive "ls" used to warm up the SSD-based zpool cache with metadata. The regular network IO to the box is minimal and very read-centric; once we load the box up with archived data (which generally happens in a short amount of time), we simply serve it out as needed. As far as queueing goes, I would expect the system to queue bursts of IO in memory with appropriate timeouts, as required. These timeouts could be either manually or auto-magically adjusted to deal with the slower storage hardware. Obviously, sustained intense IO requests would eventually blow up the queue, so the goal here is to avoid creating those situations in the first place. We can throttle the network IO if needed; I need the OS to know its own local IO boundaries, though, and not attempt to overwork itself during scrubs etc.
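[As an aside, the cache-warming step mentioned above needs nothing exotic; a minimal sketch, assuming the pool's filesystem is mounted at /pool002 (the path is illustrative):

  # stat every file and directory so the metadata is pulled into ARC/L2ARC
  ls -lR /pool002 > /dev/null 2>&1
]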
On Oct 23, 2009, at 5:32 PM, Tim Cook wrote:
> On Fri, Oct 23, 2009 at 7:17 PM, Richard Elling <richard.elling at gmail.com> wrote:
> [snip]
>
> I believe there's a caveat here, though. That really only helps if the
> total I/O load is actually something the controller can handle. If the
> sustained I/O workload is still 1,600 concurrent I/Os, lowering the batch
> won't actually make any difference in the timeouts, will it? It would
> obviously eliminate burstiness (yes, I made that word up), but if the
> total sustained I/O load is greater than the ASIC can handle, it's still
> going to fall over and die with a queue of 10, correct?

Yes, but since they are disks, and I'm assuming HDDs here, there is no chance the disks will be faster than the host's ability to send I/Os ;-) iostat will show what the queues look like.
 -- richard
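[For anyone reproducing this, a minimal way to watch those queues (the 5-second interval is illustrative):

  # extended per-device statistics, logical device names, skip all-zero
  # lines; watch the wait (host-side queue) and actv (commands issued to
  # the device, awaiting completion) columns
  iostat -xnz 5
]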
Here is an example of the pool config we use:

# zpool status
  pool: pool002
 state: ONLINE
 scrub: scrub stopped after 0h1m with 0 errors on Fri Oct 23 23:07:52 2009
config:

        NAME         STATE     READ WRITE CKSUM
        pool002      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c9t18d0  ONLINE       0     0     0
            c9t17d0  ONLINE       0     0     0
            c9t55d0  ONLINE       0     0     0
            c9t13d0  ONLINE       0     0     0
            c9t15d0  ONLINE       0     0     0
            c9t16d0  ONLINE       0     0     0
            c9t11d0  ONLINE       0     0     0
            c9t12d0  ONLINE       0     0     0
            c9t14d0  ONLINE       0     0     0
            c9t9d0   ONLINE       0     0     0
            c9t8d0   ONLINE       0     0     0
            c9t10d0  ONLINE       0     0     0
            c9t29d0  ONLINE       0     0     0
            c9t28d0  ONLINE       0     0     0
            c9t27d0  ONLINE       0     0     0
            c9t23d0  ONLINE       0     0     0
            c9t25d0  ONLINE       0     0     0
            c9t26d0  ONLINE       0     0     0
            c9t21d0  ONLINE       0     0     0
            c9t22d0  ONLINE       0     0     0
            c9t24d0  ONLINE       0     0     0
            c9t19d0  ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c9t30d0  ONLINE       0     0     0
            c9t31d0  ONLINE       0     0     0
            c9t32d0  ONLINE       0     0     0
            c9t33d0  ONLINE       0     0     0
            c9t34d0  ONLINE       0     0     0
            c9t35d0  ONLINE       0     0     0
            c9t36d0  ONLINE       0     0     0
            c9t37d0  ONLINE       0     0     0
            c9t38d0  ONLINE       0     0     0
            c9t39d0  ONLINE       0     0     0
            c9t40d0  ONLINE       0     0     0
            c9t41d0  ONLINE       0     0     0
            c9t42d0  ONLINE       0     0     0
            c9t44d0  ONLINE       0     0     0
            c9t45d0  ONLINE       0     0     0
            c9t46d0  ONLINE       0     0     0
            c9t47d0  ONLINE       0     0     0
            c9t48d0  ONLINE       0     0     0
            c9t49d0  ONLINE       0     0     0
            c9t50d0  ONLINE       0     0     0
            c9t51d0  ONLINE       0     0     0
            c9t52d0  ONLINE       0     0     0
        cache
          c8t2d0     ONLINE       0     0     0
          c8t3d0     ONLINE       0     0     0
        spares
          c9t20d0    AVAIL
          c9t43d0    AVAIL

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s0  ONLINE       0     0     0
            c8t1d0s0  ONLINE       0     0     0

errors: No known data errors

...and here is a snapshot of the system using "iostat -indexC 5" during a scrub of "pool002" (c8 is the onboard AHCI controller, c9 is the LSI SAS 3801E):

                             extended device statistics       ---- errors ---
    r/s    w/s     kr/s  kw/s wait  actv wsvc_t asvc_t  %w   %b s/w h/w trn tot device
    0.0    0.0      0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8
    0.0    0.0      0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t0d0
    0.0    0.0      0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t1d0
    0.0    0.0      0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t2d0
    0.0    0.0      0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t3d0
 8738.7    0.0 555346.1   0.0  0.1 345.0    0.0   39.5   0 3875   0   1   1   2 c9
  194.8    0.0  11936.9   0.0  0.0   7.9    0.0   40.3   0   87   0   0   0   0 c9t8d0
  194.6    0.0  12927.9   0.0  0.0   7.6    0.0   38.9   0   86   0   0   0   0 c9t9d0
  194.6    0.0  12622.6   0.0  0.0   8.1    0.0   41.7   0   90   0   0   0   0 c9t10d0
  201.6    0.0  13350.9   0.0  0.0   8.0    0.0   39.5   0   90   0   0   0   0 c9t11d0
  194.4    0.0  12902.3   0.0  0.0   7.8    0.0   40.1   0   88   0   0   0   0 c9t12d0
  194.6    0.0  12902.3   0.0  0.0   7.7    0.0   39.3   0   88   0   0   0   0 c9t13d0
  195.4    0.0  12479.0   0.0  0.0   8.5    0.0   43.4   0   92   0   0   0   0 c9t14d0
  197.6    0.0  13107.4   0.0  0.0   8.1    0.0   41.0   0   92   0   0   0   0 c9t15d0
  198.8    0.0  12918.1   0.0  0.0   8.2    0.0   41.4   0   92   0   0   0   0 c9t16d0
  201.0    0.0  13350.3   0.0  0.0   8.1    0.0   40.4   0   91   0   0   0   0 c9t17d0
  201.2    0.0  13325.0   0.0  0.0   7.8    0.0   38.5   0   88   0   0   0   0 c9t18d0
  200.6    0.0  13021.5   0.0  0.0   8.2    0.0   40.7   0   91   0   0   0   0 c9t19d0
    0.0    0.0      0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t20d0
  196.6    0.0  12991.9   0.0  0.0   7.6    0.0   38.8   0   85   0   0   0   0 c9t21d0
  196.4    0.0  11499.3   0.0  0.0   8.0    0.0   40.5   0   89   0   0   0   0 c9t22d0
  197.6    0.0  13030.3   0.0  0.0   8.0    0.0   40.3   0   90   0   0   0   0 c9t23d0
  198.4    0.0  11535.8   0.0  0.0   7.8    0.0   39.3   0   87   0   0   0   0 c9t24d0
  202.2    0.0  13096.3   0.0  0.0   7.9    0.0   39.3   0   89   0   0   0   0 c9t25d0
  193.6    0.0  12457.4   0.0  0.0   8.3    0.0   42.8   0   90   0   0   0   0 c9t26d0
  194.0    0.0  12799.9   0.0  0.0   8.2    0.0   42.1   0   91   0   0   0   0 c9t27d0
  193.0    0.0  12748.8   0.0  0.0   7.9    0.0   41.0   0   88   0   0   0   0 c9t28d0
  194.6    0.0  12863.9   0.0  0.0   7.9    0.0   40.6   0   89   0   0   0   0 c9t29d0
  199.8    0.0  12849.1   0.0  0.0   7.8    0.0   39.0   0   87   0   0   0   0 c9t30d0
  205.0    0.0  13631.9   0.0  0.0   7.8    0.0   38.2   0   88   0   0   0   0 c9t31d0
  204.0    0.0  11674.3   0.0  0.0   7.9    0.0   38.6   0   88   0   0   0   0 c9t32d0
  204.2    0.0  11339.9   0.0  0.0   8.1    0.0   39.7   0   89   0   0   0   0 c9t33d0
  204.8    0.0  11569.7   0.0  0.0   7.7    0.0   37.7   0   86   0   0   0   0 c9t34d0
  205.2    0.0  11268.7   0.0  0.0   7.9    0.0   38.6   0   88   0   0   0   0 c9t35d0
  198.4    0.0  12814.9   0.0  0.0   7.8    0.0   39.5   0   88   0   0   0   0 c9t36d0
  200.4    0.0  13222.3   0.0  0.0   7.9    0.0   39.2   0   88   0   0   0   0 c9t37d0
  200.2    0.0  12324.5   0.0  0.0   7.4    0.0   37.1   0   85   0   0   0   0 c9t38d0
  203.0    0.0  11928.8   0.0  0.0   7.7    0.0   37.7   0   88   0   0   0   0 c9t39d0
  196.2    0.0  12966.3   0.0  0.0   7.5    0.0   38.0   0   84   0   0   0   0 c9t40d0
  195.2    0.0  11544.8   0.0  0.0   7.9    0.0   40.5   0   89   0   0   0   0 c9t41d0
  199.2    0.0  12601.8   0.0  0.0   7.8    0.0   38.9   0   88   0   0   0   0 c9t42d0
    0.0    0.0      0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t43d0
  194.4    0.0  12940.7   0.0  0.0   7.6    0.0   39.2   0   86   0   0   0   0 c9t44d0
  198.2    0.0  13120.6   0.0  0.0   7.5    0.0   38.1   0   86   0   0   0   0 c9t45d0
  201.2    0.0  11713.6   0.0  0.0   7.8    0.0   39.0   0   89   0   0   0   0 c9t46d0
  197.8    0.0  13196.7   0.0  0.0   7.4    0.0   37.4   0   85   0   0   0   0 c9t47d0
  197.4    0.0  13094.3   0.0  0.0   7.6    0.0   38.6   0   87   0   0   0   0 c9t48d0
  195.8    0.0  13017.5   0.0  0.0   7.5    0.0   38.4   0   85   0   1   1   2 c9t49d0
  205.0    0.0  11384.4   0.0  0.0   8.0    0.0   39.0   0   89   0   0   0   0 c9t50d0
  200.6    0.0  13286.6   0.0  0.0   7.5    0.0   37.2   0   85   0   0   0   0 c9t51d0
  200.6    0.0  12931.6   0.0  0.0   7.9    0.0   39.5   0   89   0   0   0   0 c9t52d0
  196.6    0.0  13055.9   0.0  0.0   7.5    0.0   38.3   0   87   0   0   0   0 c9t55d0

I had to abort the scrub shortly after this or we would start seeing the timeouts.
ok, see below...

On Oct 23, 2009, at 8:14 PM, Adam Cheal wrote:
> Here is an example of the pool config we use:
> [snip]
> ...and here is a snapshot of the system using "iostat -indexC 5" during a
> scrub of "pool002" (c8 is the onboard AHCI controller, c9 is the LSI SAS
> 3801E):
>
>                              extended device statistics       ---- errors ---
>     r/s    w/s     kr/s  kw/s wait  actv wsvc_t asvc_t  %w   %b s/w h/w trn tot device
> [snip]
>  8738.7    0.0 555346.1   0.0  0.1 345.0    0.0   39.5   0 3875   0   1   1   2 c9

You see 345 entries in the active queue. If the controller rolls over at 511 active entries, then it would explain why it would soon begin to have difficulty. Meanwhile, it is providing 8,738 IOPS and 555 MB/sec, which is quite respectable.

>   194.8    0.0  11936.9   0.0  0.0   7.9    0.0   40.3   0   87   0   0   0   0 c9t8d0

These disks are doing almost 200 read IOPS but are not 100% busy. The average I/O size is 66 KB, which is not bad -- lots of little I/Os could be worse -- but at only 11.9 MB/s you are nowhere near the media bandwidth. The average service time is 40.3 milliseconds, which is not super, but may be reflective of contention in the channel. So there is more capacity to accept I/O commands, but...

> [snip]
> I had to abort the scrub shortly after this or we would start seeing the
> timeouts.

yep. If you set the queue depth to 7, does it complete without timeouts?
 -- richard
How do you estimate the needed queue depth if one has, say, 64 to 128 disks sitting behind the LSI? Is it a bad idea to have a queue depth of 1?

Yours
Markus Kovero

________________________________________
From: zfs-discuss-bounces at opensolaris.org [zfs-discuss-bounces at opensolaris.org] on behalf of Richard Elling [richard.elling at gmail.com]
Sent: 24 October 2009 7:36
To: Adam Cheal
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] SNV_125 MPT warning in logfile
[snip]
The iostat I posted previously was from a system where we had already tuned the zfs:zfs_vdev_max_pending queue depth down to 10 (visible as the max of about 10 in actv per disk). I reset this value in /etc/system to 7, rebooted, and started a scrub. iostat output showed busier disks (%b is higher, which seemed odd) but a cap of about 7 queued items per disk, proving the tuning was effective. iostat at a high-water mark during the test looked like this:

                    extended device statistics
    r/s    w/s      kr/s   kw/s  wait  actv wsvc_t asvc_t  %w   %b device
    0.0    0.0       0.0    0.0   0.0   0.0    0.0    0.0   0    0 c8
    0.0    0.0       0.0    0.0   0.0   0.0    0.0    0.0   0    0 c8t0d0
    0.0    0.0       0.0    0.0   0.0   0.0    0.0    0.0   0    0 c8t1d0
    0.0    0.0       0.0    0.0   0.0   0.0    0.0    0.0   0    0 c8t2d0
    0.0    0.0       0.0    0.0   0.0   0.0    0.0    0.0   0    0 c8t3d0
 8344.5    0.0  359640.4    0.0   0.1 300.5    0.0   36.0   0 4362 c9
  190.0    0.0    6800.4    0.0   0.0   6.6    0.0   34.8   0   99 c9t8d0
  185.0    0.0    6917.1    0.0   0.0   6.1    0.0   32.9   0   94 c9t9d0
  187.0    0.0    6640.9    0.0   0.0   6.5    0.0   34.6   0   98 c9t10d0
  186.5    0.0    6543.4    0.0   0.0   7.0    0.0   37.5   0  100 c9t11d0
  180.5    0.0    7203.1    0.0   0.0   6.7    0.0   37.2   0  100 c9t12d0
  195.5    0.0    7352.4    0.0   0.0   7.0    0.0   35.8   0  100 c9t13d0
  188.0    0.0    6884.9    0.0   0.0   6.6    0.0   35.2   0   99 c9t14d0
  204.0    0.0    6990.1    0.0   0.0   7.0    0.0   34.3   0  100 c9t15d0
  199.0    0.0    7336.7    0.0   0.0   7.0    0.0   35.2   0  100 c9t16d0
  180.5    0.0    6837.9    0.0   0.0   7.0    0.0   38.8   0  100 c9t17d0
  198.0    0.0    7668.9    0.0   0.0   7.0    0.0   35.3   0  100 c9t18d0
  203.0    0.0    7983.2    0.0   0.0   7.0    0.0   34.5   0  100 c9t19d0
    0.0    0.0       0.0    0.0   0.0   0.0    0.0    0.0   0    0 c9t20d0
  195.5    0.0    7096.4    0.0   0.0   6.7    0.0   34.1   0   98 c9t21d0
  189.5    0.0    7757.2    0.0   0.0   6.4    0.0   33.9   0   97 c9t22d0
  195.5    0.0    7645.9    0.0   0.0   6.6    0.0   33.8   0   99 c9t23d0
  194.5    0.0    7925.9    0.0   0.0   7.0    0.0   36.0   0  100 c9t24d0
  188.5    0.0    6725.6    0.0   0.0   6.2    0.0   32.8   0   94 c9t25d0
  188.5    0.0    7199.6    0.0   0.0   6.5    0.0   34.6   0   98 c9t26d0
  196.0    0.0    6666.9    0.0   0.0   6.3    0.0   32.1   0   95 c9t27d0
  193.5    0.0    7455.4    0.0   0.0   6.2    0.0   32.0   0   95 c9t28d0
  189.0    0.0    7400.9    0.0   0.0   6.3    0.0   33.2   0   96 c9t29d0
  182.5    0.0    9397.0    0.0   0.0   7.0    0.0   38.3   0  100 c9t30d0
  192.5    0.0    9179.5    0.0   0.0   7.0    0.0   36.3   0  100 c9t31d0
  189.5    0.0    9431.8    0.0   0.0   7.0    0.0   36.9   0  100 c9t32d0
  187.5    0.0    9082.0    0.0   0.0   7.0    0.0   37.3   0  100 c9t33d0
  188.5    0.0    9368.8    0.0   0.0   7.0    0.0   37.1   0  100 c9t34d0
  180.5    0.0    9332.8    0.0   0.0   7.0    0.0   38.8   0  100 c9t35d0
  183.0    0.0    9690.3    0.0   0.0   7.0    0.0   38.2   0  100 c9t36d0
  186.0    0.0    9193.8    0.0   0.0   7.0    0.0   37.6   0  100 c9t37d0
  180.5    0.0    8233.4    0.0   0.0   7.0    0.0   38.8   0  100 c9t38d0
  175.5    0.0    9085.2    0.0   0.0   7.0    0.0   39.9   0  100 c9t39d0
  177.0    0.0    9340.0    0.0   0.0   7.0    0.0   39.5   0  100 c9t40d0
  175.5    0.0    8831.0    0.0   0.0   7.0    0.0   39.9   0  100 c9t41d0
  190.5    0.0    9177.8    0.0   0.0   7.0    0.0   36.7   0  100 c9t42d0
    0.0    0.0       0.0    0.0   0.0   0.0    0.0    0.0   0    0 c9t43d0
  196.0    0.0    9180.5    0.0   0.0   7.0    0.0   35.7   0  100 c9t44d0
  193.5    0.0    9496.8    0.0   0.0   7.0    0.0   36.2   0  100 c9t45d0
  187.0    0.0    8699.5    0.0   0.0   7.0    0.0   37.4   0  100 c9t46d0
  198.5    0.0    9277.0    0.0   0.0   7.0    0.0   35.2   0  100 c9t47d0
  185.5    0.0    9778.3    0.0   0.0   7.0    0.0   37.7   0  100 c9t48d0
  192.0    0.0    8384.2    0.0   0.0   7.0    0.0   36.4   0  100 c9t49d0
  198.5    0.0    8864.7    0.0   0.0   7.0    0.0   35.2   0  100 c9t50d0
  192.0    0.0    9369.8    0.0   0.0   7.0    0.0   36.4   0  100 c9t51d0
  182.5    0.0    8825.7    0.0   0.0   7.0    0.0   38.3   0  100 c9t52d0
  202.0    0.0    7387.9    0.0   0.0   7.0    0.0   34.6   0  100 c9t55d0

...and sure enough, about 20 minutes into it, I get this (bus reset?):

scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0/sd@34,0 (sd49):
        incomplete read- retrying
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0/sd@21,0 (sd30):
        incomplete read- retrying
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,65fa@4/pci1000,30a0@0/sd@1e,0 (sd27):
        incomplete read- retrying
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
        Rev. 8 LSI, Inc. 1068E found.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
        mpt0 supports power management.
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,65fa@4/pci1000,30a0@0 (mpt0):
        mpt0: IOC Operational.

During the "bus reset", iostat output looked like this:

                    extended device statistics                 ---- errors ---
  r/s  w/s  kr/s  kw/s wait  actv wsvc_t asvc_t  %w   %b s/w h/w trn tot device
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t0d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t1d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t2d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c8t3d0
  0.0  0.0   0.0   0.0  0.0  88.0    0.0    0.0   0 2200   0   3   0   3 c9
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t8d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t9d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t10d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t11d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t12d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t13d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t14d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t15d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t16d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t17d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t18d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t19d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t20d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t21d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t22d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t23d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t24d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t25d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t26d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t27d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t28d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t29d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   1   0   1 c9t30d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t31d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t32d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   1   0   1 c9t33d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t34d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t35d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t36d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t37d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t38d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t39d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t40d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t41d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t42d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t43d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t44d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t45d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t46d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t47d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t48d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t49d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t50d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   0   0   0 c9t51d0
  0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   1   0   1 c9t52d0
  0.0  0.0   0.0   0.0  0.0   0.0    0.0    0.0   0    0   0   0   0   0 c9t55d0

During our previous testing we had even tried setting this max_pending value down to 1, but we still hit the problem (albeit it took a little longer to hit it), and I couldn't find anything else I could set to throttle IO to the disk, hence the frustration.

If you hadn't seen this output, would you say that 7 was a "reasonable" value for that max_pending queue for our architecture, one that should give the LSI controller enough breathing room to operate in this situation? If so, I *should* be able to scrub the disks successfully (ZFS isn't to blame) and would therefore have to point the finger at the mpt driver, LSI firmware, or disk firmware instead.
-- This message posted from opensolaris.org
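For reference, the tuning described above is the standard zfs_vdev_max_pending mechanism from the ZFS tuning guides: a line in /etc/system (which takes effect at the next boot), or a poke at the live kernel with mdb. A minimal sketch, using the value 7 from the test above:

* /etc/system entry (applied at next boot)
set zfs:zfs_vdev_max_pending = 7

Or, on the running kernel without a reboot (0t marks a decimal value):

echo zfs_vdev_max_pending/W0t7 | mdb -kw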
We actually hit similar issues with LSI, but under normal workload rather than scrub; the result is the same, but it seems to choke on writes rather than reads, with suboptimal performance.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6891413

Anyway, we haven't experienced this _at all_ with the RE3 version of Western Digital disks. Issues seem to pop up with 750GB Seagate and 1TB WD Black-series drives; so far the 2TB WD Green drives seem unaffected too, so might it be related to the disks' firmware and how they chat with the LSI?

Also, we noticed more severe timeouts (even with RE3 and 2TB WD Green) if the disks are not forced into SATA1 mode; I believe this is a known issue with newer 2TB disks and some other disk controllers, and may be caused by bad cabling or connectivity. We have never witnessed this behaviour with SAS disks (Fujitsu, IBM, ...) either. All this happens with snv 118, 122, 123 and 125.

Yours
Markus Kovero

________________________________________
From: zfs-discuss-bounces at opensolaris.org [zfs-discuss-bounces at opensolaris.org] on behalf of Adam Cheal [acheal at pnimedia.com]
Sent: 24 October 2009 12:49
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] SNV_125 MPT warning in logfile

> [Adam's message quoted in full; trimmed]
On Sat, Oct 24, 2009 at 4:49 AM, Adam Cheal <acheal at pnimedia.com> wrote:
> [quoted text trimmed]

A little bit of searching Google says:
http://downloadmirror.intel.com/17968/eng/ESRT2_IR_readme.txt
On Sat, Oct 24, 2009 at 11:20 AM, Tim Cook <tim at cook.ms> wrote:
> [quoted text trimmed]

Huh, good old keyboard shortcuts firing off emails before I'm done with them. Anyway, in that link I found the following:

3. Updated - to provide NCQ queue depth of 32 (was 8) on 1064e and 1068e and 1078 internal-only controllers in IR and ESRT2 modes.

Then there's also this link from someone using a similar controller under FreeBSD:
http://www.nabble.com/mpt-errors-QUEUE-FULL-EVENT,-freebsd-7.0-on-dell-1950-td20019090.html

It would make total sense that you're having issues if the default queue depth for that controller is 8 per port. Even setting max_pending to 1 isn't going to fix your issue if you've got 46 drives on one channel/port. Honestly, I'm just taking shots in the dark though.

--Tim
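To put rough numbers on Tim's hypothesis, here is a small Python sketch (illustrative only; the 46-drive figure is Tim's, and the depth-8 reading of the readme is his interpretation):

# Aggregate commands the host may keep outstanding at the HBA,
# versus a small per-port firmware queue (per Tim's reading above).
drives = 46                     # disks behind the one controller
for max_pending in (10, 7, 1):  # the zfs_vdev_max_pending values tried
    outstanding = drives * max_pending
    print(max_pending, "->", outstanding, "commands in flight")
# Even at max_pending = 1 the controller can still see 46 concurrent
# commands, so a depth-8 queue would remain oversubscribed.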
more below...

On Oct 24, 2009, at 2:49 AM, Adam Cheal wrote:
> [quoted text trimmed; during the "bus reset" most c9 disks showed]
>
>   0.0  0.0   0.0   0.0  0.0   4.0    0.0    0.0   0  100   0   1   0   1 c9t30d0

OK, here we see 4 I/Os pending outside of the host. The host has sent them on and is waiting for them to return. This means they are getting dropped either at the disk or somewhere between the disk and the controller.

When this happens, the sd driver will time them out, try to clear the fault by reset, and retry. In other words, the resets you see are when the system tries to recover.

Since there are many disks with 4 stuck I/Os, I would lean towards a common cause. What do these disks have in common? Firmware? Do they share a SAS expander?
 -- richard

> [remainder of quoted text trimmed]
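For anyone wanting to see where that timeout/retry cycle is controlled: the per-command timeout the sd driver applies before starting recovery is a /etc/system tunable. A sketch only, assuming the standard Solaris sd_io_time tunable; the value shown is just the usual 60-second default made explicit, not a recommendation from this thread:

* /etc/system: seconds the sd driver waits on a command before timing
* it out and attempting the reset/retry recovery Richard describes
set sd:sd_io_time = 0x3c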
On 10/24/09 9:43 AM, Richard Elling wrote:
> OK, here we see 4 I/Os pending outside of the host. The host has
> sent them on and is waiting for them to return. This means they are
> getting dropped either at the disk or somewhere between the disk
> and the controller.
>
> When this happens, the sd driver will time them out, try to clear
> the fault by reset, and retry. In other words, the resets you see
> are when the system tries to recover.
>
> Since there are many disks with 4 stuck I/Os, I would lean towards
> a common cause. What do these disks have in common? Firmware?
> Do they share a SAS expander?

I saw this with my WD 500GB SATA disks (HDS725050KLA360) and LSI firmware 1.28.02.00 in IT mode, but I (almost?) always had exactly 1 "stuck" I/O. Note that my disks were one per channel, no expanders. I have _not_ seen it since replacing those disks. So my money is on a bug in the LSI firmware, the drive firmware, the drive controller hardware, or some combination thereof.

Note that LSI has released firmware 1.29.00.00. Sadly I cannot find any documentation on what has changed. Downloadable from LSI at
http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas3081e-r/index.html?remote=1&locale=EN

-- Carson
On Sat, Oct 24, 2009 at 12:30 PM, Carson Gaspar <carson at taltos.org> wrote:
> [quoted text trimmed]
> Note that LSI has released firmware 1.29.00.00. Sadly I cannot find any
> documentation on what has changed.

Here's the closest I could find, from some Intel release notes. It came from ESRT2_IR_readme.txt and does mention the 1068e chipset, as well as that firmware rev.

=============== Package Information ===============
FW and OpROM Package for Native SAS mode, IT/IR mode and Intel(R) Embedded Server RAID Technology II
Package version: 2009.10.06
FW Version = 01.29.00 (includes fixed firmware settings)
BIOS (non-RAID) Version = 06.28.00
BIOS (SW RAID) Version = 08.09041155
Supported RAID modes: 0, 1, 1E, 10, 10E and 5 (activation key AXXRAKSW5 required for RAID 5 support)
Supported Intel(R) Server Boards and Systems:
- S5000PSLSASR, S5000XVNSASR, S5000VSASASR, S5000VCLSASR, S5000VSFSASR
- SR1500ALSASR, SR1550ALSASR, SR2500ALLXR, S5000PALR (with SAS I/O Module)
- S5000PSLROMBR (SROMBSAS18E) without HW RAID activation key AXXRAK18E installed (native SAS or SW RAID modes only) - for HW RAID mode separate package is available
- NSC2U, TIGW1U
Supported Intel(R) RAID controller (adapters):
- SASMF8I, SASWT4I, SASUC8I
Intel(R) SAS Entry RAID Module AXX4SASMOD, when inserted in below Intel(R) Server Boards and Systems:
- S5520HC / S5520HCV, S5520SC, S5520UR, S5500WB

=============== Known Restrictions ===============
1. The sasflash versions within this package don't support ESRTII controllers.
2. The sasflash utility for Windows and Linux version within this package only support Intel(R) IT/IR RAID controllers. The sasflash utility for Windows and Linux version within this package don't support sasflash -o -e 6 command.
3. The sasflash utility for DOS version doesn't support the Intel(R) Server Boards and Systems due to BIOS limitation. The DOS version sasflash might still be supported on 3rd party server boards which don't have the BIOS limitation.
4. No PCI 3.0 support
5. No Foreign Configuration Resolution Support
6. No RAID migration Support
7. No mixed RAID mode support ever
8. No Stop On Error support

=============== Known Bugs ===============
(1) For Intel(R) chipset S5000P/S5000V/S5000X based server systems, please use the 32 bit, non-EBC version of sasflash which is SASFLASH_Ph17-1.22.00.00\sasflash_efi_bios32_rel\sasflash.efi, instead of the ebc version of sasflash which is in the top package directory and also in SASFLASH_Ph17-1.22.00.00\sasflash_efi_ebc_rel\sasflash.efi. The latter one may return a wrong sas address with a sasflash -list command in the listed systems.
(2) LED behavior does not match between SES and SGPIO for some conditions (documentation in process).
(3) When in EFI Optimized Boot mode, the task bar is not displayed in EFI_BSD after two volumes are created.
(4) If a system is rebooted while a volume rebuild is in progress, the rebuild will start over from the beginning.

=============== Fixes/Updates ===============
Version 2009.10.06
1. Fixed - MP2 HDD fault LED stays on after rebuild completes
2. Fixed - System hangs if drive hot-unplugged during stress
Version 2009.07.30
1. Fixed - SES over i2c for 106x products
2. Fixed - FW settings updated to support SES over i2c drive lights on FALSASMP2.
Version 2009.06.15
1. Fixed - SES over I2C issue for 1078IR.
2. Updated - 1068e fw to fix SES over I2C on MP2 bug.
3. Updated - to provide NCQ queue depth of 32 (was 8) on 1064e and 1068e and 1078 internal-only controllers in IR and ESRT2 modes.
4. Updated - Firmware to enable SES over I2C on AXX4SASMOD.
5. Updated - Settings to provide better LED indicators for SGPIO.
Version 2008.12.11
1. Fixed - Media can't boot from SATA DVD in some systems in Software RAID (ESRT2) mode.
2. Fixed - Incorrect RAID 5 ECC error handling in Ctrl+M
Version 2008.11.07
1. Added support for - Enable ICH10 support
2. Added support for - Software RAID5 to support ICH10R
3. Added support for - Single Drive RAID 0 (IS) Volume
4. Fixed - Resolved issue where user could not create a second volume immediately following the deletion of a second volume.
5. Fixed - Second hot spare status not shown when first hot spare is inactive/missing
Version 2008.09.22
1. Fixed - SWR: During hot PD removal and then quick reboot, not updating the DDF correctly.
Version 2008.06.16
1. Fixed - the issue with The LED functions are not working inside the OSes for SWR5
2. Fixed - the issue with (IR-Only) Volume rebuild fails after cold swap in IME volume with hotspare
3. Fixed - the issue with When a degraded RAID volume with a missing disk is rebooted, it may resync to a non-RAID disk.
4. Fixed - the issue with (IR-Only) Physical Disk firmware download commands are being rejected
5. Added support for (IR-Only) Allow the host to set the name of a RAID volume
Version 2007.12.05
1. Fixed incorrect system-dependent firmware settings. This includes fix for the issue with HDD not being detected in Slot1 with SR2500ALLX system.
2. Added support for more than one SAS expander
3. Added graceful handling of situation when more than 8 HDDs are installed (the limit of 8 drives still exists).
Version 2007.08.20
1. Updated readme files, added LSI adapters to package, added a new platform to supported list
Version 2007.07.22
1. Fixed an issue with 1.6GHz processors when recognizing the SW RAID 5 Activation Key
Version 2007.05.24
1. Fixed an issue preventing NCQ functionality on the new silicon spins
Version 2007.01.18
1. Added pass thru support for up to 2 SATA CD/DVD devices when the controller is in SW RAID mode - boot from CD/DVD-ROM with floppy emulation or hard-drive emulation not supported
The controller connects to two disk shelves (expanders), one per port on the card. If you look back in the thread, you'll see our zpool config has one vdev per shelf. All of the disks are Western Digital (model WD1002FBYS-18A6B0) 1TB 7.2K, firmware rev. 03.00C06. Without actually matching up the disks with "stuck" IOs, I am assuming they are all on the same vdev/shelf/controller port.

I communicated with LSI support directly regarding the v1.29 firmware update, and here's what they wrote back:

"I have checked with our development team on this one. There are no release notes available as the functionality of the coding itself has not changed. This was a minor cleanup and the firmware was assigned a new phase number for these. There were no defects or added functionality in going from the P16 firmware to the P17 firmware."

Also, regarding the NCQ depth on the drives: I used LSIUTIL in expert mode and used options 13/14 to dump the following settings (which are all default):

Multi-pathing:                 [0=Disabled, 1=Enabled, default is 0]
SATA Native Command Queuing:   [0=Disabled, 1=Enabled, default is 1]
SATA Write Caching:            [0=Disabled, 1=Enabled, default is 1]
SATA Maximum Queue Depth:      [0 to 255, default is 32]
Device Missing Report Delay:   [0 to 2047, default is 0]
Device Missing I/O Delay:      [0 to 255, default is 0]
Persistence:                   [0=Disabled, 1=Enabled, default is 1]
Physical mapping:              [0=None, 1=DirectAttach, 2=EnclosureSlot, default is 0]
-- This message posted from opensolaris.org
So, while we are working on resolving this issue with Sun, let me approach this from another perspective: what kind of controller/drive ratio would be the minimum recommended to support a functional OpenSolaris-based archival solution? Given the following:

- the vast majority of IO to the system is going to be "read" oriented, other than the initial "load" of the archive shares and possibly scrubs/re-silvering in the case of failed drives
- we currently have one LSISAS3801E with two external ports; each port connects to one 23-disk JBOD
- each JBOD can take two external SAS connections if we enable its "split-backplane" option, which would split the disk IO path between the two connectors (12 disks on one connector, 11 on the other); we do not currently have this enabled
- our current server platform has only 1 x PCIe-x8 slot available; we *could* look at changing this in the future, but I'd prefer to find a one-card solution if possible

Here is the math I did that shows the current IO situation (PLEASE correct this if I am mistaken, as I am somewhat "winging" it here and my head hurts), based on info from:

http://storageadvisors.adaptec.com/2006/07/26/sas-drive-performance/
http://en.wikipedia.org/wiki/PCI_Express
http://support.wdc.com/product/kb.asp?modelno=WD1002FBYS&x=9&y=8

WD1002FBYS 1TB SATA2 7200rpm drive specs:
Avg seek time = 8.9 ms
Avg latency = 4.2 ms
Max transfer speed = 112 MB/s
Avg transfer speed ~= 65 MB/s

"Random" IO scenario (theoretical numbers):
8.9 ms avg seek time + 4.2 ms avg latency = 13.1 ms avg access time
1/0.0131 = 76 IOPS/drive
22 (23 - 1 spare) drives x 76 IOPS/drive = 1672 IOPS/shelf
1672 IOPS/shelf x 2 = 3344 IOPS/controller
-or-
22 (23 - 1 spare) drives x 65 MB/s/drive = 1430 MB/s/shelf
1430 MB/s/shelf x 2 = 2860 MB/s/controller

Pure "streamed read" IO scenario (theoretical numbers):
0.0 avg seek time + 4.2 ms avg latency = 4.2 ms avg access time
1/0.0042 = 238 IOPS/drive
22 (23 - 1 spare) drives x 238 IOPS/drive = 5236 IOPS/shelf
5236 IOPS/shelf x 2 = 10472 IOPS/controller
-or-
22 (23 - 1 spare) drives x 112 MB/s/drive = 2464 MB/s/shelf
2464 MB/s/shelf x 2 = 4928 MB/s/controller

Max. bandwidth of a single SAS PHY = 270 MB/s per port (300 MB/s - overhead)
The LSISAS3801E has 2 x 4-port SAS connections, and each shelf gets a 4-port connection, so:
Max controller bandwidth/shelf = 4 x 270 MB/s = 1080 MB/s
Max controller bandwidth = 2 x 1080 MB/s = 2160 MB/s

Max. bandwidth of PCIe x8 interface = 2 GB/s
Typical sustained bandwidth of PCIe x8 interface (max - 5% overhead) = 1.9 GB/s

Summary: the current controller cannot handle the max IO load of even the random IO scenario (1430 MB/s per shelf needed; the controller can only handle 1080 MB/s per shelf). Also, the PCIe bus can't push more than 1.9 GB/s sustained over a single slot, so we are limited by the single card.

Solution: connecting 2 x 4-port SAS connectors to one shelf (i.e. enabling split mode) would get us 2160 MB/s/shelf. This would allow us to remove the controller as a bottleneck for all but the extreme cached-read scenario, but the PCIe bus would still throttle us to 1.9 GB/s per slot. So the controller could keep up with the shelves, but the PCIe bus would have to wait sometimes, which may (?) be a "healthier" situation than overwhelming the controller. To support two shelves per controller we could use an LSISAS31601E (4 x 4-port SAS connectors), but we would hit the PCIe bus limitation again. Moving to two (or more?) separate PCIe-x8 cards would be best, but that would require us to alter our server platform.

Whew. Thoughts? Comments? Suggestions?
-- This message posted from opensolaris.org
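For anyone who wants to rerun this math with different drive specs or shelf counts, here is the same arithmetic as a small Python sketch (the inputs are just the figures quoted in the post above; nothing new is assumed):

# Back-of-the-envelope model of the shelf/controller math above.
seek_ms, latency_ms = 8.9, 4.2     # WD1002FBYS datasheet averages
avg_mbs, max_mbs = 65, 112         # avg / max media transfer rate, MB/s
drives = 22                        # 23-disk shelf minus 1 hot spare
shelves = 2

random_iops = 1000 / (seek_ms + latency_ms)   # ~76 IOPS/drive
stream_iops = 1000 / latency_ms               # ~238 IOPS/drive

shelf_random_mbs = drives * avg_mbs           # 1430 MB/s demand per shelf
shelf_stream_mbs = drives * max_mbs           # 2464 MB/s demand per shelf

phy_mbs = 270                                 # 3 Gb/s SAS lane minus overhead
wide_port_mbs = 4 * phy_mbs                   # 1080 MB/s per x4 connector
pcie_x8_mbs = 2000 * 0.95                     # ~1900 MB/s sustained

print(f"random demand/shelf: {shelf_random_mbs} MB/s vs {wide_port_mbs} MB/s per x4 port")
print(f"total random demand: {shelves * shelf_random_mbs} MB/s vs {pcie_x8_mbs:.0f} MB/s PCIe")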
I'm having similar issues with two AOC-USAS-L8i Supermicro 1068e cards, mpt2 and mpt3, running 1.26.00.00 IT firmware. It seems to only affect a specific revision of disk (???):

sd67 Soft Errors: 0 Hard Errors: 127 Transport Errors: 3416
Vendor: ATA  Product: WDC WD10EACS-00D  Revision: 1A01  Serial No:
Size: 1000.20GB <1000204886016 bytes>

sd58 Soft Errors: 0 Hard Errors: 83 Transport Errors: 2087
Vendor: ATA  Product: WDC WD10EACS-00D  Revision: 1A01  Serial No:
Size: 1000.20GB <1000204886016 bytes>

There are 8 other disks on the two controllers:
6 x WDC WD10EACS-00Z Revision: 1B01 (no errors)
2 x SAMSUNG HD103UJ Revision: 1113 (no errors)

The two EACS-00D disks are in separate enclosures with new SAS->SATA fanout cables.

Example error messages:

Oct 27 14:26:05 fleet scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1002,5978@2/pci15d9,a580@0 (mpt2):
Oct 27 14:26:05 fleet   wwn for target has changed
Oct 27 14:25:56 fleet scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1002,5979@3/pci15d9,a580@0 (mpt3):
Oct 27 14:25:56 fleet   wwn for target has changed
Oct 27 14:25:57 fleet scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci1002,5978@2/pci15d9,a580@0 (mpt2):
Oct 27 14:25:57 fleet   mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110d00
Oct 27 14:25:48 fleet scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci1002,5979@3/pci15d9,a580@0 (mpt3):
Oct 27 14:25:48 fleet   mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110d00
Oct 27 14:26:01 fleet scsi: [ID 365881 kern.info] /pci@0,0/pci1002,5978@2/pci15d9,a580@0 (mpt2):
Oct 27 14:26:01 fleet   Log info 0x31110d00 received for target 1.
Oct 27 14:26:01 fleet   scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Oct 27 14:25:51 fleet scsi: [ID 365881 kern.info] /pci@0,0/pci1002,5979@3/pci15d9,a580@0 (mpt3):
Oct 27 14:25:51 fleet   Log info 0x31120403 received for target 2.
Oct 27 14:25:51 fleet   scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

On 22/10/2009, at 10:40 PM, Bruno Sousa wrote:
> [original message quoted in full; trimmed]
I am also running 2 of the Supermicro cards. I just upgraded to b126 and it seems improved. I am running a large file copy locally and get these warnings in the dmesg log; when they appear, I/O seems to stall for about 60 sec. It comes back up fine, but it's very annoying. Any hints?

I have 4 disks per controller right now: different brands, sizes, everything. New SATA fanout cables and no expanders. The drives on mpt0 and mpt1 are completely different, 4 x 400GB Seagate drives and 4 x 1.5TB Samsung drives, and I get the problem from both controllers. I didn't notice this until about b124. I can reproduce it with rsync copying files locally between ZFS filesystems, even with --bwlimit=10000 (10 MB/sec); keeping the limit low does seem to help.

---------------
Oct 31 23:05:32 nas scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,778@10/pci10de,5b1@0/pci10de,5b1@3/pci15d9,a580@0 (mpt1):
Oct 31 23:05:32 nas     Disconnected command timeout for Target 7
Oct 31 23:09:42 nas scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,778@10/pci10de,5b1@0/pci10de,5b1@2/pci15d9,a580@0 (mpt0):
Oct 31 23:09:42 nas     Disconnected command timeout for Target 1
Oct 31 23:16:23 nas scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,778@10/pci10de,5b1@0/pci10de,5b1@2/pci15d9,a580@0 (mpt0):
Oct 31 23:16:23 nas     Disconnected command timeout for Target 3
Oct 31 23:18:43 nas scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,778@10/pci10de,5b1@0/pci10de,5b1@3/pci15d9,a580@0 (mpt1):
Oct 31 23:18:43 nas     Disconnected command timeout for Target 6
Oct 31 23:27:24 nas scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,778@10/pci10de,5b1@0/pci10de,5b1@3/pci15d9,a580@0 (mpt1):
Oct 31 23:27:24 nas     Disconnected command timeout for Target 7
-- This message posted from opensolaris.org
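For anyone else trying to reproduce this, the rsync invocation would look something like the sketch below (the dataset paths are hypothetical; note that rsync's --bwlimit is in KB/s, so 10000 is roughly 10 MB/s):

# local copy between two ZFS filesystems, throttled to ~10 MB/s
rsync -a --bwlimit=10000 /tank/source/ /tank/destination/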
We see the same issue on an x4540 Thor system with 500G disks; lots of:

Nov 3 16:41:46 ....uva.nl scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Nov 3 16:41:46 encore.science.uva.nl     Disconnected command timeout for Target 7

This system is running nv125 XvM. It seems to occur more when we are using VMs, which of course causes very long interruptions on the VMs as well...
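A quick way to see which targets are being hit and how often, assuming the warnings land in the usual /var/adm/messages location:

    # Count "Disconnected command timeout" warnings per target
    grep "Disconnected command timeout" /var/adm/messages |
        sed 's/.*Target /Target /' | sort | uniq -c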
It's easy to reproduce for me under a VM. Create a zvol with "zfs create -V 500G tank/test", then connect it to a VM with "virsh attach-disk". Even just formatting it with ext4 from an Ubuntu 9.10 guest will cause the lockup for me. The errors seem to occur more frequently with large files. I can also reproduce it over NFS and from an OpenSolaris zone. I'm running nv126 XvM right now; I haven't tried it without XvM.
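Spelled out, that reproduction looks roughly like this; the domain name ubuntu910 and the guest device xvdb are hypothetical:

    # Create the zvol to hand to the guest
    zfs create -V 500G tank/test

    # Attach the zvol's block device to a running domain
    virsh attach-disk ubuntu910 /dev/zvol/dsk/tank/test xvdb

    # Inside the guest, formatting the new disk is enough to trigger the stall:
    #   mkfs.ext4 /dev/xvdb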
> I'm running nv126 XvM right now. I haven't tried it
> without XvM.

Without XvM we do not see these issues. We're running the VMs through NFS now (using ESXi)...
> > I'm running nv126 XvM right now. I haven't tried it
> > without XvM.
>
> Without XvM we do not see these issues. We're running
> the VMs through NFS now (using ESXi)...

Interesting. It sounds like it might be an XvM-specific bug. I'm glad I mentioned that in my bug report to Sun; hopefully they can duplicate it. I'd like to stick with XvM, as I've spent a fair amount of time getting things working well under it.

How did your migration to ESXi go? Are you using it on the same hardware, or did you just switch that server to an NFS server and run the VMs on another box?
Travis Tabbal wrote:
>>> I'm running nv126 XvM right now. I haven't tried it
>>> without XvM.
>>
>> Without XvM we do not see these issues. We're running
>> the VMs through NFS now (using ESXi)...
>
> Interesting. It sounds like it might be an XvM-specific bug. I'm glad I mentioned that in my bug report to Sun; hopefully they can duplicate it. I'd like to stick with XvM, as I've spent a fair amount of time getting things working well under it.
>
> How did your migration to ESXi go? Are you using it on the same hardware, or did you just switch that server to an NFS server and run the VMs on another box?

Hi Travis,
your bug showed up - it's 6900767. Since bugs.opensolaris.org isn't a "live" system, you won't be able to see it at
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6900767
until tomorrow.

cheers,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
> How did your migration to ESXi go? Are you using it on the same hardware, or did you just switch that server to an NFS server and run the VMs on another box?

The latter; we run these VMs over NFS anyway and had ESXi boxes under test already, and we were already separating "data" exports from "VM" exports. We use an in-house developed configuration management/bare-metal system which allows us to install new machines pretty easily, so in this case we just provisioned the ESXi VMs to new "VM" exports on the Thor whilst re-using the data exports as they were... It works pretty well, although the Sun x1027A 10G NICs aren't yet supported under ESXi 4...
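A minimal sketch of that kind of data/VM export split, with hypothetical dataset names and an assumed ESXi subnet of 10.0.0.0/24:

    # Separate exports for VM images and for bulk data
    zfs create -o sharenfs=on tank/vm
    zfs create -o sharenfs=on tank/data

    # ESXi needs root access to its NFS datastore; scope it to the ESXi subnet
    zfs set sharenfs='rw=@10.0.0.0/24,root=@10.0.0.0/24' tank/vm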
> The latter; we run these VMs over NFS anyway and had
> ESXi boxes under test already, and we were already
> separating "data" exports from "VM" exports. We use
> an in-house developed configuration management/bare-metal
> system which allows us to install new machines
> pretty easily, so in this case we just provisioned the
> ESXi VMs to new "VM" exports on the Thor whilst
> re-using the data exports as they were...

Thanks for the info. Unfortunately, I need this box to do double duty and run the VMs as well. The hardware is capable; this issue with XvM and/or the mpt driver just needs to get fixed. Other than that, things are running great with this server.