"Antonio S. Cofiño"
2012-May-30 16:25 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Dear All,

This may not be the correct mailing list, but I'm having a ZFS issue when a disk is failing.

The system is a Supermicro motherboard X8DTH-6F in a 4U chassis (SC847E1-R1400LPB) plus an external SAS2 JBOD (SC847E16-RJBOD1). That gives a total of 4 backplanes (2x SAS + 2x SAS2), each connected to a different HBA (2x LSI 3081E-R (1068 chip) + 2x LSI SAS9200-8e (2008 chip)). The system has a total of 81 disks: 2x SAS (Seagate ST3146356SS) + 34 SATA3 (Hitachi HDS722020ALA330) + 45 SATA6 (Hitachi HDS723020BLA642).

The system runs OpenSolaris (snv_134) and normally works fine. All the SATA disks are part of the same pool, split into raidz2 vdevs of roughly 11 disks each.

The issue arises when one of the disks starts to fail, producing very long access times. After some time (minutes, though I'm not sure) all the disks connected to the same HBA start to report errors. This produces a general failure in ZFS, making the whole pool unavailable.

Once the originally failing disk is identified and removed, the pool resilvers without problems and all the spurious errors caused by the general failure are recovered.

My question is: is there any way to anticipate this "choking" situation when a disk is failing, so the general failure is avoided?

Any help or suggestion is welcome.

Regards
Antonio

--
Antonio S. Cofiño
Grupo de Meteorología de Santander
Dep. de Matemática Aplicada y Ciencias de la Computación
Universidad de Cantabria
Escuela de Caminos
Avenida de los Castros, 44
39005 Santander, Spain
Tel: (+34) 942 20 1731
Fax: (+34) 942 20 1703
http://www.meteo.unican.es
mailto:antonio.cofino at unican.es
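For readers hitting the same symptom, a few stock OpenSolaris commands can show whether a disk is already logging errors before the pool degrades. This is only an illustrative sketch; output and device names depend on the system.

    # zpool status -x       # report only pools that are not healthy
    # fmadm faulty          # resources FMA has already diagnosed as faulted
    # fmdump -e             # raw FMA error telemetry (ereports); a drive that is about to
    #                       # "choke" a bus often logs transport/timeout ereports first
    # iostat -En            # per-device soft/hard/transport error counters and serial numbers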
Richard Elling
2012-May-30 16:52 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
On May 30, 2012, at 9:25 AM, Antonio S. Cofiño wrote:

> Dear All,
>
> This may not be the correct mailing list, but I'm having a ZFS issue when a disk is failing.
> [...]
> My question is: is there any way to anticipate this "choking" situation when a disk is failing, so the general failure is avoided?

No.

> Any help or suggestion is welcome.

The best, proven solution is to not use SATA disks with SAS expanders. Since that is likely to be beyond your time and budget, consider upgrading to the latest HBA and expander firmware.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
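As an illustration of the firmware suggestion, the running HBA firmware can be checked with the vendor tools before deciding whether an upgrade is due. The LSI utilities are separate downloads, not part of the OS, so their availability is an assumption here.

    # sas2flash -listall    # LSI SAS2 utility: firmware/BIOS version of each 2008-family HBA (9200-8e)
    # lsiutil               # LSI utility for the 1068 generation (3081E-R); menu-driven, reports
    #                       # the firmware phase of each adapter
    # raidctl -l            # Solaris' own view of the mpt controllers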
Jim Klimov
2012-May-30 16:52 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
2012-05-30 20:25, "Antonio S. Cofiño" wrote:
> Dear All,
>
> This may not be the correct mailing list, but I'm having a ZFS issue
> when a disk is failing.

I hope other users might help more on specific details, but while we're waiting for their answer - please search the list archives. Similar descriptions of the problem come up every few months, and it seems to be a fundamental flaw of (consumerish?) SATA drives with backplanes, leading to reset storms.

I remember the mechanism being something like this: a problematic disk is detected and the system tries to have it reset so that it might stop causing problems. The SATA controller either ignores the command or takes too long to complete/respond, so the system goes up the stack and next resets the backplane or ultimately the controller.

I am not qualified to comment on whether this issue is fundamental (i.e. in the SATA protocols) or incidental (cheap drives don't do advanced stuff, while expensive SATAs might be ok in this regard). There were discussions about using SATA-SAS interposers, but they might not fit mechanically, add latency and instability, and raise the system price to the point where native SAS disks would be better...

Now, waiting for experts to chime in on whatever I missed ;)

HTH,
//Jim Klimov
Paul Kraus
2012-May-30 17:47 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
On Wed, May 30, 2012 at 12:52 PM, Richard Elling <richard.elling at gmail.com> wrote:

> The best, proven solution is to not use SATA disks with SAS expanders.
> Since that is likely to be beyond your time and budget, consider upgrading
> to the latest HBA and expander firmware.

I recently had a problem with a "reset storm" on five J4400 loaded with SATA drives behind two Sun/Oracle dual-port SAS controllers (LSI based). I was told the following by Oracle Support:

1. It is a known issue.
2. Software updates in Solaris 10U10 address some of it (we are at 10U9).
3. Stopping the FMA service was recommended, as it is a trigger (as is a failing drive).
4. The problem happens with SAS as well as SATA drives, but is much less frequent.
5. Oracle is pushing for new FW for the HBA to address the issue.
6. A chain of three J4400 is much more likely to experience it than a chain of two (we have one chain of two and one chain of three; the problem occurred on the chain of three).

Not specifically applicable here, but probably related and might be of use to someone here.

--
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Assistant Technical Director, LoneStarCon 3 (http://lonestarcon3.org/)
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
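A sketch of how item 3 can be acted on from the command line while waiting for fixed firmware; disabling FMA gives up its diagnosis capabilities, so this is only a stop-gap.

    # fmadm config | grep disk-transport         # is the SMART-polling module loaded?
    # fmadm unload disk-transport                # stop only the periodic disk polling
    # svcadm disable -t svc:/system/fmd:default  # heavier hammer: stop fmd entirely (-t = until reboot)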
Weber, Markus
2012-May-31 16:04 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Antonio S. Cofiño wrote:
> [...]
> The system is a Supermicro motherboard X8DTH-6F in a 4U chassis
> (SC847E1-R1400LPB) plus an external SAS2 JBOD (SC847E16-RJBOD1).
> That gives a total of 4 backplanes (2x SAS + 2x SAS2), each connected
> to a different HBA (2x LSI 3081E-R (1068 chip) + 2x LSI SAS9200-8e (2008 chip)).
> The system has a total of 81 disks: 2x SAS (Seagate ST3146356SS)
> + 34 SATA3 (Hitachi HDS722020ALA330) + 45 SATA6 (Hitachi HDS723020BLA642).
>
> The issue arises when one of the disks starts to fail, producing very long
> access times. After some time (minutes, though I'm not sure) all the disks
> connected to the same HBA start to report errors. This produces a general
> failure in ZFS, making the whole pool unavailable.
> [...]

Have been there and gave up at the end [1]. Could reproduce it (even though it took a bit longer) under most Linux versions (incl. using the latest LSI drivers) and an LSI 3081E-R HBA.

Is it just mpt causing the errors, or also mpt_sas?

In a lab environment the LSI 9200 HBA behaved better - I/O only dropped shortly and then continued on the other disks without generating errors.

Had a lengthy Oracle case on this, but all the proposed "workarounds" did not work for me at all. They included (some also from other forums):

- disabling NCQ
- adding allow-bus-device-reset=0; to /kernel/drv/sd.conf
- set zfs:zfs_vdev_max_pending=1
- set mpt:mpt_enable_msi=0
- keeping usage below 90%
- no FMA services running; temporarily did fmadm unload disk-transport and stopped other disk-access tools (smartd?)
- changing retries-timeout via sd.conf for the disks; no success there, so ended up doing it via mdb

At the end I knew the bad sector of the "bad" disk, and by simply dd'ing this sector once or twice to /dev/zero I could easily bring down the system/pool without any load on the disk system.

General consensus from various people: don't use SATA drives on SAS backplanes. Some SATA drives might work better, but there seems to be no guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.

Markus

[1] Search for "What's wrong with LSI 3081 (1068) + expander + (bad) SATA disk?"

--
KPN International
Darmstädter Landstrasse 184 | 60598 Frankfurt | Germany
[T] +49 (0)69 96874-298 | [F] -289 | [M] +49 (0)178 5352346
[E] <Markus.Weber at kpn.DE> | [W] www.kpn.de
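For reference, two of the workarounds above in their /etc/system form, plus the kind of dd read that reproduced the hang. This is purely illustrative; the device name and sector number are placeholders.

    * /etc/system (takes effect after a reboot)
    set zfs:zfs_vdev_max_pending = 1
    set mpt:mpt_enable_msi = 0

    # Re-read a known-bad sector to trigger the drive's long internal error recovery
    # (cXtYdZ and BADSECTOR are placeholders)
    # dd if=/dev/rdsk/cXtYdZs0 of=/dev/null bs=512 iseek=BADSECTOR count=1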
"Antonio S. Cofiño"
2012-May-31 16:16 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Jim,

Thank you for the explanation. I have 'discovered' that this is a typical situation that makes the system unstable.

Just out of curiosity, this morning it happened again; you can check the log output below. This time it was an HBA with the LSI 1068E chip (mpt driver); the previous one was an LSI 2008 (mpt_sas driver). In this case ZFS 'discovered' the error and was able to self-heal, and the system is working smoothly.

Antonio

May 31 10:48:11 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:11 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:11 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:11 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:17 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:17 seal.macc.unican.es Log info 0x31111000 received for target 12.
May 31 10:48:17 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:20 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2):
May 31 10:48:20 seal.macc.unican.es SAS Discovery Error on port 0. 
DiscoveryStatus is DiscoveryStatus is |Unaddressable device found| May 31 10:48:22 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:22 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:22 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:22 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:27 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:27 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:27 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:27 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:48:28 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:28 seal.macc.unican.es Log info 0x31111000 received for target 12. May 31 10:48:28 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc May 31 10:48:31 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:31 seal.macc.unican.es SAS Discovery Error on port 0. 
DiscoveryStatus is DiscoveryStatus is |Unaddressable device found| May 31 10:48:34 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:34 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:34 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:34 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:38 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:38 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:48:38 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:38 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:48:38 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:38 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:48:38 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:38 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:48:40 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:40 seal.macc.unican.es Log info 0x31111000 received for target 12. May 31 10:48:40 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc May 31 10:48:43 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:43 seal.macc.unican.es SAS Discovery Error on port 0. 
DiscoveryStatus is DiscoveryStatus is |Unaddressable device found| May 31 10:48:45 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:45 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:45 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:45 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:49 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:49 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:48:49 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:49 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:48:49 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:49 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:48:49 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:49 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:48:51 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:51 seal.macc.unican.es Log info 0x31111000 received for target 12. May 31 10:48:51 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc May 31 10:48:54 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:54 seal.macc.unican.es SAS Discovery Error on port 0. DiscoveryStatus is DiscoveryStatus is |Unaddressable device found| May 31 10:48:56 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:56 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:56 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:56 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000 May 31 10:48:59 seal.macc.unican.es scsi: [ID 107833 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:48:59 seal.macc.unican.es Disconnected command timeout for Target 10 May 31 10:49:01 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:49:01 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:49:01 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es Log info 0x31140000 received for target 10. 
May 31 10:49:01 seal.macc.unican.es scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc May 31 10:49:01 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:49:01 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000 May 31 10:49:01 seal.macc.unican.es scsi: [ID 107833 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es passthrough command timeout May 31 10:49:01 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es Rev. 8 LSI, Inc. 1068E found. May 31 10:49:01 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:01 seal.macc.unican.es mpt2 supports power management. May 31 10:49:02 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:02 seal.macc.unican.es mpt2: IOC Operational. May 31 10:49:16 seal.macc.unican.es scsi: [ID 107833 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:49:16 seal.macc.unican.es Can only start 1 task management command at a time May 31 10:50:16 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:50:16 seal.macc.unican.es Rev. 8 LSI, Inc. 1068E found. May 31 10:50:16 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:50:16 seal.macc.unican.es mpt2 supports power management. May 31 10:50:16 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:50:16 seal.macc.unican.es mpt2: IOC Operational. May 31 10:50:47 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:50:47 seal.macc.unican.es Rev. 8 LSI, Inc. 1068E found. May 31 10:50:47 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:50:47 seal.macc.unican.es mpt2 supports power management. May 31 10:50:50 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:50:50 seal.macc.unican.es mpt2: IOC Operational. May 31 10:51:16 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:51:16 seal.macc.unican.es Rev. 8 LSI, Inc. 1068E found. May 31 10:51:16 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:51:16 seal.macc.unican.es mpt2 supports power management. May 31 10:51:20 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:51:20 seal.macc.unican.es mpt2: IOC Operational. 
May 31 10:52:46 seal.macc.unican.es scsi: [ID 107833 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:46 seal.macc.unican.es Disconnected command timeout for Target 11 May 31 10:52:47 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:47 seal.macc.unican.es Log info 0x31140000 received for target 11. May 31 10:52:47 seal.macc.unican.es scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc May 31 10:52:47 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:47 seal.macc.unican.es Log info 0x31130000 received for target 11. May 31 10:52:47 seal.macc.unican.es scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc May 31 10:52:47 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:47 seal.macc.unican.es Log info 0x31130000 received for target 11. May 31 10:52:47 seal.macc.unican.es scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc May 31 10:52:47 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:47 seal.macc.unican.es Log info 0x31130000 received for target 11. May 31 10:52:47 seal.macc.unican.es scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc May 31 10:52:47 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:47 seal.macc.unican.es Log info 0x31130000 received for target 11. May 31 10:52:47 seal.macc.unican.es scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc May 31 10:52:51 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:51 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:52:51 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:51 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000 May 31 10:52:53 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:53 seal.macc.unican.es Log info 0x31111000 received for target 11. May 31 10:52:53 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc May 31 10:52:56 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:52:56 seal.macc.unican.es SAS Discovery Error on port 0. DiscoveryStatus is DiscoveryStatus is |Unaddressable device found| May 31 10:53:37 seal.macc.unican.es scsi: [ID 107833 kern.warning] WARNING: /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:53:37 seal.macc.unican.es passthrough command timeout May 31 10:53:37 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:53:37 seal.macc.unican.es Rev. 8 LSI, Inc. 1068E found. May 31 10:53:37 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:53:37 seal.macc.unican.es mpt2 supports power management. May 31 10:53:37 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci at 7a,0/pci8086,3410 at 9/pci1000,3140 at 0 (mpt2): May 31 10:53:37 seal.macc.unican.es mpt2: IOC Operational. 
May 31 10:54:10 seal.macc.unican.es fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
May 31 10:54:10 seal.macc.unican.es EVENT-TIME: Thu May 31 10:54:09 CEST 2012
May 31 10:54:10 seal.macc.unican.es PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: seal.macc.unican.es
May 31 10:54:10 seal.macc.unican.es SOURCE: zfs-diagnosis, REV: 1.0
May 31 10:54:10 seal.macc.unican.es EVENT-ID: 5d33a13b-61e3-cf16-86a7-e9587d510170
May 31 10:54:10 seal.macc.unican.es DESC: The number of I/O errors associated with a ZFS device exceeded
May 31 10:54:10 seal.macc.unican.es acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.
May 31 10:54:10 seal.macc.unican.es AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
May 31 10:54:10 seal.macc.unican.es will be made to activate a hot spare if available.
May 31 10:54:10 seal.macc.unican.es IMPACT: Fault tolerance of the pool may be compromised.
May 31 10:54:10 seal.macc.unican.es REC-ACTION: Run 'zpool status -x' and replace the bad device.

--
Antonio S. Cofiño
Grupo de Meteorología de Santander
Dep. de Matemática Aplicada y Ciencias de la Computación
Universidad de Cantabria
Escuela de Caminos
Avenida de los Castros, 44
39005 Santander, Spain
Tel: (+34) 942 20 1731
Fax: (+34) 942 20 1703
http://www.meteo.unican.es
mailto:antonio.cofino at unican.es

On 30/05/2012 18:52, Jim Klimov wrote:
> 2012-05-30 20:25, "Antonio S. Cofiño" wrote:
>> Dear All,
>>
>> This may not be the correct mailing list, but I'm having a ZFS issue
>> when a disk is failing.
>
> I hope other users might help more on specific details, but while
> we're waiting for their answer - please search the list archives.
> Similar descriptions of the problem come up every few months, and
> it seems to be a fundamental flaw of (consumerish?) SATA drives
> with backplanes, leading to reset storms.
> [...]
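For anyone trying to map a "target 12" in messages like these back to a physical drive, a couple of stock commands are usually enough. This is illustrative only; controller numbers depend on the system.

    # format </dev/null     # prints each disk with its /pci@.../sd@... device path; match that
    #                       # against the controller path shown in the messages above
    # cfgadm -al            # lists which disks sit on which controller (cN) and their state
    # fmadm faulty          # confirms which device FMA finally faulted (the ZFS-8000-FD above)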
"Antonio S. Cofiño"
2012-May-31 16:45 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Markus,

After Jim's answer I have started to read about this well-known issue.

> Is it just mpt causing the errors, or also mpt_sas?

Both drivers are causing the reset storm (see my answer to Jim's e-mail).

> General consensus from various people: don't use SATA drives on SAS backplanes.
> Some SATA drives might work better, but there seems to be no guarantee.
> And even for SAS-SAS, try to avoid SAS1 backplanes.

In Paul Kraus's answer he mentions that Oracle Support says (among other things):

>> 4. The problem happens with SAS as well as SATA drives, but is much
>> less frequent.

That means that using SAS drives will reduce the probability of the issue, but no guarantee exists.

> General consensus from various people: don't use SATA drives on SAS backplanes.
> Some SATA drives might work better, but there seems to be no guarantee.
> And even for SAS-SAS, try to avoid SAS1 backplanes.

Maybe the 'general consensus' is right, but 'general consensus' also told me to use hardware-based RAID solutions. I started doing 'risky business' (as some vendors called it) using ZFS instead, and I have ended up discovering how robust ZFS is against this kind of protocol error.

From my completely naive point of view it looks more like an issue with the HBAs' firmware than an issue with the SATA drives.

With your answers I have done a lot of research that is helping me learn new things.

More comments and help are welcome (from some SAS expert?).

Antonio

--
Antonio S. Cofiño

On 31/05/2012 18:04, Weber, Markus wrote:
> Antonio S. Cofiño wrote:
>> [...]
>
> Have been there and gave up at the end [1]. Could reproduce it (even though
> it took a bit longer) under most Linux versions (incl. using the latest LSI
> drivers) and an LSI 3081E-R HBA.
>
> Is it just mpt causing the errors, or also mpt_sas?
> [...]
>
> [1] Search for "What's wrong with LSI 3081 (1068) + expander + (bad) SATA disk?"
Richard Elling
2012-May-31 18:06 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
On May 31, 2012, at 9:45 AM, Antonio S. Cofiño wrote:

>> Is it just mpt causing the errors, or also mpt_sas?
>
> Both drivers are causing the reset storm (see my answer to Jim's e-mail).

No. Resets are corrective actions that occur because of command timeouts. The cause is the command timeout.

> That means that using SAS drives will reduce the probability of the issue, but no guarantee exists.

I have seen broken SAS drives crush expanders/HBAs such that POST would not run. Obviously, at that point there is no OS running, so we can't blame the OS or drivers. I have a few SATA disks that have the same effect on motherboards. I call them the "drives of doom" :-)

> Maybe the 'general consensus' is right, but 'general consensus' also told me to use
> hardware-based RAID solutions. I started doing 'risky business' (as some vendors
> called it) using ZFS instead, and I have ended up discovering how robust ZFS is
> against this kind of protocol error.

"Hardware" RAID solutions are also susceptible to these failure modes.

> From my completely naive point of view it looks more like an issue with the HBAs'
> firmware than an issue with the SATA drives.

There are multiple contributors. But perhaps the most difficult to overcome is the fundamental difference between the SAS and SATA protocols. Let's just agree that SATA was not designed for network-like fabrics.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
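Since the resets are triggered by command timeouts, one hedged mitigation is to make a dying drive fail fast instead of stalling the bus for minutes, by shortening the sd command timeout and retry counts. The values below are examples only, not tested recommendations; overly aggressive settings can fail healthy but busy disks, and as Markus noted above, the sd.conf route did not work for him and he fell back to mdb.

    * /etc/system -- global sd command timeout (default 0x3c = 60 seconds)
    set sd:sd_io_time = 0x14

    # /kernel/drv/sd.conf -- per-drive retry tuning; the vendor/product string is a
    # placeholder and the exact sd-config-list syntax differs between releases
    sd-config-list = "HITACHI HDS72202", "retries-timeout:2";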
Hung-Sheng Tsao Ph.D.
2012-May-31 19:48 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Just FYI, this is from Intel:
http://www.intel.com/support/motherboards/server/sb/CS-031831.htm

Another observation: Oracle/Sun has moved away from SATA to SAS in its ZFS storage appliances.

If you want to go deeper, take a look at these presentations and the others on the site:
http://www.scsita.org/sas_library/tutorials/

Regards

On 5/31/2012 12:45 PM, "Antonio S. Cofiño" wrote:
> Markus,
>
> After Jim's answer I have started to read about this well-known issue.
>
>> Is it just mpt causing the errors, or also mpt_sas?
>
> Both drivers are causing the reset storm (see my answer to Jim's e-mail).
> [...]
Edward Ned Harvey
2012-Jun-01 11:54 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
> On Behalf Of Richard Elling
>
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
>> On Behalf Of "Antonio S. Cofiño"
>>
>> My question is: is there any way to anticipate this "choking" situation when a
>> disk is failing, to avoid the general failure?
>
> No.

Yes. But not necessarily with the setup that you are currently using - that is not quite clear from your original email.

If you have 4 HBAs, you want to arrange your RAID such that you could survive the complete loss of an entire HBA. This would mean you build your pool out of a bunch of 4-disk raidz vdevs, or perhaps a bunch of 8-disk raidz2 vdevs.

The whole problem you're facing is that some bad disk brings down the whole bus with it... Make your redundancy able to survive the loss of a bus.
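A hedged sketch of such a layout, with hypothetical device names where c1-c4 are the four HBAs: each raidz2 vdev takes exactly two disks from each controller, so the loss of any one HBA costs each vdev only two disks, which raidz2 survives.

    # Device names are placeholders; add one "raidz2 ..." group per 8 disks
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t0d0 c3t1d0 c4t0d0 c4t1d0 \
        raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c3t2d0 c3t3d0 c4t2d0 c4t3d0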
Richard Elling
2012-Jun-01 15:10 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
On Jun 1, 2012, at 4:54 AM, Edward Ned Harvey wrote:

>>> My question is: is there any way to anticipate this "choking" situation when a
>>> disk is failing, to avoid the general failure?
>>
>> No.
>
> Yes.

:-)

> But not necessarily with the setup that you are currently using - that is
> not quite clear from your original email.
>
> If you have 4 HBAs, you want to arrange your RAID such that you could
> survive the complete loss of an entire HBA. This would mean you build your
> pool out of a bunch of 4-disk raidz vdevs, or perhaps a bunch of 8-disk
> raidz2 vdevs.
>
> The whole problem you're facing is that some bad disk brings down the whole
> bus with it... Make your redundancy able to survive the loss of a bus.

We've had luck eliminating expanders from the design, too :-)

But this is also one of those cases where the failure results in a "wounded soldier" case -- not dead, but not able to keep fighting effectively. The result is a massive slowdown of the system that can best be described as a DoS condition.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
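One illustrative way to spot the "wounded soldier" before the whole pool slows to a crawl is to watch per-disk service times; the thresholds here are rules of thumb, not official guidance.

    # iostat -xnz 5         # a dying disk typically shows asvc_t of hundreds or thousands of
    #                       # milliseconds and %b near 100 while its peers stay low
    # fmdump -e | tail -20  # recent ereports usually name the offending device over and over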
Jeff Bacon
2012-Jun-02 03:32 UTC
[zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
> Have been there and gave up at the end [1]. Could reproduce it (even though
> it took a bit longer) under most Linux versions (incl. using the latest LSI
> drivers) and an LSI 3081E-R HBA.
>
> Is it just mpt causing the errors, or also mpt_sas?

This is anecdotal, but I would say that the LSI 1068 cards and the SASX28/SASX36 expander chips were far more prone to issues. I'm not saying it can't or doesn't happen on LSI 2008/SAS2X28-36 - I had a prod box the other day that hard-hung because a Constellation went south, and had to have someone go yank the disk - but it sure seems to happen a lot less with the newer kit and Phase 10+ firmware.

Got all sorts of mixes of SAS+SATA in large quantity, mostly pretty stable. Even having a single SASX28/36 expander in the mix seems to add to the problems somewhat, for reasons that are not entirely clear. I have a couple around still.

Hell, I have one box that is nothing _but_ 1068s and old 846E1s - 4 of 'em to be exact, 86 old 1TB cuda.12 drives - which pretty much just works. Granted it doesn't work that hard; it's archival material - but it runs and rarely errors. Still Sol10U9 even.

*throws up hands*
-bacon