My setup: A SuperMicro 24-drive chassis with Intel dual-processor motherboard,
three LSI SAS3081E controllers, and 24 SATA 2TB hard drives, divided into
three pools, each pool a single eight-disk RAID-Z2. (Boot is an SSD connected
to motherboard SATA.)

This morning I got a cheerful email from my monitoring script: "Zchecker has
discovered a problem on bigdawg." The full output is below, but I have one
unavailable pool and two degraded pools, with all my problem disks connected
to controller c10. I have multiple spare controllers available.

First question -- is there an easy way to identify which controller is c10?

Second question -- what is the best way to handle replacement (of either the
bad controller, or of all three controllers if I can't identify the bad one)?
I was thinking that I should be able to shut the server down, remove the
controller(s), install the replacement controller(s), check that all the
drives are visible, run zpool clear for each pool, and then do another scrub
to verify the problem has been resolved. Does that sound like a good plan?

  pool: uberdisk1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 3h7m, 24.08% done, 9h52m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk1    UNAVAIL     55     0     0  insufficient replicas
          raidz2     UNAVAIL    112     0     0  insufficient replicas
            c9t0d0   ONLINE       0     0     0
            c9t1d0   ONLINE       0     0     0
            c9t2d0   ONLINE       0     0     0
            c10t0d0  UNAVAIL     43    30     0  experienced I/O failures
            c10t1d0  REMOVED      0     0     0
            c10t2d0  ONLINE      74     0     0
            c11t1d0  ONLINE       0     0     0
            c11t2d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: uberdisk2
 state: DEGRADED
 scrub: scrub in progress for 3h3m, 32.26% done, 6h24m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk2    DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c9t3d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  REMOVED      0     0     0
            c11t3d0  ONLINE       0     0     0
            c11t4d0  ONLINE       0     0     0
            c11t5d0  ONLINE       0     0     0

errors: No known data errors

  pool: uberdisk3
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 2h58m, 31.95% done, 6h19m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk3    DEGRADED     1     0     0
          raidz2     DEGRADED     4     0     0
            c9t6d0   ONLINE       0     0     0
            c9t7d0   ONLINE       0     0     0
            c10t5d0  ONLINE       5     0     0
            c10t6d0  ONLINE      98    94     0
            c10t7d0  REMOVED      0     0     0
            c11t6d0  ONLINE       0     0     0
            c11t7d0  ONLINE       0     0     0
            c11t8d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
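A minimal sketch of the clear-and-scrub sequence described above, assuming
every drive reappears after the controller swap (pool names as in the status
output):

    # run once the replacement controller(s) are in and all drives are visible
    for pool in uberdisk1 uberdisk2 uberdisk3; do
        zpool clear "$pool"    # clear the fault state and reopen the devices
        zpool scrub "$pool"    # re-verify every block against its checksums
    done
    zpool status -x            # should report "all pools are healthy" when done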
Can you send output of iostat -xCzn as well as fmadm faulty, please? Is this
an E2 chassis? Are you using interposers?

On 6 Nov 2010 18:28, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
On 11/6/10 1:35 PM, "Khushil Dep" <khushil.dep@gmail.com> wrote:

> Is this an E2 chassis? Are you using interposers?

No, it's an SC846A chassis. There are no interposers or expanders; six
SFF-8087 "iPass" cables go from ports on the HBAs to ports on the backplane.

> Can you send output of iostat -xCzn as well as fmadm faulty please?

(Please pardon my line wrap.)

# iostat -xCzn
                    extended device statistics
    r/s    w/s     kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  255.0   15.9  20667.5  1424.4   0.0   3.0     0.0    11.2   0  35  c9
   34.4    2.3   2837.7   198.5   0.0   0.4     0.0    11.1   0   5  c9t0d0
   34.3    2.3   2837.6   198.5   0.0   0.4     0.0    11.3   0   5  c9t1d0
   34.4    2.3   2837.7   198.5   0.0   0.4     0.0    11.1   0   5  c9t2d0
   35.9    1.9   2918.2   162.1   0.0   0.4     0.0    11.9   0   5  c9t3d0
   35.8    1.9   2918.3   162.1   0.0   0.5     0.0    12.1   0   5  c9t4d0
   35.8    1.9   2918.2   162.1   0.0   0.5     0.0    11.9   0   5  c9t5d0
   22.2    1.7   1703.0   171.3   0.0   0.2     0.0     9.5   0   3  c9t6d0
   22.1    1.7   1696.8   171.2   0.0   0.2     0.0     9.5   0   3  c9t7d0
  239.2   15.8  19217.1  1433.5   0.0   2.8     0.0    10.8   0  32  c10
   34.6    2.3   2837.8   198.5   0.0   0.4     0.0    10.9   0   5  c10t0d0
   34.5    2.3   2837.7   198.5   0.0   0.4     0.0    11.0   0   5  c10t1d0
   34.4    2.3   2837.6   198.5   0.0   0.4     0.0    11.3   0   5  c10t2d0
   34.5    1.9   2800.5   162.1   0.0   0.4     0.0    12.0   0   5  c10t3d0
   34.5    1.9   2800.4   162.1   0.0   0.4     0.0    12.0   0   5  c10t4d0
   22.2    1.7   1703.1   171.3   0.0   0.2     0.0     9.5   0   3  c10t5d0
   22.2    1.7   1697.0   171.2   0.0   0.2     0.0     9.3   0   3  c10t6d0
   22.3    1.7   1703.1   171.3   0.0   0.2     0.0     9.2   0   3  c10t7d0
  243.5   15.5  19527.7  1397.1   0.0   2.8     0.0    10.9   0  32  c11
   34.5    2.3   2837.8   198.5   0.0   0.4     0.0    11.1   0   5  c11t1d0
   34.5    2.3   2837.9   198.5   0.0   0.4     0.0    11.0   0   5  c11t2d0
   35.8    1.9   2918.3   162.1   0.0   0.5     0.0    12.1   0   5  c11t3d0
   35.9    1.9   2918.2   162.1   0.0   0.5     0.0    11.9   0   5  c11t4d0
   36.2    1.9   2918.5   162.1   0.0   0.4     0.0    11.2   0   5  c11t5d0
   22.1    1.7   1696.8   171.2   0.0   0.2     0.0     9.5   0   3  c11t6d0
   22.2    1.7   1703.1   171.3   0.0   0.2     0.0     9.5   0   3  c11t7d0
   22.3    1.7   1697.1   171.2   0.0   0.2     0.0     9.2   0   3  c11t8d0
    0.0    0.0      1.0     0.3   0.0   0.0     0.5     1.4   0   0  c8d0

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:53 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:25 6ff5d64e-cf64-c2e3-864f-cc59c267c0e8  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulted but still in service
Problem in  : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:20 2c0236bb-53e2-e271-d6af-a21c2f0976aa  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulted and taken out of service
Problem in  : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:23 896d10f1-fa11-69bb-ae78-d18a56fd3288  ZFS-8000-HC    Major

Fault class : fault.fs.zfs.io_failure_wait
Affects     : zfs://pool=uberdisk1
                  faulted but still in service
Problem in  : zfs://pool=uberdisk1
                  faulty
Description : The ZFS pool has experienced currently unrecoverable I/O
              failures. Refer to http://sun.com/msg/ZFS-8000-HC for more
              information.
Response    : No automated response will be taken.
Impact      : Read and write I/Os cannot be serviced.
Action      : Make sure the affected devices are connected, then run
              'zpool clear'.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:30 989d0590-9e27-cd11-cba5-d7dbf7127ce1  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk3/vdev=e0209de35309a6f8
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3/vdev=e0209de35309a6f8
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:51 a2d736ac-14e9-cbf7-db28-84e25bfd4a3e  ZFS-8000-HC    Major

Fault class : fault.fs.zfs.io_failure_wait
Affects     : zfs://pool=uberdisk3
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3
                  faulty
Description : The ZFS pool has experienced currently unrecoverable I/O
              failures. Refer to http://sun.com/msg/ZFS-8000-HC for more
              information.
Response    : No automated response will be taken.
Impact      : Read and write I/Os cannot be serviced.
Action      : Make sure the affected devices are connected, then run
              'zpool clear'.
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
Sorry, I meant iostat -En; I'm looking for errors...

On 6 Nov 2010 18:56, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
On 11/6/10 2:21 PM, "Khushil Dep" <khushil.dep@gmail.com> wrote:

> Sorry, I meant iostat -En; I'm looking for errors...

# iostat -En
c8d0     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: IMATION-MAC25-0 Revision: Serial No: 87A0079B1808000
Size: 63.89GB <63887523840 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c9t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t2d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t3d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t4d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t6d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t1d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t2d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t3d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t4d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t5d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t6d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t7d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t8d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t0d0  Soft Errors: 0 Hard Errors: 1 Transport Errors: 8
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t1d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 8
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t2d0  Soft Errors: 0 Hard Errors: 2 Transport Errors: 16
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t3d0  Soft Errors: 0 Hard Errors: 3 Transport Errors: 13
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t4d0  Soft Errors: 0 Hard Errors: 2 Transport Errors: 19
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t5d0  Soft Errors: 0 Hard Errors: 1 Transport Errors: 1
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t6d0  Soft Errors: 0 Hard Errors: 2 Transport Errors: 12
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t7d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 9
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
Similar to what I've seen before: SATA disks in an 846 chassis with hardware
and transport errors, though on that occasion it was an E2 chassis with
interposers. How long has this system been up? Is it production, or can you
offline it and check that the firmware on all the LSI controllers is up to
date and matches across controllers?

Do an fmdump -u UUID -V on those faults and get the serial numbers of the
disks that have failed. Trial and error unless you wrote down which went
where, I'm afraid. If Hitachi provides a tool like SeaTools from Seagate, run
it against a disk and see if it's really faulty or if the HBA it was
connected to is on the blink.

Restore from backup might be inevitable, unless you're snapping and
auto-syncing to another system?

On 6 Nov 2010 19:25, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
On 11/6/10 2:35 PM, "Khushil Dep" <khushil.dep@gmail.com> wrote:

> Similar to what I've seen before: SATA disks in an 846 chassis with
> hardware and transport errors, though on that occasion it was an E2
> chassis with interposers. How long has this system been up? Is it
> production, or can you offline it and check that the firmware on all the
> LSI controllers is up to date and matches across controllers?

It's been up for about 6 months. I can offline them.

> Do an fmdump -u UUID -V on those faults and get the serial numbers of the
> disks that have failed. Trial and error unless you wrote down which went
> where, I'm afraid.

Here's the thing, though -- I'm really not at all sure it's the disks that
failed. The idea that, coincidentally, eight of my 24 disks would report
major errors all at the same time (because I scrub weekly and didn't catch
any errors last scrub), and all on the same controller -- well, that seems
much less likely than the idea that I just have a bad controller that needs
replacing.
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
The fmdump will let you get the serial of one disk and identify the
controller it's on, so you can swap it out and check.

On 6 Nov 2010 19:45, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
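A sketch of that lookup, using one of the event UUIDs from the fmadm faulty
output earlier in the thread; -V prints the full verbose event, which should
carry the vdev details Khushil is referring to:

    # dump the complete fault event for one of the ZFS-8000-FD faults
    fmdump -V -u 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2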
On 7/11/10 04:27 AM, Dave Pooser wrote:
> My setup: A SuperMicro 24-drive chassis with Intel dual-processor
> motherboard, three LSI SAS3081E controllers, and 24 SATA 2TB hard
> drives [...]
>
> First question -- is there an easy way to identify which controller is c10?

ls -alrt /dev/cfg/c10 will show you the physical path, which you can then
follow:

$ ls -lart /dev/cfg/c3
lrwxrwxrwx 1 root root 55 Nov 12  2009 /dev/cfg/c3 -> ../../devices/pci@0,0/pci10de,376@a/pci1000,3150@0:scsi

You can also make use of fmtopo -V:

# /usr/lib/fm/fmd/fmtopo -V
...
hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
  group: protocol       version: 1   stability: Private/Private
    resource    fmri    hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
    label       string  PCIE0 Slot
    FRU         fmri    hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0
    ASRU        fmri    dev:////pci@0,0/pci10de,376@a/pci1000,3150@0
  group: authority      version: 1   stability: Private/Private
    product-id  string  Sun-Ultra-40-M2-Workstation
    chassis-id  string  0802FMY00N
    server-id   string  blinder
  group: io             version: 1   stability: Private/Private
    dev         string  /pci@0,0/pci10de,376@a/pci1000,3150@0
    driver      string  mpt
    module      fmri    mod:///mod-name=mpt/mod-id=57
  group: pci            version: 1   stability: Private/Private
    device-id   string  58
    extended-capabilities  string  pciexdev
    class-code  string  10000
    vendor-id   string  1000
    assigned-addresses  uint32[]  [ 2164391952 0 16384 0 256 2197946388 0 2686517248 0 16384 2197946396 0 2686451712 0 65536 ]

Note the "label" and "FRU" properties in the protocol group.

McB
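A hedged follow-on sketch, assuming the fmtopo -V output format shown above:
filter the verbose topology down to the node FMRIs plus their driver and
label properties, which gives a rough one-screen map of each HBA to its
physical slot.

    # print topology nodes with their driver and slot-label lines; the
    # egrep pattern assumes the property lines literally contain "driver"
    # and "label", as in the sample output above
    /usr/lib/fm/fmd/fmtopo -V | egrep 'hc://|driver|label'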
Wow, sounds familiar - binderedondat. I thought it was just when using
expanders... guess it's just anything 1068-based. Lost a 20TB pool to having
the controller basically just hose up what it was doing and write scragged
data to the disk.

1) The suggestion of using the serial number of the drive to trace back
what's connected to what is good, assuming you can pull drives to look at
their serial numbers.

2) One thing I've done over the years, given that I often use the same
motherboards, is to physically map out the PCI slot addresses:

/dev/cfg/c2  ../../devices/pci@0,0/pci8086,340a@3/pci1000,30a0@0:scsi
/dev/cfg/c3  ../../devices/pci@7a,0/pci8086,340c@5/pci1000,3010@0/iport@f:scsi
/dev/cfg/c4  ../../devices/pci@7a,0/pci8086,340c@5/pci1000,3010@0/iport@v0:scsi
/dev/cfg/c5  ../../devices/pci@7a,0/pci8086,340e@7/pci1000,3010@0/iport@f:scsi
/dev/cfg/c6  ../../devices/pci@7a,0/pci8086,340e@7/pci1000,3010@0/iport@v0:scsi
/dev/cfg/c7  ../../devices/pci@7a,0/pci8086,3410@9/pci1000,30a0@0:scsi

In those paths, the "pci8086,340a@3" component will correspond with a
physical slot; on a SM dual-IOH board, the "pci@0,0"/"pci@7a,0" prefixes
represent the two IOH-36s; and on a single-IOH board, I've noted that the
trailing unit-address (the "@3" -- it's "unit-address" from DDI) often
corresponds to the physical slot number.

So far, it's involved "stick a card in a slot / reboot / reconfig / see what
address it's at / note it down" or other forms of reverse engineering (a
one-pass version is sketched after this message). Handy to have
occasionally. If you're doing a BYO, taking the time up front to figure this
out is a Good Idea.

3) Get a copy of "lsiutil" for Solaris (available from LSI's site) -- it's
an easy way to check out the controller and see if it's there or whether it
sees the drives or what. (There is a newer version of lsiutil that supports
the 2008s... strangely, it's not available from the LSI site. Their tech
support didn't even know it existed when I asked. I got my copy off someone
on hardforum.)

4) Things you didn't want to know: the LSI1068 actually has a very small
write cache on board. So if you manage a certain set of situations (namely,
setting the "device i/o timeout" in the BIOS to something other than 0, then
having a SATA drive blow up in a certain way such that it hangs for longer
than the timeout you set), the mpt driver (it seems) can get impatient and
re-initialize the controller, or that's what it looks like. Great way to
scrag a volume. :(

5) Your basic plan seems sound.

> From: Dave Pooser <dave.zfs@alfordmedia.com>
> Subject: [zfs-discuss] Apparent SAS HBA failure-- now what?
>
> First question -- is there an easy way to identify which controller is
> c10? Second question -- what is the best way to handle replacement (of
> either the bad controller or of all three controllers if I can't identify
> the bad controller)? I was thinking that I should be able to shut the
> server down, remove the controller(s), install the replacement
> controller(s), check to see that all the drives are visible, run zpool
> clear for each pool and then do another scrub to verify the problem has
> been resolved. Does that sound like a good plan?
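A hypothetical condensation of the slot-mapping step from point 2 above:
walk every /dev/cfg controller symlink in one pass instead of checking them
by hand (this assumes the standard /dev/cfg symlink layout shown above).

    #!/bin/sh
    # print each controller instance alongside its PCI device path; the
    # slot correspondences described above can be read straight off this
    for c in /dev/cfg/c*; do
        path=$(ls -l "$c" | awk '{ print $NF }')   # the symlink target
        printf '%s -> %s\n' "$c" "$path"
    done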
On 8/11/10 10:21 AM, Jeff Bacon wrote:
> Wow, sounds familiar - binderedondat. I thought it was just when using
> expanders... guess it's just anything 1068-based. Lost a 20TB pool to
> having the controller basically just hose up what it was doing and write
> scragged data to the disk.
>
> 1) The suggestion of using the serial number of the drive to trace back
> what's connected to what is good, assuming you can pull drives to look at
> their serial numbers.

Except that this could cause more problems if you happen to pull the wrong
one in the middle of a resilver operation. Or anything, really.

> 2) One thing I've done over the years, given that I often use the same
> motherboards, is to physically map out the PCI slot addresses [...]
>
> So far, it's involved "stick a card in a slot / reboot / reconfig / see
> what address it's at / note it down" or other forms of reverse
> engineering. Handy to have occasionally. If you're doing a BYO, taking
> the time up front to figure this out is a Good Idea.

This is what FMA's libtopo solves for you. Fairly well, too:

# /usr/lib/fm/fmd/fmtopo -V
....
    label       string  PCIE0 Slot
....

Note the label property.

> 3) Get a copy of "lsiutil" for Solaris (available from LSI's site) --
> it's an easy way to check out the controller and see if it's there or
> whether it sees the drives or what. (There is a newer version of lsiutil
> that supports the 2008s... strangely, it's not available from the LSI
> site. Their tech support didn't even know it existed when I asked. I got
> my copy off someone on hardforum.)

Sigh. Sun (and now Oracle) didn't distribute lsiutil due at least in part to
the likelihood of customers killing their HBAs -- which, having used lsiutil
(to recover from failed operations), is depressingly easy. The replacement
is sas2ircu. I believe the same caveats apply.

> 4) Things you didn't want to know: the LSI1068 actually has a very small
> write cache on board. So if you manage a certain set of situations
> (namely, setting the "device i/o timeout" in the BIOS to something other
> than 0, then having a SATA drive blow up in a certain way such that it
> hangs for longer than the timeout you set), the mpt driver (it seems) can
> get impatient and re-initialize the controller, or that's what it looks
> like. Great way to scrag a volume. :(

The 1068 also has a limitation of 122 devices for its logical target-id
concept. But we don't talk about that in polite company :-)

Please, go and have a poke around the output from libtopo. I think you'll be
pleasantly surprised at what you can discover with it.

McB
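For the newer 2008-based HBAs mentioned above, a minimal read-only sas2ircu
session might look like the following (LIST and DISPLAY are sas2ircu's
documented subcommands; the controller index 0 is an assumption):

    sas2ircu LIST         # enumerate the SAS2 controllers the utility can see
    sas2ircu 0 DISPLAY    # show controller 0's firmware and attached drives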