My setup: A SuperMicro 24-drive chassis with Intel dual-processor motherboard,
three LSI SAS3081E controllers, and 24 SATA 2TB hard drives, divided into
three pools, each pool a single eight-disk RAID-Z2. (Boot is an SSD connected
to motherboard SATA.)

This morning I got a cheerful email from my monitoring script: "Zchecker has
discovered a problem on bigdawg." The full output is below, but I have one
unavailable pool and two degraded pools, with all my problem disks connected
to controller c10. I have multiple spare controllers available.

First question -- is there an easy way to identify which controller is c10?

Second question -- what is the best way to handle replacement (of either the
bad controller, or of all three controllers if I can't identify the bad one)?
I was thinking that I should be able to shut the server down, remove the
controller(s), install the replacement controller(s), check that all the
drives are visible, run zpool clear for each pool, and then do another scrub
to verify the problem has been resolved. Does that sound like a good plan?

  pool: uberdisk1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 3h7m, 24.08% done, 9h52m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk1    UNAVAIL     55     0     0  insufficient replicas
          raidz2     UNAVAIL    112     0     0  insufficient replicas
            c9t0d0   ONLINE       0     0     0
            c9t1d0   ONLINE       0     0     0
            c9t2d0   ONLINE       0     0     0
            c10t0d0  UNAVAIL     43    30     0  experienced I/O failures
            c10t1d0  REMOVED      0     0     0
            c10t2d0  ONLINE      74     0     0
            c11t1d0  ONLINE       0     0     0
            c11t2d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: uberdisk2
 state: DEGRADED
 scrub: scrub in progress for 3h3m, 32.26% done, 6h24m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk2    DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c9t3d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  REMOVED      0     0     0
            c11t3d0  ONLINE       0     0     0
            c11t4d0  ONLINE       0     0     0
            c11t5d0  ONLINE       0     0     0

errors: No known data errors

  pool: uberdisk3
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 2h58m, 31.95% done, 6h19m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk3    DEGRADED     1     0     0
          raidz2     DEGRADED     4     0     0
            c9t6d0   ONLINE       0     0     0
            c9t7d0   ONLINE       0     0     0
            c10t5d0  ONLINE       5     0     0
            c10t6d0  ONLINE      98    94     0
            c10t7d0  REMOVED      0     0     0
            c11t6d0  ONLINE       0     0     0
            c11t7d0  ONLINE       0     0     0
            c11t8d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
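A minimal sketch of the clear-and-scrub sequence described above, assuming
every drive reappears after the controller swap (pool names as in the status
output):

    # run once the replacement controller(s) are in and all drives are visible
    for pool in uberdisk1 uberdisk2 uberdisk3; do
        zpool clear "$pool"    # clear the fault state and reopen the devices
        zpool scrub "$pool"    # re-verify every block against its checksums
    done
    zpool status -x            # should report "all pools are healthy" when done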
Can you send output of iostat -xCzn as well as fmadm faulty, please? Is this
an E2 chassis? Are you using interposers?

On 6 Nov 2010 18:28, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
On 11/6/10 1:35 PM, "Khushil Dep" <khushil.dep@gmail.com> wrote:

> Is this an E2 chassis? Are you using interposers?

No, it's an SC846A chassis. There are no interposers or expanders; six
SFF-8087 "iPass" cables go from ports on the HBAs to ports on the backplane.

> Can you send output of iostat -xCzn as well as fmadm faulty please?

(Please pardon my line wrap.)

# iostat -xCzn
                    extended device statistics
    r/s    w/s     kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  255.0   15.9  20667.5  1424.4   0.0   3.0     0.0    11.2   0  35  c9
   34.4    2.3   2837.7   198.5   0.0   0.4     0.0    11.1   0   5  c9t0d0
   34.3    2.3   2837.6   198.5   0.0   0.4     0.0    11.3   0   5  c9t1d0
   34.4    2.3   2837.7   198.5   0.0   0.4     0.0    11.1   0   5  c9t2d0
   35.9    1.9   2918.2   162.1   0.0   0.4     0.0    11.9   0   5  c9t3d0
   35.8    1.9   2918.3   162.1   0.0   0.5     0.0    12.1   0   5  c9t4d0
   35.8    1.9   2918.2   162.1   0.0   0.5     0.0    11.9   0   5  c9t5d0
   22.2    1.7   1703.0   171.3   0.0   0.2     0.0     9.5   0   3  c9t6d0
   22.1    1.7   1696.8   171.2   0.0   0.2     0.0     9.5   0   3  c9t7d0
  239.2   15.8  19217.1  1433.5   0.0   2.8     0.0    10.8   0  32  c10
   34.6    2.3   2837.8   198.5   0.0   0.4     0.0    10.9   0   5  c10t0d0
   34.5    2.3   2837.7   198.5   0.0   0.4     0.0    11.0   0   5  c10t1d0
   34.4    2.3   2837.6   198.5   0.0   0.4     0.0    11.3   0   5  c10t2d0
   34.5    1.9   2800.5   162.1   0.0   0.4     0.0    12.0   0   5  c10t3d0
   34.5    1.9   2800.4   162.1   0.0   0.4     0.0    12.0   0   5  c10t4d0
   22.2    1.7   1703.1   171.3   0.0   0.2     0.0     9.5   0   3  c10t5d0
   22.2    1.7   1697.0   171.2   0.0   0.2     0.0     9.3   0   3  c10t6d0
   22.3    1.7   1703.1   171.3   0.0   0.2     0.0     9.2   0   3  c10t7d0
  243.5   15.5  19527.7  1397.1   0.0   2.8     0.0    10.9   0  32  c11
   34.5    2.3   2837.8   198.5   0.0   0.4     0.0    11.1   0   5  c11t1d0
   34.5    2.3   2837.9   198.5   0.0   0.4     0.0    11.0   0   5  c11t2d0
   35.8    1.9   2918.3   162.1   0.0   0.5     0.0    12.1   0   5  c11t3d0
   35.9    1.9   2918.2   162.1   0.0   0.5     0.0    11.9   0   5  c11t4d0
   36.2    1.9   2918.5   162.1   0.0   0.4     0.0    11.2   0   5  c11t5d0
   22.1    1.7   1696.8   171.2   0.0   0.2     0.0     9.5   0   3  c11t6d0
   22.2    1.7   1703.1   171.3   0.0   0.2     0.0     9.5   0   3  c11t7d0
   22.3    1.7   1697.1   171.2   0.0   0.2     0.0     9.2   0   3  c11t8d0
    0.0    0.0      1.0     0.3   0.0   0.0     0.5     1.4   0   0  c8d0

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:53 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:25 6ff5d64e-cf64-c2e3-864f-cc59c267c0e8  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulted but still in service
Problem in  : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:20 2c0236bb-53e2-e271-d6af-a21c2f0976aa  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulted and taken out of service
Problem in  : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:23 896d10f1-fa11-69bb-ae78-d18a56fd3288  ZFS-8000-HC    Major

Fault class : fault.fs.zfs.io_failure_wait
Affects     : zfs://pool=uberdisk1
                  faulted but still in service
Problem in  : zfs://pool=uberdisk1
                  faulty
Description : The ZFS pool has experienced currently unrecoverable I/O
              failures. Refer to http://sun.com/msg/ZFS-8000-HC for more
              information.
Response    : No automated response will be taken.
Impact      : Read and write I/Os cannot be serviced.
Action      : Make sure the affected devices are connected, then run
              'zpool clear'.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:30 989d0590-9e27-cd11-cba5-d7dbf7127ce1  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk3/vdev=e0209de35309a6f8
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3/vdev=e0209de35309a6f8
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 06 06:33:51 a2d736ac-14e9-cbf7-db28-84e25bfd4a3e  ZFS-8000-HC    Major

Fault class : fault.fs.zfs.io_failure_wait
Affects     : zfs://pool=uberdisk3
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3
                  faulty
Description : The ZFS pool has experienced currently unrecoverable I/O
              failures. Refer to http://sun.com/msg/ZFS-8000-HC for more
              information.
Response    : No automated response will be taken.
Impact      : Read and write I/Os cannot be serviced.
Action      : Make sure the affected devices are connected, then run
              'zpool clear'.
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
Sorry, I meant iostat -En; I'm looking for errors...

On 6 Nov 2010 18:56, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
On 11/6/10 2:21 PM, "Khushil Dep" <khushil.dep@gmail.com> wrote:

> Sorry, I meant iostat -En; I'm looking for errors...

# iostat -En
c8d0     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: IMATION-MAC25-0 Revision: Serial No: 87A0079B1808000
Size: 63.89GB <63887523840 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c9t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t2d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t3d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t4d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t6d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t1d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t2d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t3d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t4d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t5d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t6d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t7d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t8d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t0d0  Soft Errors: 0 Hard Errors: 1 Transport Errors: 8
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t1d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 8
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t2d0  Soft Errors: 0 Hard Errors: 2 Transport Errors: 16
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t3d0  Soft Errors: 0 Hard Errors: 3 Transport Errors: 13
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t4d0  Soft Errors: 0 Hard Errors: 2 Transport Errors: 19
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t5d0  Soft Errors: 0 Hard Errors: 1 Transport Errors: 1
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t6d0  Soft Errors: 0 Hard Errors: 2 Transport Errors: 12
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t7d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 9
Vendor: ATA Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
Similar to what I've seen before: SATA disks in an 846 chassis with hardware
and transport errors, though on that occasion it was an E2 chassis with
interposers. How long has this system been up? Is it production, or can you
offline it and check that the firmware on all the LSI controllers is up to
date and matches across controllers?

Do an fmdump -u UUID -V on those faults and get the serial numbers of the
disks that have failed. Trial and error unless you wrote down which went
where, I'm afraid. If Hitachi provides a tool like SeaTools from Seagate, run
it against a disk and see if it's really faulty or if the HBA it was
connected to is on the blink.

Restore from backup might be inevitable, unless you're snapping and
auto-syncing to another system?

On 6 Nov 2010 19:25, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
On 11/6/10 2:35 PM, "Khushil Dep" <khushil.dep@gmail.com> wrote:

> Similar to what I've seen before: SATA disks in an 846 chassis with
> hardware and transport errors, though on that occasion it was an E2
> chassis with interposers. How long has this system been up? Is it
> production, or can you offline it and check that the firmware on all the
> LSI controllers is up to date and matches across controllers?

It's been up for about 6 months. I can offline them.

> Do an fmdump -u UUID -V on those faults and get the serial numbers of the
> disks that have failed. Trial and error unless you wrote down which went
> where, I'm afraid.

Here's the thing, though -- I'm really not at all sure it's the disks that
failed. The idea that, coincidentally, eight of my 24 disks would report
major errors all at the same time (because I scrub weekly and didn't catch
any errors last scrub), and all on the same controller -- well, that seems
much less likely than the idea that I just have a bad controller that needs
replacing.
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
The fmdump will let you get the serial of one disk and identify the
controller it's on, so you can swap it out and check.

On 6 Nov 2010 19:45, Dave Pooser <dave.zfs@alfordmedia.com> wrote:
> [...]
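A sketch of that lookup, using one of the event UUIDs from the fmadm faulty
output earlier in the thread; -V prints the full verbose event, which should
carry the vdev details Khushil is referring to:

    # dump the complete fault event for one of the ZFS-8000-FD faults
    fmdump -V -u 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2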
On 7/11/10 04:27 AM, Dave Pooser wrote:
> My setup: A SuperMicro 24-drive chassis with Intel dual-processor
> motherboard, three LSI SAS3081E controllers, and 24 SATA 2TB hard
> drives [...]
>
> First question -- is there an easy way to identify which controller is c10?

ls -alrt /dev/cfg/c10 will show you the physical path, which you can then
follow:

$ ls -lart /dev/cfg/c3
lrwxrwxrwx 1 root root 55 Nov 12  2009 /dev/cfg/c3 -> ../../devices/pci@0,0/pci10de,376@a/pci1000,3150@0:scsi

You can also make use of fmtopo -V:

# /usr/lib/fm/fmd/fmtopo -V
...
hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
  group: protocol       version: 1   stability: Private/Private
    resource    fmri    hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
    label       string  PCIE0 Slot
    FRU         fmri    hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0
    ASRU        fmri    dev:////pci@0,0/pci10de,376@a/pci1000,3150@0
  group: authority      version: 1   stability: Private/Private
    product-id  string  Sun-Ultra-40-M2-Workstation
    chassis-id  string  0802FMY00N
    server-id   string  blinder
  group: io             version: 1   stability: Private/Private
    dev         string  /pci@0,0/pci10de,376@a/pci1000,3150@0
    driver      string  mpt
    module      fmri    mod:///mod-name=mpt/mod-id=57
  group: pci            version: 1   stability: Private/Private
    device-id   string  58
    extended-capabilities  string  pciexdev
    class-code  string  10000
    vendor-id   string  1000
    assigned-addresses  uint32[]  [ 2164391952 0 16384 0 256 2197946388 0 2686517248 0 16384 2197946396 0 2686451712 0 65536 ]

Note the "label" and "FRU" properties in the protocol group.

McB
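A hedged follow-on sketch, assuming the fmtopo -V output format shown above:
filter the verbose topology down to the node FMRIs plus their driver and
label properties, which gives a rough one-screen map of each HBA to its
physical slot.

    # print topology nodes with their driver and slot-label lines; the
    # egrep pattern assumes the property lines literally contain "driver"
    # and "label", as in the sample output above
    /usr/lib/fm/fmd/fmtopo -V | egrep 'hc://|driver|label'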
Wow, sounds familiar - binderedondat. I thought it was just when using
expanders... guess it's just anything 1068-based. Lost a 20TB pool to having
the controller basically just hose up what it was doing and write scragged
data to the disk.

1) The suggestion of using the serial number of the drive to trace back
what's connected to what is good, assuming you can pull drives to look at
their serial numbers.

2) One thing I've done over the years, given that I often use the same
motherboards, is to physically map out the PCI slot addresses:

/dev/cfg/c2  ../../devices/pci@0,0/pci8086,340a@3/pci1000,30a0@0:scsi
/dev/cfg/c3  ../../devices/pci@7a,0/pci8086,340c@5/pci1000,3010@0/iport@f:scsi
/dev/cfg/c4  ../../devices/pci@7a,0/pci8086,340c@5/pci1000,3010@0/iport@v0:scsi
/dev/cfg/c5  ../../devices/pci@7a,0/pci8086,340e@7/pci1000,3010@0/iport@f:scsi
/dev/cfg/c6  ../../devices/pci@7a,0/pci8086,340e@7/pci1000,3010@0/iport@v0:scsi
/dev/cfg/c7  ../../devices/pci@7a,0/pci8086,3410@9/pci1000,30a0@0:scsi

In those paths, the "pci8086,340a@3" component will correspond with a
physical slot; on a SM dual-IOH board, the "pci@0,0"/"pci@7a,0" prefixes
represent the two IOH-36s; and on a single-IOH board, I've noted that the
trailing unit-address (the "@3" -- it's "unit-address" from DDI) often
corresponds to the physical slot number.

So far, it's involved "stick a card in a slot / reboot / reconfig / see what
address it's at / note it down" or other forms of reverse engineering (a
one-pass version is sketched after this message). Handy to have
occasionally. If you're doing a BYO, taking the time up front to figure this
out is a Good Idea.

3) Get a copy of "lsiutil" for Solaris (available from LSI's site) -- it's
an easy way to check out the controller and see if it's there or whether it
sees the drives or what. (There is a newer version of lsiutil that supports
the 2008s... strangely, it's not available from the LSI site. Their tech
support didn't even know it existed when I asked. I got my copy off someone
on hardforum.)

4) Things you didn't want to know: the LSI1068 actually has a very small
write cache on board. So if you manage a certain set of situations (namely,
setting the "device i/o timeout" in the BIOS to something other than 0, then
having a SATA drive blow up in a certain way such that it hangs for longer
than the timeout you set), the mpt driver (it seems) can get impatient and
re-initialize the controller, or that's what it looks like. Great way to
scrag a volume. :(

5) Your basic plan seems sound.

> From: Dave Pooser <dave.zfs@alfordmedia.com>
> Subject: [zfs-discuss] Apparent SAS HBA failure-- now what?
>
> First question -- is there an easy way to identify which controller is
> c10? Second question -- what is the best way to handle replacement (of
> either the bad controller or of all three controllers if I can't identify
> the bad controller)? I was thinking that I should be able to shut the
> server down, remove the controller(s), install the replacement
> controller(s), check to see that all the drives are visible, run zpool
> clear for each pool and then do another scrub to verify the problem has
> been resolved. Does that sound like a good plan?
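A hypothetical condensation of the slot-mapping step from point 2 above:
walk every /dev/cfg controller symlink in one pass instead of checking them
by hand (this assumes the standard /dev/cfg symlink layout shown above).

    #!/bin/sh
    # print each controller instance alongside its PCI device path; the
    # slot correspondences described above can be read straight off this
    for c in /dev/cfg/c*; do
        path=$(ls -l "$c" | awk '{ print $NF }')   # the symlink target
        printf '%s -> %s\n' "$c" "$path"
    done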
On 8/11/10 10:21 AM, Jeff Bacon wrote:
> Wow, sounds familiar - binderedondat. I thought it was just when using
> expanders... guess it's just anything 1068-based. Lost a 20TB pool to
> having the controller basically just hose up what it was doing and write
> scragged data to the disk.
>
> 1) The suggestion of using the serial number of the drive to trace back
> what's connected to what is good, assuming you can pull drives to look at
> their serial numbers.

Except that this could cause more problems if you happen to pull the wrong
one in the middle of a resilver operation. Or anything, really.

> 2) One thing I've done over the years, given that I often use the same
> motherboards, is to physically map out the PCI slot addresses [...]
>
> So far, it's involved "stick a card in a slot / reboot / reconfig / see
> what address it's at / note it down" or other forms of reverse
> engineering. Handy to have occasionally. If you're doing a BYO, taking
> the time up front to figure this out is a Good Idea.

This is what FMA's libtopo solves for you. Fairly well, too:

# /usr/lib/fm/fmd/fmtopo -V
....
    label       string  PCIE0 Slot
....

Note the label property.

> 3) Get a copy of "lsiutil" for Solaris (available from LSI's site) --
> it's an easy way to check out the controller and see if it's there or
> whether it sees the drives or what. (There is a newer version of lsiutil
> that supports the 2008s... strangely, it's not available from the LSI
> site. Their tech support didn't even know it existed when I asked. I got
> my copy off someone on hardforum.)

Sigh. Sun (and now Oracle) didn't distribute lsiutil due at least in part to
the likelihood of customers killing their HBAs -- which, having used lsiutil
(to recover from failed operations), is depressingly easy. The replacement
is sas2ircu. I believe the same caveats apply.

> 4) Things you didn't want to know: the LSI1068 actually has a very small
> write cache on board. So if you manage a certain set of situations
> (namely, setting the "device i/o timeout" in the BIOS to something other
> than 0, then having a SATA drive blow up in a certain way such that it
> hangs for longer than the timeout you set), the mpt driver (it seems) can
> get impatient and re-initialize the controller, or that's what it looks
> like. Great way to scrag a volume. :(

The 1068 also has a limitation of 122 devices for its logical target-id
concept. But we don't talk about that in polite company :-)

Please, go and have a poke around the output from libtopo. I think you'll be
pleasantly surprised at what you can discover with it.

McB
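For the newer 2008-based HBAs mentioned above, a minimal read-only sas2ircu
session might look like the following (LIST and DISPLAY are sas2ircu's
documented subcommands; the controller index 0 is an assumption):

    sas2ircu LIST         # enumerate the SAS2 controllers the utility can see
    sas2ircu 0 DISPLAY    # show controller 0's firmware and attached drives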