Robert Milkowski
2007-Mar-29 10:39 UTC
[zfs-discuss] Detecting failed drive under MPxIO + ZFS
Hello storage-discuss,

First - I'm aware of the proposal "ZFS hotplug support and autoconfiguration" by Eric Schrock.

I have presented each physical disk from an EMC CX3-40 as a LUN and then created a RAID-10 pool (mirrored pairs) using ZFS. All devices are under MPxIO; the system is S10U3 + patches (x64).

Now I have physically removed two disks from the array:

Mar 29 12:02:10 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:10 XXXXXXX.srv   /scsi_vhci/disk@g6006016062231b0070bf2a6318d9db11 (sd81): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,b is now STANDBY because of an externally initiated failover
Mar 29 12:02:10 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:10 XXXXXXX.srv   /scsi_vhci/disk@g6006016062231b0070bf2a6318d9db11 (sd81): path /pci@0,0/pci10de,5d@e/pci1077,143@0/fp@0,0 (fp0) target address 5006016041e03566,b is now ONLINE because of an externally initiated failover
Mar 29 12:02:15 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:15 XXXXXXX.srv   /scsi_vhci/disk@g6006016062231b00148e596c18d9db11 (sd80): path /pci@0,0/pci10de,5d@e/pci1077,143@0/fp@0,0 (fp0) target address 5006016041e03566,c is now STANDBY because of an externally initiated failover
Mar 29 12:02:15 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:15 XXXXXXX.srv   /scsi_vhci/disk@g6006016062231b00148e596c18d9db11 (sd80): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,c is now ONLINE because of an externally initiated failover
Mar 29 12:02:20 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:20 XXXXXXX.srv   /scsi_vhci/disk@g6006016062231b003651838018d9db11 (sd78): path /pci@0,0/pci10de,5d@e/pci1077,143@0/fp@0,0 (fp0) target address 5006016041e03566,e is now STANDBY because of an externally initiated failover
Mar 29 12:02:20 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:20 XXXXXXX.srv   /scsi_vhci/disk@g6006016062231b003651838018d9db11 (sd78): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,e is now ONLINE because of an externally initiated failover
Mar 29 12:03:24 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:24 XXXXXXX.srv   /scsi_vhci/disk@g6006016062231b00ba53c25a19d9db11 (sd64): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,1c is now STANDBY because of an externally initiated failover
Mar 29 12:03:29 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:29 XXXXXXX.srv   Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:31 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:31 XXXXXXX.srv   Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:33 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:33 XXXXXXX.srv   Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:35 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:35 XXXXXXX.srv   Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:36 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:36 XXXXXXX.srv   Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
[...]

The last two lines are constantly repeating (several entries per second).

bash-3.00# iostat -xnz 1 | egrep " c6|devic"
[skipping first output]
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  3.0  0.0    0.0    0.0 100   0 c6t6006016062231B00A07323E419D9DB11d0
    0.0    0.0    0.0    0.0 34.0  0.0    0.0    0.0 100   0 c6t6006016062231B00BA53C25A19D9DB11d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  3.0  0.0    0.0    0.0 100   0 c6t6006016062231B00A07323E419D9DB11d0
    0.0    0.0    0.0    0.0 34.0  0.0    0.0    0.0 100   0 c6t6006016062231B00BA53C25A19D9DB11d0

It's been like that for over 30 minutes (and it still is). ZFS hasn't noticed anything either, of course.

1. zpool status hangs:

bash-3.00# zpool status
  pool: f4-1
 state: ONLINE
 scrub: scrub stopped with 0 errors on Wed Mar 28 12:11:50 2007
^C^C^C^C^C

   I can't interrupt it and I can't get any further output (zpool list and zfs list are working).

2. MPxIO - it tries to fail the disk over to the second SP, but it looks like it keeps trying forever (or for a very, very long time). After some time it should have generated a disk I/O failure...

3. I guess that in such a case Eric's proposal probably won't help, and the real problem is with MPxIO - right?

--
Best regards,
 Robert                          mailto:rmilkowski@task.gda.pl
                                 http://milek.blogspot.com
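[Editor's note: for anyone reproducing this diagnosis, the per-path states can also be inspected directly. A minimal sketch (assuming the mpathadm utility from the SAN foundation software is present at this patch level; the LUN named below is the one stuck in failover per the iostat output above):

    # list every multipathed logical unit and its path count
    mpathadm list lu

    # show the detailed path states (ONLINE/STANDBY/OFFLINE)
    # for the LUN that is stuck in failover
    mpathadm show lu /dev/rdsk/c6t6006016062231B00BA53C25A19D9DB11d0s2

    # luxadm gives a similar per-path view for FC devices
    luxadm display /dev/rdsk/c6t6006016062231B00BA53C25A19D9DB11d0s2

Either view shows which path MPxIO currently considers ONLINE and which STANDBY, making it visible whether the failover ever settles or keeps flip-flopping.]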
Torrey McMahon
2007-Mar-30 01:14 UTC
[zfs-discuss] Re: [storage-discuss] Detecting failed drive under MPxIO + ZFS
Robert Milkowski wrote:
> 2. MPxIO - it tries to fail the disk over to the second SP, but it
>    looks like it keeps trying forever (or for a very, very long time).
>    After some time it should have generated a disk I/O failure...

Are there any other hosts connected to this storage array? It looks like there might be another host ping-ponging the LUNs with this box.

> 3. I guess that in such a case Eric's proposal probably won't help, and
>    the real problem is with MPxIO - right?

Well... I wouldn't say it's MPxIO's fault either. At least not at this point.
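[Editor's note: a quick way to test the ping-pong theory is to compare failover timestamps across every host attached to the array. A rough sketch (assuming syslog writes to the default /var/adm/messages on each host):

    # count "Initiating failover" events per minute; run on each host
    # and compare - matching bursts on two hosts at the same minutes
    # would suggest the hosts are stealing the LUN from each other
    grep "Initiating failover" /var/adm/messages | cut -c1-12 | sort | uniq -c

The cut -c1-12 keeps just "Mar 29 12:03"-style minute stamps, so uniq -c yields a per-minute event count.]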
Robert Milkowski
2007-Mar-30 17:26 UTC
[zfs-discuss] Re[2]: [storage-discuss] Detecting failed drive under MPxIO + ZFS
Hello Torrey,

Friday, March 30, 2007, 3:14:27 AM, you wrote:

TM> Robert Milkowski wrote:
>> 2. MPxIO - it tries to fail the disk over to the second SP, but it
>>    looks like it keeps trying forever (or for a very, very long time).
>>    After some time it should have generated a disk I/O failure...

TM> Are there any other hosts connected to this storage array? It looks like
TM> there might be another host ping-ponging the LUNs with this box.

There is another host connected (two hosts in total, each with two links, both FC-AL direct-attached, both under MPxIO). However, the second host wasn't accessing any disks at that time (though it was online), and I can't find any log entries on the second host from that time frame.

?

--
Best regards,
 Robert                          mailto:rmilkowski@task.gda.pl
                                 http://milek.blogspot.com
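[Editor's note: one more place worth checking on the second host is the FMA error log, since telemetry is recorded there even when nothing reaches syslog. A sketch (assuming the default FMA setup on S10; the time format may need adjusting per fmdump(1M)):

    # raw error telemetry from the window of the failover storm
    fmdump -e -t "03/29/07 12:00:00" -T "03/29/07 12:10:00"

    # any faults actually diagnosed from that telemetry
    fmdump -t "03/29/07 12:00:00" -T "03/29/07 12:10:00"

An empty report on the second host would support the view that it never touched those LUNs during the incident.]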