Robert Milkowski
2007-Mar-29  10:39 UTC
[zfs-discuss] Detecting failed drive under MPxIO + ZFS
Hello storage-discuss,
First, I'm aware of Eric Schrock's proposal "ZFS hotplug support and
autoconfiguration".

I have presented each physical disk from an EMC CX3-40 as a separate LUN
and then created a RAID-10 pool using ZFS. All devices are under MPxIO;
the system is S10U3 + patches (x64).
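For reference, the pool was built roughly like this - mirror pairs of whole
MPxIO LUNs (the c6t<GUID-n>d0 names below are placeholders, not the actual
devices):

bash-3.00# zpool create f4-1 \
               mirror c6t<GUID-1>d0 c6t<GUID-2>d0 \
               mirror c6t<GUID-3>d0 c6t<GUID-4>d0 \
               [...]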
I then physically removed two disks from the array.
Mar 29 12:02:10 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:10 XXXXXXX.srv    /scsi_vhci/disk@g6006016062231b0070bf2a6318d9db11 (sd81): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,b is now STANDBY because of an externally initiated failover
Mar 29 12:02:10 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:10 XXXXXXX.srv    /scsi_vhci/disk@g6006016062231b0070bf2a6318d9db11 (sd81): path /pci@0,0/pci10de,5d@e/pci1077,143@0/fp@0,0 (fp0) target address 5006016041e03566,b is now ONLINE because of an externally initiated failover
Mar 29 12:02:15 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:15 XXXXXXX.srv    /scsi_vhci/disk@g6006016062231b00148e596c18d9db11 (sd80): path /pci@0,0/pci10de,5d@e/pci1077,143@0/fp@0,0 (fp0) target address 5006016041e03566,c is now STANDBY because of an externally initiated failover
Mar 29 12:02:15 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:15 XXXXXXX.srv    /scsi_vhci/disk@g6006016062231b00148e596c18d9db11 (sd80): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,c is now ONLINE because of an externally initiated failover
Mar 29 12:02:20 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:20 XXXXXXX.srv    /scsi_vhci/disk@g6006016062231b003651838018d9db11 (sd78): path /pci@0,0/pci10de,5d@e/pci1077,143@0/fp@0,0 (fp0) target address 5006016041e03566,e is now STANDBY because of an externally initiated failover
Mar 29 12:02:20 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:02:20 XXXXXXX.srv    /scsi_vhci/disk@g6006016062231b003651838018d9db11 (sd78): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,e is now ONLINE because of an externally initiated failover
Mar 29 12:03:24 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:24 XXXXXXX.srv    /scsi_vhci/disk@g6006016062231b00ba53c25a19d9db11 (sd64): path /pci@0,0/pci10de,5d@e/pci1077,143@0,1/fp@0,0 (fp1) target address 5006016841e03566,1c is now STANDBY because of an externally initiated failover
Mar 29 12:03:29 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:29 XXXXXXX.srv    Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:31 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:31 XXXXXXX.srv    Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:33 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:33 XXXXXXX.srv    Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:35 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:35 XXXXXXX.srv    Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
Mar 29 12:03:36 XXXXXXX.srv scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Mar 29 12:03:36 XXXXXXX.srv    Initiating failover for device disk (GUID 6006016062231b00ba53c25a19d9db11)
[...]
The last two lines are constantly repeating (several entries per
second).
bash-3.00# iostat -xnz 1|egrep " c6|devic"
[skipping first output]
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  3.0  0.0    0.0    0.0 100   0 c6t6006016062231B00A07323E419D9DB11d0
    0.0    0.0    0.0    0.0 34.0  0.0    0.0    0.0 100   0 c6t6006016062231B00BA53C25A19D9DB11d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  3.0  0.0    0.0    0.0 100   0 c6t6006016062231B00A07323E419D9DB11d0
    0.0    0.0    0.0    0.0 34.0  0.0    0.0    0.0 100   0 c6t6006016062231B00BA53C25A19D9DB11d0
It's been like that for over 30 minutes now (and it still is).
And of course ZFS hasn't noticed anything either.
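One more data point that might be worth collecting: the per-device error
counters, to see whether the sd driver is at least accumulating transport
errors on the stuck LUN, and whether any FMA ereports have been logged
(device name copied from the iostat output above):

bash-3.00# iostat -En c6t6006016062231B00BA53C25A19D9DB11d0
bash-3.00# fmdump -e | tail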
1. zpool status hangs:

bash-3.00# zpool status
  pool: f4-1
 state: ONLINE
 scrub: scrub stopped with 0 errors on Wed Mar 28 12:11:50 2007
^C^C^C^C^C

   I can't interrupt it and it never gets any further (zpool list and
   zfs list still work) - see the sketch after this list for how I would
   check where it is blocked.
2. MPxIO - it tries to fail the disk over to the second SP, but it looks
   like it keeps retrying forever (or at least for a very long time).
   After some timeout it should have reported a disk I/O failure back to
   ZFS... (see the path-state check sketched after this list)

3. I guess that in a case like this Eric's proposal probably won't help,
   and the real problem is with MPxIO - right?
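In case it helps with (1) and (2) above, this is roughly what I would look
at next: where the hung zpool status process is blocked, and what state
MPxIO itself reports for the paths of the LUN that keeps failing over
(device name copied from the iostat output above; mpathadm assumes the
MP-API bits are present on this build, and luxadm display should show
similar per-path information either way):

bash-3.00# pstack `pgrep -x zpool`
bash-3.00# mpathadm show lu /dev/rdsk/c6t6006016062231B00BA53C25A19D9DB11d0s2
bash-3.00# luxadm display /dev/rdsk/c6t6006016062231B00BA53C25A19D9DB11d0s2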
-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
Torrey McMahon
2007-Mar-30  01:14 UTC
[zfs-discuss] Re: [storage-discuss] Detecting failed drive under MPxIO + ZFS
Robert Milkowski wrote:
>
> 2. MPxIO - it tries to fail the disk over to the second SP, but it looks
>    like it keeps retrying forever (or at least for a very long time).
>    After some timeout it should have reported a disk I/O failure back to
>    ZFS...

Are there any other hosts connected to this storage array? It looks like
there might be another host ping-ponging the LUNs with this box.

> 3. I guess that in a case like this Eric's proposal probably won't help,
>    and the real problem is with MPxIO - right?

Well... I wouldn't say it's MPxIO's fault either. At least not at this point.
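A quick way to check that from the host side - assuming the other box logs
through syslog the same way - would be to look for matching failover
messages on it at the same timestamps, e.g.:

bash-3.00# grep "Initiating failover" /var/adm/messages | tail

If the other host shows the mirror-image failovers at the same times, the
LUNs are being trespassed back and forth between the two SPs.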
Robert Milkowski
2007-Mar-30  17:26 UTC
[zfs-discuss] Re[2]: [storage-discuss] Detecting failed drive under MPxIO + ZFS
Hello Torrey,

Friday, March 30, 2007, 3:14:27 AM, you wrote:

TM> Robert Milkowski wrote:
>> 2. MPxIO - it tries to fail the disk over to the second SP, but it looks
>>    like it keeps retrying forever (or at least for a very long time).
>>    After some timeout it should have reported a disk I/O failure back to
>>    ZFS...

TM> Are there any other hosts connected to this storage array? It looks like
TM> there might be another host ping-ponging the LUNs with this box.

There is another host connected (two hosts in total, each with two links,
both FCAL direct-attached, both running MPxIO). However, the second host
wasn't accessing any of the disks at the time (it was online, though), and
I can't find any log entries on it from that time frame.

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com