florian.engelmann at bt.com
2009-Jun-05 09:26 UTC
[Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA
Hello,

our Debian etch cluster nodes are panicking because of ocfs2 fencing if one SAN path fails.

modinfo ocfs2
filename:       /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
author:         Oracle
license:        GPL
description:    OCFS2 1.3.3
version:        1.3.3
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:        ocfs2_dlm,ocfs2_nodemanager,jbd
srcversion:     0798424846E68F10172C203

modinfo ocfs2_dlmfs
filename:       /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
author:         Oracle
license:        GPL
description:    OCFS2 DLMFS 1.3.3
version:        1.3.3
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:        ocfs2_dlm,ocfs2_nodemanager
srcversion:     E3780E12396118282B3C1AD

defr1elcbtd02:~# modinfo ocfs2_dlm
filename:       /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
author:         Oracle
license:        GPL
description:    OCFS2 DLM 1.3.3
version:        1.3.3
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:        ocfs2_nodemanager
srcversion:     7DC395EA08AE4CE826C5B92

modinfo ocfs2_nodemanager
filename:       /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
author:         Oracle
license:        GPL
description:    OCFS2 Node Manager 1.3.3
version:        1.3.3
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:        configfs
srcversion:     C4C9871302E1910B78DAE40

modinfo qla2xxx
filename:       /lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
author:         QLogic Corporation
description:    QLogic Fibre Channel HBA Driver
license:        GPL
version:        8.01.07-k1
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:        scsi_mod,scsi_transport_fc,firmware_class
alias:          pci:v00001077d00002100sv*sd*bc*sc*i*
alias:          pci:v00001077d00002200sv*sd*bc*sc*i*
alias:          pci:v00001077d00002300sv*sd*bc*sc*i*
alias:          pci:v00001077d00002312sv*sd*bc*sc*i*
alias:          pci:v00001077d00002322sv*sd*bc*sc*i*
alias:          pci:v00001077d00006312sv*sd*bc*sc*i*
alias:          pci:v00001077d00006322sv*sd*bc*sc*i*
alias:          pci:v00001077d00002422sv*sd*bc*sc*i*
alias:          pci:v00001077d00002432sv*sd*bc*sc*i*
alias:          pci:v00001077d00005422sv*sd*bc*sc*i*
alias:          pci:v00001077d00005432sv*sd*bc*sc*i*
srcversion:     B8E1608E257391DCAFD9224
parm:           ql2xfdmienable:Enables FDMI registratons Default is 0 - no FDMI. 1 - perfom FDMI. (int)
parm:           extended_error_logging:Option to enable extended error logging, Default is 0 - no logging. 1 - log errors. (int)
parm:           ql2xallocfwdump:Option to enable allocation of memory for a firmware dump during HBA initialization. Memory allocation requirements vary by ISP type. Default is 1 - allocate memory. (int)
parm:           ql2xloginretrycount:Specify an alternate value for the NVRAM login retry count. (int)
parm:           ql2xplogiabsentdevice:Option to enable PLOGI to devices that are not present after a Fabric scan. This is needed for several broken switches. Default is 0 - no PLOGI. 1 - perfom PLOGI. (int)
parm:           qlport_down_retry:Maximum number of command retries to a port that returns a PORT-DOWN status. (int)
parm:           ql2xlogintimeout:Login timeout value in seconds. (int)

modinfo dm_multipath
filename:       /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-multipath.ko
description:    device-mapper multipath target
author:         Sistina Software <dm-devel at redhat.com>
license:        GPL
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:        dm-mod

modinfo dm_mod
filename:       /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-mod.ko
description:    device-mapper driver
author:         Joe Thornber <dm-devel at redhat.com>
license:        GPL
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:
parm:           major:The major number of the device mapper (uint)

modinfo dm_round_robin
filename:       /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-round-robin.ko
description:    device-mapper round-robin multipath path selector
author:         Sistina Software <dm-devel at redhat.com>
license:        GPL
vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
depends:        dm-multipath

There is no self-compiled software; only the official repository was used. The nodes are connected to our two independent SANs. The storage systems are EMC Clariion CX3-20f and EMC Clariion CX500.
multipath.conf:
defaults {
        rr_min_io               1000
        polling_interval        2
        no_path_retry           5
        user_friendly_names     yes
}

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
        device {
                vendor  "DGC"
                product "LUNZ"             # EMC management LUN
        }
        device {
                vendor  "ATA"              # We do not need multipathing for local drives
                product "*"
        }
        device {
                vendor  "AMI"              # No multipathing for SUN virtual devices
                product "*"
        }
        device {
                vendor  "HITACHI"          # No multipathing for local SCSI disks
                product "H101414SCSUN146G"
        }
}

devices {
        ## Device attributes for EMC CLARiiON
        device {
                vendor                  "DGC"
                product                 "*"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                hardware_handler        "1 emc"
                features                "1 queue_if_no_path"
                no_path_retry           fail
                path_checker            emc_clariion
                path_selector           "round-robin 0"
                failback                immediate
                user_friendly_names     yes
        }
}

multipaths {
        multipath {
                wwid    3600601603ac511001c7c92fec775dd11
                alias   stosan01_lun070
        }
}

multipath -ll:
stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID 5
[size=133G][features=0][hwhandler=1 emc]
\_ round-robin 0 [prio=2][active]
 \_ 0:0:1:1 sdd 8:48  [active][ready]
 \_ 1:0:1:1 sdh 8:112 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:1 sdb 8:16  [active][ready]
 \_ 1:0:0:1 sdf 8:80  [active][ready]

As we use lvm2, we added /dev/sd* to the filter:
filter = [ "r|/dev/cdrom|", "r|/dev/sd.*|" ]

Here is what happened, and what we did to reconstruct the situation and find a solution:

On 02.06.2009 we made a mistake in the zoning on one of our two SANs, and all servers (about 40) lost one path to the SAN. Only two servers crashed. Those two are the Debian etch heartbeat cluster described above.
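As a side note, the devnode regexes in a blacklist like the one above can be sanity-checked from the shell before reloading multipathd. A minimal sketch (the sample device names are made up for illustration):

```shell
# Sanity-check the blacklist devnode regex against a few sample kernel names.
re='^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*'
check() {
  if printf '%s\n' "$1" | grep -qE "$re"; then
    echo "$1: blacklisted"
  else
    echo "$1: kept"
  fi
}
check dm-7   # a device-mapper node -> blacklisted
check sdd    # a SAN path           -> kept for multipath
check st0    # a tape device        -> blacklisted
```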
The console showed a kernel panic because ocfs2 was fencing both nodes. This was the message:

o2hb_write_timeout: 165 ERROR: Heartbeat write timeout to device dm-7 after 12000 milliseconds

So we decided to change the o2cb settings to:

O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000

We switched all cluster resources to the 1st node to test the new settings on the second node. We then removed the 2nd node from the zoning (shutting down the FC port instead gave the same result) and got a different error, but still ended up with a kernel panic:

Jun 4 16:41:05 defr1elcbtd02 kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at 192.168.0.101:7777
Jun 4 16:41:27 defr1elcbtd02 kernel: rport-0:0-0: blocked FC remote port time out: removing target and saving binding
Jun 4 16:41:27 defr1elcbtd02 kernel: rport-0:0-1: blocked FC remote port time out: removing target and saving binding
Jun 4 16:41:27 defr1elcbtd02 kernel: sd 0:0:1:1: SCSI error: return code = 0x00010000
Jun 4 16:41:27 defr1elcbtd02 kernel: end_request: I/O error, dev sdd, sector 1672
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath: Failing path 8:48.
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath: Failing path 8:16.
Jun 4 16:41:27 defr1elcbtd02 kernel: scsi 0:0:1:1: rejecting I/O to device being removed
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: long trespass command will be send
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: honor reservation bit will not be set (default)
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: table: 253:7: multipath: error getting device
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: ioctl: error adding target to table
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: long trespass command will be send
Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: honor reservation bit will not be set (default)
Jun 4 16:41:29 defr1elcbtd02 kernel: device-mapper: multipath emc: emc_pg_init: sending switch-over command
Jun 4 16:42:01 defr1elcbtd02 kernel: (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
Jun 4 16:42:01 defr1elcbtd02 kernel: (10751,1):dlm_wait_for_node_death:374 5EE89BC01EFC405E9197C198DEEAE678: waiting 5000ms for notification of death of node 0
Jun 4 16:42:07 defr1elcbtd02 kernel: (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
Jun 4 16:42:07 defr1elcbtd02 kernel: (10751,1):dlm_wait_for_node_death:374 5EE89BC01EFC405E9197C198DEEAE678: waiting 5000ms for notification of death of node 0
[...]

After 60 seconds:

(8,0): o2quo_make_decision:143 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0

multipath -ll changed to:
stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID 5
[size=133G][features=0][hwhandler=1 emc]
\_ round-robin 0 [prio=1][active]
 \_ 0:0:1:1 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:1 sdb 8:16 [active][ready]

The ocfs2 filesystem is still mounted and writable.
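The o2quo message follows from the quorum rule for an even split: the surviving half must contain the lowest-numbered active node. A toy sketch of that decision, with node numbers mirroring the log (this is an illustration, not the actual o2cb code):

```shell
# Toy model of the o2quo decision in a 2-node cluster (not the real o2cb code).
# Node 1 has lost the network connection to node 0, so its connected set is
# just itself; the lowest active node (0) is not in that set -> self-fence.
quorum_decision() {
  lowest_active=$1; shift   # lowest-numbered heartbeating node
  connected=" $* "          # nodes this node can still reach (incl. itself)
  case "$connected" in
    *" $lowest_active "*) echo "keep running" ;;
    *)                    echo "fence self"   ;;
  esac
}
quorum_decision 0 1      # node 1 alone, node 0 unreachable -> "fence self"
quorum_decision 0 0 1    # both nodes reachable             -> "keep running"
```

This is why the node that loses sight of node 0 panics even though its storage paths recover: the decision is driven by the network view of the cluster, not by the SAN.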
Even if I enable the zoning (or the FC port) again within the 60 seconds, ocfs2 does not reconnect to node 1 and panics the kernel after 60 seconds, while multipath -ll shows both paths again.

I do not understand at all what the Ethernet heartbeat connection of ocfs2 has to do with the SAN connection.

The strangest thing of all is that this does not always happen! After some reboots the system keeps running stable, even if I shut down an FC port and enable it again many times. There is no consistent behaviour... It happens most of the time, but in about 10% of cases it does not, and everything works as intended.

Any explanations or ideas what causes this behaviour?

I will test this on Debian lenny to see if the Debian version makes a difference.

Best regards,
Florian
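One arithmetic note on the recurring 60-second window: per the relation commonly documented for o2cb, the disk heartbeat timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the threshold of 31 chosen above works out to exactly 60 seconds. Whether that is the window at play here is an assumption, but the numbers match:

```shell
# Disk heartbeat timeout implied by O2CB_HEARTBEAT_THRESHOLD, using the
# commonly documented o2cb relation: (threshold - 1) * 2 seconds.
threshold=31
timeout_s=$(( (threshold - 1) * 2 ))
echo "heartbeat threshold $threshold -> node considered dead after ${timeout_s}s"
```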
Srinivas Eeda
2009-Jun-05 18:25 UTC
[Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA
Florian,

the problem here seems to be with the network. The nodes are running into the network heartbeat timeout, and hence the second node is getting fenced. Do you see the o2net thread consuming 100% CPU on any node? If not, then please check your network.

Thanks,
--Srini

florian.engelmann at bt.com wrote:
> [...]
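A quick way to act on Srini's suggestion is to look for the o2net kernel thread in the process list and report its CPU usage. A sketch along these lines (the thread-name match is an assumption, and the output naturally depends on the host):

```shell
# Look for the o2net kernel thread and report its CPU usage; prints a
# fallback line if no such thread exists on this host.
ps -eo pcpu,comm | awk '
  $2 ~ /^o2net/ { found = 1; printf "o2net cpu: %s%%\n", $1 }
  END           { if (!found) print "no o2net thread found" }'
```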