After learning more about what fencing means when you see it in action (the default of emergency_restart();), I'm now researching how to determine what causes a fencing to occur.

This is SLES 10.2 on the 2.6.16-42.5 kernel, which means 1.4.1-sles is the version of ocfs2.

I know the default reply from Sunil will be to ask Novell... :-). But we actually have a support partnership with HP, and since they're not Novell, we have to wait for their backline contacts to make the connection. That's why I'm asking the users community in parallel with the support call. The call has been open for 8 hrs now with no callback yet.

We have a set of 6 servers, and they're only in a cluster for the sake of an ocfs2 shared volume. Today, within a one-minute time span, node 1 said it lost connectivity to nodes 2 and 3, followed about a minute later by losing connectivity to nodes 0 and 5. Nodes 1 and 4 stayed up, but 2, 3, 0, and 5 were all evicted and rebooted.

This happened on the prod cluster and simultaneously on our nonprod cluster. The only difference between nonprod and prod is that nonprod has 7 nodes rather than 6. On the nonprod cluster, 4 out of 7 servers rebooted due to node eviction.

This set of servers is set up across two blade chassis, and the NIC config is a private, non-routed VLAN: eth1 using a 192.168.x.y scheme. The blade servers were running a load average of about 1.1 or so, but they are 8-ways (dual quad-core), so that isn't exactly taxing the boxes. The LAN environment is 10 Gbit fiber from the interconnect modules on the chassis to the switches, with gigabit uplinks on the blades themselves. ifconfig shows no evidence of packet loss.

Questions: Can we set up redundant heartbeat IP connections? Can we also add a disk heartbeat? If it truly is network connectivity, can we set the timeout to be more lenient? And can we change the fencing to something other than a machine reset, e.g. unmount the volume, change it to read-only, etc.?

Thanks...

Angelo
> -----Original Message-----
>
> Questions:
> Can we set up redundant heartbeat ip connections? Can we also add a
> disk heartbeat? If it truly is network connectivity, can we set the
> timeout to be more lenient? And can we change the fencing to something
> other than machine reset? Eg unmount the volume, change it to read
> only, etc?

There is both a network and a disk heartbeat, as far as I know. The timeouts are controlled via /etc/sysconfig/o2cb (on Red Hat at least; not sure if SUSE does it the same way). In there you have:

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=dbtest

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=76

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000

The above values are what we use on our clusters; we have three 2-node, one 4-node, and one 6-node cluster. These are all running Red Hat EL4 on HP hardware (DL360 G4, G5 or DL380 G5).

> Thanks...
>
> Angelo
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
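Note that O2CB_HEARTBEAT_THRESHOLD is an iteration count, not a time. A quick sketch of the conversion, assuming the usual two-second disk heartbeat interval:

```shell
# Disk heartbeat dead time in seconds for a given O2CB_HEARTBEAT_THRESHOLD,
# assuming each heartbeat iteration is 2 seconds apart:
#   dead_time = (threshold - 1) * 2
for threshold in 31 76; do
  echo "threshold=$threshold -> $(( (threshold - 1) * 2 ))s"
done
```

So the stock threshold of 31 works out to roughly 60 seconds before a node is declared dead on disk, and the 76 shown above to roughly 150 seconds.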
The problem was likely storage related and not network related. Do you have netconsole set up? If so, look at the logs; they will tell you why that node was fenced.

Angelo McComis wrote:
> After learning more about what fencing means when you see it in
> action. (the default of emergency_restart(); ). I'm now researching
> how to determine what causes a fencing to occur.
>
> This is sles10.2 on the 2.6.16-42.5 kernel which means 1.4.1-sles is
> the version of ocfs2.
>
> [...]
>
> Questions:
> Can we set up redundant heartbeat ip connections? Can we also add a
> disk heartbeat? If it truly is network connectivity, can we set the
> timeout to be more lenient? And can we change the fencing to something
> other than machine reset? Eg unmount the volume, change it to read
> only, etc?
>
> Thanks...
>
> Angelo
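For the netconsole suggestion above: netconsole streams kernel printk output over UDP, so the fencing messages survive the reboot. A minimal setup sketch — the IPs, MAC, and ports below are placeholders, not values from this thread:

```shell
# Placeholders: substitute the sender's eth1 IP and the log receiver's IP/MAC.
src_ip=192.168.5.2
dst_ip=192.168.5.10
dst_mac=00:11:22:33:44:55

# netconsole parameter format:
#   <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
opts="netconsole=6665@${src_ip}/eth1,6666@${dst_ip}/${dst_mac}"
echo "$opts"    # sanity-check the string before loading the module

# As root on each cluster node:
#   modprobe netconsole "$opts"
# On the receiving host, capture the stream:
#   netcat -u -l -p 6666 | tee /var/log/ocfs2-fence.log
```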
Some more about my setup, which started the discussion... Version info, mount options, etc. are herein. If there are recommended changes, I'm open to suggestions; this is mostly an "out of the box" configuration. We are not running Oracle DB, just using this as a shared place for transaction files between application servers doing parallel processing.

So: do we want "datavolume,noatime" added to the mount options, rather than just _netdev and heartbeat=local? Will that help or hurt? Also, do we want to turn up HEARTBEAT_THRESHOLD?

BEERGOGGLES1:~# modinfo ocfs2
filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/ocfs2.ko
license:        GPL
author:         Oracle
version:        1.4.1-1-SLES
description:    OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
srcversion:     986DD1EE4F5ABD8A44FF925
depends:        ocfs2_dlm,jbd,ocfs2_nodemanager
supported:      yes
vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1

BEERGOGGLES1:~# modinfo ocfs2_dlm
filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
license:        GPL
author:         Oracle
version:        1.4.1-1-SLES
description:    OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
srcversion:     FDB660B2EB59EF106C6305F
depends:        ocfs2_nodemanager
supported:      yes
vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
parm:           dlm_purge_interval_ms:int
parm:           dlm_purge_locks_max:int

BEERGOGGLES1:~# modinfo jbd
filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/jbd/jbd.ko
license:        GPL
srcversion:     DCCDE02902B83F98EF81090
depends:
supported:      yes
vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1

BEERGOGGLES1:~# modinfo ocfs2_nodemanager
filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
license:        GPL
author:         Oracle
version:        1.4.1-1-SLES
description:    OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
srcversion:     B87371708A8B5E1828E14CD
depends:        configfs
supported:      yes
vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1

BEERGOGGLES1:~# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
  Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active

BEERGOGGLES1:~# mount | grep ocfs2
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/evms/prod_app on /opt/VendorApsp/sharedapp type ocfs2 (rw,_netdev,heartbeat=local)

BEERGOGGLES1:~# cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# On Debian based systems the preferred method is running
# 'dpkg-reconfigure ocfs2-tools'.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS

# O2CB_HEARTBEAT_MODE: Whether to use the native "kernel" or the "user"
# driven heartbeat (for example, for integration with heartbeat 2.0.x)
O2CB_HEARTBEAT_MODE="kernel"
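The timeout variables in this file are left unset, so the cluster runs on the built-in defaults (heartbeat dead threshold 31, idle timeout 30000 ms, per the o2cb status output above). If the decision is to make the cluster more tolerant of brief stalls, the file could be populated explicitly on every node — the numbers below are illustrative examples only, not a recommendation for this environment:

```shell
# /etc/sysconfig/o2cb -- example explicit values (illustrative only).
# All nodes must use identical settings, and o2cb must be restarted
# (cluster taken offline) for changes to take effect.
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=61      # (61-1)*2 = 120s disk heartbeat dead time
O2CB_IDLE_TIMEOUT_MS=60000       # network idle timeout, up from 30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
O2CB_HEARTBEAT_MODE="kernel"
```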