I have a 4-node SLES 10 cluster with all nodes attached to a SAN via
fiber. The SAN has an EVMS volume formatted with ocfs2. My
/etc/ocfs2/cluster.conf is below.

I can mount the volume on any single node, but as soon as I mount it on
a second node, one of the nodes gets fenced. There is never more than
one node active at a time.

When I check the status of the nodes (quickly, before they get fenced),
the status shows they are heartbeating:

# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Checking O2CB heartbeat: Active

========

Here are the logs from two machines (note that these are the logs from
both machines at the same time, captured via remote syslog on a third
machine) of what happens when node vs2 is already running and node vs3
joins the cluster (mounts the ocfs2 file system). In this instance vs3
gets fenced.

Jan 18 14:52:41 vs2 kernel: o2net: accepted connection from node vs3 (num 2) at 10.1.1.13:7777
Jan 18 14:52:41 vs3 kernel: o2net: connected to node vs2 (num 1) at 10.1.1.12:7777
Jan 18 14:52:45 vs3 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Node 2 joins domain 89FC5CB6C98B43B998AB8492874EA6CA
Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2
Jan 18 14:52:45 vs3 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2
Jan 18 14:52:45 vs3 kernel: kjournald starting.  Commit interval 5 seconds
Jan 18 14:52:45 vs3 kernel: ocfs2: Mounting device (253,13) on (node 2, slot 0)
Jan 18 14:52:45 vs3 udevd-event[5542]: run_program: ressize 256 too short
Jan 18 14:52:51 vs2 kernel: o2net: connection to node vs3 (num 2) at 10.1.1.13:7777 has been idle for 10 seconds, shutting it down.
Jan 18 14:52:51 vs2 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1169153561.99906 now 1169153571.93951 dr 1169153566.98030 adv 1169153566.98039:1169153566.98040 func (09ab0f3c:504) 1169153565.211482:1169153565.211485)
Jan 18 14:52:51 vs3 kernel: o2net: no longer connected to node vs2 (num 1) at 10.1.1.12:7777
Jan 18 14:52:51 vs2 kernel: o2net: no longer connected to node vs3 (num 2) at 10.1.1.13:7777

=========

I previously had configured ocfs2 for userspace heartbeating but
couldn't get that running, so I reconfigured for disk-based heartbeat.
Could that now be the cause of this problem?

Where do the nodes write the heartbeats? I see nothing on the ocfs2
file system.

Also, I have no /config directory as mentioned in the docs. Is that
normal?

Here is /etc/ocfs2/cluster.conf:

node:
        ip_port = 7777
        ip_address = 10.1.1.11
        number = 0
        name = vs1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.12
        number = 1
        name = vs2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.13
        number = 2
        name = vs3
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.14
        number = 3
        name = vs4
        cluster = ocfs2

cluster:
        node_count = 4
        name = ocfs2

Any tips on how I can go about diagnosing this problem?

Thanks,
John Lange
John, it's hard to tell without seeing the messages on the surviving
node. Do you remember how many node slots you created when formatting
the volume? Maybe you configured just one? If so, use tunefs.ocfs2 to
increase the number of slots.

If that's not the problem, please copy and paste the corresponding
messages from the surviving node.

thanks,
--Srini.
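If it helps, here is a minimal sketch of checking and, if needed, raising
the slot count with ocfs2-tools; /dev/sdX is a placeholder for the shared
device, and the flags should be verified against the tunefs.ocfs2 man page
for the installed version.

  Check how many node slots the volume was formatted with:

      # debugfs.ocfs2 -R "stats" /dev/sdX | grep -i slots

  If it reports fewer slots than you have nodes, grow it with the volume
  unmounted on all nodes:

      # tunefs.ocfs2 -N 4 /dev/sdX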
1. In SLES10, /config has been moved to /sys/kernel/config. That's how
it is on mainline.

2. To monitor the heartbeat, do:

# watch -d -n2 debugfs.ocfs2 -R "hb" /dev/sdX

This command will work if you have ocfs2-tools 1.2.2. (Not sure whether
SLES10 ships with 1.2.2 or 1.2.1.) If 1.2.1, do:

# watch -d -n2 "echo \"hb\" | debugfs.ocfs2 -n /dev/sdX | grep -v \"0000000000000000 0000000000000000 00000000\""

3. Configure netconsole to catch any oops stack trace.

4. From the looks of it, the issue is related to the disk heartbeat
timeout. Check the FAQ on increasing it to 60 secs from the default of
14 secs.
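For points 3 and 4 above, a minimal sketch, assuming the stock o2cb init
script reads /etc/sysconfig/o2cb and the standard netconsole module
parameters apply; the log-host IP, MAC, and port are placeholders to
adjust for your own syslog machine.

  Raise the disk heartbeat threshold (the same value on every node):

      In /etc/sysconfig/o2cb:
          O2CB_HEARTBEAT_THRESHOLD=31     # roughly (31-1)*2 = 60 secs

  Then, with the ocfs2 volumes unmounted, restart the cluster stack so
  the new value takes effect:

      # /etc/init.d/o2cb offline ocfs2
      # /etc/init.d/o2cb online ocfs2

  Load netconsole so an oops from a node that is about to fence still
  reaches a remote machine:

      # modprobe netconsole netconsole=@/eth0,6666@10.1.1.100/00:11:22:33:44:55

  On the receiving machine, capture the messages with, for example:

      # nc -u -l -p 6666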
As I remember from LinuxWorld, such a configuration requires using
heartbeat2 in addition to o2cb, and configuring OCFSv2 to do its
heartbeat through it (not directly). This is what SuSE tested:

- 4 nodes
- heartbeat2
- EVMS
- OCFSv2 interacting with heartbeat2
I just want to confirm, for the benefit of the list archives, that
downgrading the SUSE kernel to 2.6.16.21-0.25-smp did solve the fencing
problem.

Thank you.

John

On Thu, 2007-01-18 at 16:57 -0500, Charlie Sharkey wrote:
> It may be a problem with SLES10. It looks like the latest
> sles10 kernel patch (2.6.16.27-0.6) has this problem.
>
> Here is the problem as reported by someone earlier:
> http://oss.oracle.com/pipermail/ocfs2-users/2007-January/001181.html
> http://oss.oracle.com/pipermail/ocfs2-users/2007-January/001182.html
>
> Here is a bugzilla entry:
> http://oss.oracle.com/bugzilla/show_bug.cgi?id=835
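For anyone landing here from the archives, a rough sketch of the
downgrade itself, assuming the older kernel-smp RPM is available from
the SLES 10 media or update repository; the package file name and
architecture below are placeholders.

  # rpm -qa | grep kernel-smp              # note the installed version(s)
  # rpm -Uvh --oldpackage kernel-smp-2.6.16.21-0.25.<arch>.rpm
  # reboot                                 # then confirm with: uname -r

Installing with rpm -ivh instead typically keeps both kernel versions
installed and selectable from the boot menu, which makes it easier to
switch back once a fixed kernel update ships.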