Whenever I mount my shared OCFS2 volume on the second node, the primary node's kernel panics. I have two SLES10 Xen guests, both able to access the same /dev/sdc1.

My /etc/ocfs2/cluster.conf:

cluster:
        node_count = 2
        name = ocfs2
node:
        ip_port = 7777
        ip_address = 10.24.1.65
        number = 1
        name = testnode1
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.24.1.63
        number = 0
        name = testnode0
        cluster = ocfs2

testnode0:~ # /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Checking O2CB heartbeat: Active

testnode0:~ # df | grep sdc1
/dev/sdc1    2097152    137108   1960044   7% /mnt/ocfs2

testnode1:~ # /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Checking O2CB heartbeat: Not active

testnode1:~ # df | grep sdc1
(Not mounted)

Then I try to mount the device on testnode1:

testnode1:~ # mount -t ocfs2 /dev/sdc1 /mnt/ocfs2/

The mount comes back OK, but within about a minute the node is "very sorry to be fencing this system by panicing."

This is what shows up in the logs.

--- testnode1 /var/log/messages:

Apr 3 11:07:41 testnode1 kernel: o2net: connected to node testnode0 (num 0) at 10.24.1.63:7777
Apr 3 11:07:41 testnode1 kernel: klogd 1.4.1, ---------- state change ----------
Apr 3 11:07:45 testnode1 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
Apr 3 11:07:45 testnode1 kernel: ocfs2_dlm: Nodes in domain ("F59ECDE2D42642F18D728F2AB96C3291"): 0 1
Apr 3 11:07:45 testnode1 kernel: (13756,0):ocfs2_find_slot:261 slot 0 is already allocated to this node!
Apr 3 11:07:45 testnode1 kernel: (13756,0):ocfs2_check_volume:1651 File system was not unmounted cleanly, recovering volume.
Apr 3 11:07:45 testnode1 kernel: (fs/jbd/recovery.c, 255): journal_recover: JBD: recovery, exit status 0, recovered transactions 3 to 4
Apr 3 11:07:45 testnode1 kernel: (fs/jbd/recovery.c, 257): journal_recover: JBD: Replayed 0 and revoked 0/0 blocks
Apr 3 11:07:45 testnode1 kernel: kjournald starting. Commit interval 5 seconds
Apr 3 11:07:45 testnode1 kernel: ocfs2: Mounting device (8,33) on (node 1, slot 0)
Apr 3 11:07:51 testnode1 kernel: o2net: no longer connected to node testnode0 (num 0) at 10.24.1.63:7777
Apr 3 11:08:23 testnode1 syslog-ng[1614]: Changing permissions on special file /dev/xconsole
Apr 3 11:08:23 testnode1 syslog-ng[1614]: Changing permissions on special file /dev/tty10
Apr 3 11:08:23 testnode1 kernel: (13773,0):dlm_do_master_request:1330 ERROR: link to 0 went down!
Apr 3 11:08:23 testnode1 kernel: (13773,0):dlm_get_lock_resource:914 ERROR: status = -107
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing

--- testnode0 /var/log/messages:

Apr 3 11:07:41 testnode0 kernel: o2net: accepted connection from node testnode1 (num 1) at 10.24.1.65:7777
Apr 3 11:07:41 testnode0 kernel: klogd 1.4.1, ---------- state change ----------
Apr 3 11:07:45 testnode0 kernel: ocfs2_dlm: Node 1 joins domain F59ECDE2D42642F18D728F2AB96C3291
Apr 3 11:07:45 testnode0 kernel: ocfs2_dlm: Nodes in domain ("F59ECDE2D42642F18D728F2AB96C3291"): 0 1
Apr 3 11:07:51 testnode0 kernel: o2net: connection to node testnode1 (num 1) at 10.24.1.65:7777 has been idle for 10 seconds, shutting it down.
Apr 3 11:07:51 testnode0 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1175616461.855528 now 1175616471.854226 dr 1175616466.855354 adv 1175616466.855376:1175616466.855377 func (9fb0e5b8:502) 1175616466.35269:1175616466.35272)
Apr 3 11:07:51 testnode0 kernel: o2net: no longer connected to node testnode1 (num 1) at 10.24.1.65:7777

Everything appears to be configured correctly from what I can tell, but why does it connect and then just disconnect?

Eli
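To rule out basic networking between the guests, a check like this (a rough sketch, assuming nc/netcat is installed; the IP and port come from cluster.conf above) succeeds from both sides:

testnode1:~ # ping -c 3 10.24.1.63
testnode1:~ # nc -z -w 5 10.24.1.63 7777 && echo "o2net port open"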
This is a known issue on SLES10. Ping Novell for the update.

Eli Criffield wrote:
> Whenever I mount my shared OCFS2 volume on the second node, the
> primary node's kernel panics.
> [rest of the configuration and logs snipped; see the original post above]
Yep, they have a new kernel that works.

Thanks,
Eli

On 4/3/07, Sunil Mushran <sunil.mushran@oracle.com> wrote:
> This is a known issue on SLES10. Ping Novell for the update.
> [quoted post snipped]
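For the archives: after pulling the update, a quick check that the fixed kernel is the one actually running (a sketch; the package name and fixed version number are whatever Novell ships for SLES10 Xen guests):

testnode1:~ # uname -r
testnode1:~ # rpm -q kernel-xen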
Is this also an issue on SLES9?

I see this exact issue on my SLES9 + OCFS2 1.2.1-4.2 RAC cluster, and it is always the same box in the cluster that hits the error.
You will have to provide more information. If you have a netconsole server configured, it would have the details. Else, I would recommend you configure one to catch the messages during the fence. We have to see those messages to deduce the reason for the fence and determine the actual problem.

enohi ibekwe wrote:
> Is this also an issue on SLES9?
> [snip]
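In case it saves someone a search, a minimal netconsole setup looks roughly like this (a sketch; every address, the interface name, and the MAC below are placeholders for your own log host and NIC):

# On the node that fences: stream kernel messages over UDP.
# Syntax: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
modprobe netconsole netconsole=6666@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55

# On the log host: capture whatever arrives during the fence.
netcat -u -l -p 6666 | tee /tmp/fence-console.log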
More details:

I am attempting to add a node (node 2) to an existing two-node (node 0 and node 1) cluster. All nodes are currently running SLES9 (2.6.5-7.283-bigsmp i686) + OCFS2 1.2.1-4.2, the OCFS2 package that ships with SLES9. Node 2 is not part of the RAC cluster yet; I have only installed OCFS2 on it. I can mount the OCFS2 file system on all nodes, and it is accessible from all nodes.

Node 0 is always the node that gets fenced, and it gets fenced very frequently. Before I added the kernel.panic parameter, node 0 would get fenced, panic, and hang; only a power reboot would make it responsive again. My issue is the frequency at which node 0 gets fenced: it has happened at least once a day for the last two days.

This is what happened this morning. I was remotely connected to node 0 via ssh when I suddenly lost the connection. I tried to ssh again, but node 0 refused the connection.

Checking node 1 dmesg I found:

ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for 10 seconds, shutting it down.
(0,3):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466 adv 1176207822.713475:1176207822.713476 func (1459c2a9:504) 1176196519.600486:1176196519.600489)
o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777

Checking node 2 dmesg I found:

ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for 10 seconds, shutting it down.
(0,0):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293 adv 1176207823.774297:1176207823.774297 func (1459c2a9:504) 1176196505.704238:1176196505.704240)
o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777

Since I had enabled reboot-on-panic on node 0, node 0 restarted. Checking its /var/log/messages I found:

Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR: stopping heartbeat on all active regions.
Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing

A

----Original Message Follows----
From: Sunil Mushran <Sunil.Mushran@oracle.com>
To: enohi ibekwe <enohiaghe@hotmail.com>
CC: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
Date: Fri, 06 Apr 2007 09:31:17 -0700

You will have to provide more information. If you have a netconsole server configured, it would have the details. Else, I would recommend you configure one to catch the messages during the fence.
[snip]
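Worth noting: node 1 and node 2 both declared the connection to ora1 idle within about a second of each other (now 1176207832.7 vs 1176207833.7), which points at ora1's interconnect rather than at either peer. While waiting for the next fence, something like this left running on ora1 could catch a stall (a sketch; eth0 and the peer address 10.12.1.35 are placeholders for the real interconnect NIC and a real peer address):

# Timestamped reachability log against one peer on the interconnect;
# gaps here should line up with the "idle for 10 seconds" messages.
while true; do
    echo "$(date '+%F %T') $(ping -c 1 -w 2 10.12.1.35 >/dev/null 2>&1 && echo up || echo DOWN)"
    sleep 1
done >> /var/log/o2net-ping.log &

# Capture the o2net traffic itself (port 7777) so a keepalive stall
# is visible in the trace.
tcpdump -i eth0 -nn port 7777 -w /tmp/o2net-7777.pcap &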