Whenever I mount my shared OCFS2 volume on the second node, the
primary kernel panics.
I have two SLES10 Xen guests, both able to access the same /dev/sdc1.
My /etc/ocfs2/cluster.conf:

cluster:
	node_count = 2
	name = ocfs2

node:
	ip_port = 7777
	ip_address = 10.24.1.65
	number = 1
	name = testnode1
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 10.24.1.63
	number = 0
	name = testnode0
	cluster = ocfs2
testnode0:~ # /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Checking O2CB heartbeat: Active
testnode0:~ # df |grep sdc1
/dev/sdc1 2097152 137108 1960044 7% /mnt/ocfs2
testnode1:~ # /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Checking O2CB heartbeat: Not active
testnode1:~ # df |grep sdc1
(Not mounted)
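(A handy cross-check of which nodes actually have the volume mounted is
the mounted.ocfs2 utility from ocfs2-tools; a sketch, with the output
format approximated from memory:

testnode0:~ # mounted.ocfs2 -f
Device                FS     Nodes
/dev/sdc1             ocfs2  testnode0

The -f flag reads the on-disk slot map, so it reports mounts cluster-wide,
not just on the local node.)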
Then I try to mount the device on testnode1:

testnode1:~ # mount -t ocfs2 /dev/sdc1 /mnt/ocfs2/

The mount returns OK, but after about a minute the node fences itself by
panicking ("ocfs2 is very sorry to be fencing this system by panicing").

This is what shows up in the logs:
--- testnode1 /var/log/messages
Apr 3 11:07:41 testnode1 kernel: o2net: connected to node testnode0
(num 0) at 10.24.1.63:7777
Apr 3 11:07:41 testnode1 kernel: klogd 1.4.1, ---------- state change
----------
Apr 3 11:07:45 testnode1 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33
PDT 2006 (build sles)
Apr 3 11:07:45 testnode1 kernel: ocfs2_dlm: Nodes in domain
("F59ECDE2D42642F18D728F2AB96C3291"): 0 1
Apr 3 11:07:45 testnode1 kernel: (13756,0):ocfs2_find_slot:261 slot 0
is already allocated to this node!
Apr 3 11:07:45 testnode1 kernel: (13756,0):ocfs2_check_volume:1651
File system was not unmounted cleanly, recovering volume.
Apr 3 11:07:45 testnode1 kernel: (fs/jbd/recovery.c, 255):
journal_recover: JBD: recovery, exit status 0, recovered transactions
3 to 4
Apr 3 11:07:45 testnode1 kernel: (fs/jbd/recovery.c, 257):
journal_recover: JBD: Replayed 0 and revoked 0/0 blocks
Apr 3 11:07:45 testnode1 kernel: kjournald starting. Commit interval 5 seconds
Apr 3 11:07:45 testnode1 kernel: ocfs2: Mounting device (8,33) on
(node 1, slot 0)
Apr 3 11:07:51 testnode1 kernel: o2net: no longer connected to node
testnode0 (num 0) at 10.24.1.63:7777
Apr 3 11:08:23 testnode1 syslog-ng[1614]: Changing permissions on
special file /dev/xconsole
Apr 3 11:08:23 testnode1 syslog-ng[1614]: Changing permissions on
special file /dev/tty10
Apr 3 11:08:23 testnode1 kernel: (13773,0):dlm_do_master_request:1330
ERROR: link to 0 went down!
Apr 3 11:08:23 testnode1 kernel: (13773,0):dlm_get_lock_resource:914
ERROR: status = -107
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing
---testnode0 /var/log/messages
Apr 3 11:07:41 testnode0 kernel: o2net: accepted connection from node
testnode1 (num 1) at 10.24.1.65:7777
Apr 3 11:07:41 testnode0 kernel: klogd 1.4.1, ---------- state change
----------
Apr 3 11:07:45 testnode0 kernel: ocfs2_dlm: Node 1 joins domain
F59ECDE2D42642F18D728F2AB96C3291
Apr 3 11:07:45 testnode0 kernel: ocfs2_dlm: Nodes in domain
("F59ECDE2D42642F18D728F2AB96C3291"): 0 1
Apr 3 11:07:51 testnode0 kernel: o2net: connection to node testnode1
(num 1) at 10.24.1.65:7777 has been idle for 10 seconds, shutting it
down.
Apr 3 11:07:51 testnode0 kernel: (0,0):o2net_idle_timer:1314 here are
some times that might help debug the situation: (tmr 1175616461.855528
now 1175616471.854226 dr 1175616466.855354 adv
1175616466.855376:1175616466.855377 func (9fb0e5b8:502)
1175616466.35269:1175616466.35272)
Apr 3 11:07:51 testnode0 kernel: o2net: no longer connected to node
testnode1 (num 1) at 10.24.1.65:7777
Everything appears to be configured correctly as far as I can tell, but
why does it connect and then just disconnect?
Eli
This is a known issue on SLES10. Ping Novell for the update.
Yep, they have a new kernel that works. Thanks!

Eli

On 4/3/07, Sunil Mushran <sunil.mushran@oracle.com> wrote:
> This is a known issue on SLES10. Ping Novell for the update.
Is this also an issue on SLES9?

I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster, and the
error hits the same box in the cluster every time.
You will have to provide more information. If you have a netconsole
server configured, it would have the details. Else, I would recommend you
configure one to catch the messages during the fence. We have to see the
messages to deduce the reason for the fence and determine the actual
problem.
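For anyone following along, here is a minimal netconsole sketch for a 2.6
kernel; the host names, interface, IPs, MAC address and ports below are
placeholders to substitute with your own:

# On the node that fences (sender): stream kernel messages over UDP.
# Parameter format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac
panicker:~ # modprobe netconsole netconsole=6665@10.12.1.34/eth0,6666@10.12.1.50/00:11:22:33:44:55

# On the log host (receiver): capture the stream to a file.
loghost:~ # nc -u -l -p 6666 | tee netconsole.log

Because netconsole transmits from the kernel itself, it usually gets the
final fence/panic messages out even when syslog to local disk does not.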
More details:

I am attempting to add a node (node 2) to an existing two-node (node 0 and
node 1) cluster. All nodes are currently running SLES9 (2.6.5-7.283-bigsmp
i686) + ocfs 1.2.1-4.2, the ocfs package that ships with SLES9. Node 2 is
not part of the RAC cluster yet; I have only installed ocfs on it. I can
mount the ocfs file system on all nodes, and it is accessible from all
nodes.

Node 0 is the node that always gets fenced, and it gets fenced very
frequently. Before I added the kernel.panic parameter, node 0 would get
fenced, panic, and hang; only a power cycle would make it responsive
again.
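(For anyone wanting the same behavior: the reboot-on-panic mentioned above
is the stock kernel.panic sysctl; a sketch, where the 30-second delay is an
arbitrary choice:

ora1:~ # sysctl -w kernel.panic=30                      # reboot 30s after a panic
ora1:~ # echo "kernel.panic = 30" >> /etc/sysctl.conf   # persist across reboots

With the default of 0 the machine simply hangs at the panic, which matches
the power-cycle symptom described above.)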
My issue is the frequency at which node 0 gets fenced: it has happened at
least once a day over the last two days.
This is what happened this morning.
I was remotely connected to node 0 via ssh when I suddenly lost the
connection. I tried to ssh again, but node 0 refused the connection.
Checking node 1's dmesg I found:
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
10 seconds, shutting it down.
(0,3):o2net_idle_timer:1310 here are some times that might help debug the
situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466
adv 1176207822.713475:1176207822.713476 func (1459c2a9:504)
1176196519.600486:1176196519.600489)
o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
Checking node 2's dmesg I found:
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
10 seconds, shutting it down.
(0,0):o2net_idle_timer:1310 here are some times that might help debug the
situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293
adv 1176207823.774297:1176207823.774297 func (1459c2a9:504)
1176196505.704238:1176196505.704240)
o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
Since I had set reboot-on-panic on node 0, it restarted. Checking
/var/log/messages I found:
Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing
this node because it is only connected to 1 nodes and 2 is needed to make a
quorum out of 3 heartbeating nodes
Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR:
stopping heartbeat on all active regions.
Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing
this system by panicing
A
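A note on the timeouts involved here: the "idle for 10 seconds" messages
come from o2net's network idle timer, which is hard-coded in ocfs2 1.2.1
(later 1.2 releases made it tunable), and the fence then fires because
node 0 loses quorum, being connected to only 1 of the other 2 heartbeating
nodes. The disk heartbeat threshold, by contrast, can be raised via
/etc/sysconfig/o2cb; a sketch, assuming your ocfs2-tools exposes the
O2CB_HEARTBEAT_THRESHOLD variable (set the same value on every node, with
the volumes unmounted):

ora1:~ # grep THRESHOLD /etc/sysconfig/o2cb
O2CB_HEARTBEAT_THRESHOLD=7    # fence after (7-1)*2 = 12s of missed disk heartbeats
ora1:~ # sed -i 's/^O2CB_HEARTBEAT_THRESHOLD=.*/O2CB_HEARTBEAT_THRESHOLD=31/' /etc/sysconfig/o2cb
ora1:~ # /etc/init.d/o2cb stop && /etc/init.d/o2cb start

Raising the disk threshold will not help if the interconnect itself is
flaky, so checking the link (duplex mismatches, dropped packets) is worth
doing first.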