Hi

I have been using ocfs2 (on RHEL4) for a few days and I have a problem. I set up
an ocfs2 cluster with 2 nodes.

Sometimes one node panics because it loses the connection to the other node:

Mar 5 16:49:16 node1 kernel: (0,2):o2net_idle_timer:1310 connection to node node2 (num 0) at 10.150.28.67:7777 has been idle for 10 seconds, shutting it down.
Mar 5 16:49:16 node1 kernel: (0,2):o2net_idle_timer:1321 here are some times that might help debug the situation: (tmr 1141573746.685964 now 1141573756.684348 dr 1141573746.685955 adv 1141573746.685968:1141573746.685968 func (beddbae4:504) 1141573746.685776:1141573746.685824)
Mar 5 16:49:16 node1 kernel: (2222,2):o2net_set_nn_state:411 no longer connected to node node2 (num 0) at 10.150.28.67:7777
Mar 5 16:49:16 node1 kernel: (2263,7):dlm_send_proxy_ast_msg:448 ERROR: status = -112
Mar 5 16:49:16 node1 kernel: (2263,7):dlm_flush_asts:556 ERROR: status = -112
Mar 5 16:49:20 node1 kernel: eip: f8b40ba2
Mar 5 16:49:20 node1 kernel: ------------[ cut here ]------------
Mar 5 16:49:20 node1 kernel: kernel BUG at include/asm/spinlock.h:133!
Mar 5 16:49:20 node1 kernel: invalid operand: 0000 [#1]
Mar 5 16:49:20 node1 kernel: SMP
Mar 5 16:49:20 node1 kernel: Modules linked in: md5 ipv6 parport_pc lp parport autofs4 ocfs2(U) debugfs(U) nfs lockd ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs(U) sunrpc microcode dm_mirror dm_mod button battery ac ohci_hcd cpqphp e1000 e100 mii tg3 floppy ext3 jbd qla6312(U) qla2300(U) qla2xxx(U) scsi_transport_fc qla2xxx_conf(U) cciss sd_mod scsi_mod
Mar 5 16:49:20 node1 kernel: CPU: 6
Mar 5 16:49:20 node1 kernel: EIP: 0060:[<c02cff11>] Not tainted VLI
Mar 5 16:49:20 node1 kernel: EFLAGS: 00010216 (2.6.9-22.0.2.ELsmp)
Mar 5 16:49:20 node1 kernel: EIP is at _spin_lock+0x1c/0x34
Mar 5 16:49:20 node1 kernel: eax: c02e3869 ebx: d36c7994 ecx: f654ee50 edx: f8b40ba2
Mar 5 16:49:20 node1 kernel: esi: d36c7980 edi: 00000000 ebp: 00000000 esp: f654ee54
Mar 5 16:49:20 node1 kernel: ds: 007b es: 007b ss: 0068
Mar 5 16:49:20 node1 kernel: Process o2hb-1C0CB88CEF (pid: 2258, threadinfo=f654e000 task=f72f6730)
Mar 5 16:49:20 node1 kernel: Stack: 00000000 f8b40ba2 d36c7988 f7043400 f8b40b88 00000000 00000000 f7043400
Mar 5 16:49:20 node1 kernel:        00000000 00000000 f8b50684 f7043430 f7043400 f8b5076a f704355c f7043558
Mar 5 16:49:20 node1 kernel:        f8c21920 f8c0b8f7 f7e7f880 00000000 f654eedc f654eedc f8c1f8a0 f8c0ba27
Mar 5 16:49:20 node1 kernel: Call Trace:
Mar 5 16:49:20 node1 kernel: [<f8b40ba2>] dlm_mle_node_down+0x10/0x73 [ocfs2_dlm]
Mar 5 16:49:20 node1 kernel: [<f8b40b88>] dlm_hb_event_notify_attached+0x6e/0x78 [ocfs2_dlm]
Mar 5 16:49:20 node1 kernel: [<f8b50684>] __dlm_hb_node_down+0x1a6/0x267 [ocfs2_dlm]
Mar 5 16:49:20 node1 kernel: [<f8b5076a>] dlm_hb_node_down_cb+0x25/0x3a [ocfs2_dlm]
Mar 5 16:49:20 node1 kernel: [<f8c0b8f7>] o2hb_fire_callbacks+0x62/0x6c [ocfs2_nodemanager]
Mar 5 16:49:20 node1 kernel: [<f8c0ba27>] o2hb_run_event_list+0x126/0x162 [ocfs2_nodemanager]
Mar 5 16:49:20 node1 kernel: [<f8c0c0f9>] o2hb_check_slot+0x4d2/0x4e7 [ocfs2_nodemanager]
Mar 5 16:49:20 node1 kernel: [<c022370a>] submit_bio+0xca/0xd2
Mar 5 16:49:20 node1 kernel: [<f8c0c3ed>] o2hb_do_disk_heartbeat+0x2b4/0x325 [ocfs2_nodemanager]
Mar 5 16:49:20 node1 kernel: [<f8c0c4e2>] o2hb_thread+0x0/0x291 [ocfs2_nodemanager]
Mar 5 16:49:20 node1 kernel: [<f8c0c56b>] o2hb_thread+0x89/0x291 [ocfs2_nodemanager]
Mar 5 16:49:20 node1 kernel: [<f8c0c4e2>] o2hb_thread+0x0/0x291 [ocfs2_nodemanager]
Mar 5 16:49:20 node1 kernel: [<c0133a9d>] kthread+0x73/0x9b
Mar 5 16:49:20 node1 kernel: [<c0133a2a>] kthread+0x0/0x9b
Mar 5 16:49:20 node1 kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
Mar 5 16:49:20 node1 kernel: Code: 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3 53 89 c3 81 78 04 ad 4e ad de 74 18 ff 74 24 04 68 69 38 2e c0 e8 33 23 e5 ff 58 5a <0f> 0b 85 00 23 29 2e c0 f0 fe 0b 79 09 f3 90 80 3b 00 7e f9 eb
Mar 5 16:49:20 node1 kernel: <0>Fatal exception: panic in 5 seconds

The problem is that this panic then triggers a panic on the second node. How can I prevent the panic? Add another node?

thanks
Fred
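For reference, a two-node layout like this is normally declared in /etc/ocfs2/cluster.conf on both machines. A minimal sketch is below: node2's entry follows the log above, while the cluster name, node1's address and its node number are only assumed placeholders. Adding a third node would just be one more node: stanza plus node_count = 3, with the same file copied to every node:

    cluster:
            node_count = 2
            name = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.150.28.67
            number = 0
            name = node2
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.150.28.66
            number = 1
            name = node1
            cluster = ocfs2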
Silviu Marin-Caea
2006-Mar-06 10:33 UTC
[Ocfs2-users] ocfs2 : Fatal exception: panic in 5 seconds
On Monday 06 March 2006 10:57, doof wrote:
> Hi
>
> I have been using ocfs2 (on RHEL4) for a few days and I have a problem. I set up
> an ocfs2 cluster with 2 nodes.
>
> Sometimes one node panics because it loses the connection to the other node

Why does it lose connection?

1. Check the interconnect switch. Is it OK? Try mounting something through NFS
   and transferring 10 GB (see the sketch below).
2. Try the bcm5700 driver instead of tg3.

The fact that node1 panics after node0 panicked is something I'd like to see
clarified myself. What OCFS2 version do you have?
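For the NFS test in point 1, something along these lines would do; the export path, mount point and interface name are placeholders, so adjust them to your setup:

    # on node1, push ~10 GB across the interconnect and read it back
    mkdir -p /mnt/nfstest
    mount -t nfs node2:/some/export /mnt/nfstest   # node2:/some/export is a placeholder export
    dd if=/dev/zero of=/mnt/nfstest/bigfile bs=1M count=10240
    dd if=/mnt/nfstest/bigfile of=/dev/null bs=1M
    umount /mnt/nfstest

    # while it runs, watch the interconnect NIC for errors and drops
    # (eth1 is only an example; use whichever interface carries the 10.150.28.x traffic)
    ifconfig eth1

If the transfer stalls or the error counters climb, the switch or the tg3 driver is the first suspect.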
Sunil Mushran
2006-Mar-06 18:24 UTC
[Ocfs2-users] ocfs2 : Fatal exception: panic in 5 seconds
What version of OCFS2 are you on? Ensure you are running 1.2. I definitely
remember this bug being fixed. (A quick way to check the version is sketched
below the quote.)

doof wrote:
> Hi
>
> I have been using ocfs2 (on RHEL4) for a few days and I have a problem. I set up
> an ocfs2 cluster with 2 nodes.
>
> Sometimes one node panics because it loses the connection to the other node:
>
> [... kernel log and call trace snipped ...]
>
> The problem is that this panic then triggers a panic on the second node. How can I
> prevent the panic? Add another node?
>
> thanks
> Fred
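If it helps, the version is easy to confirm with something like the following; package names vary between kernel flavours, so the grep patterns here are just examples:

    # installed OCFS2 packages
    rpm -qa | grep -i ocfs2

    # the loaded module usually reports its version as well
    modinfo ocfs2 | grep -i version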