Hi,
I have a 4-node cluster; all nodes run the same kernel:
Linux app19 2.6.9-48.ELxenU #1 SMP Sun Mar 4 19:50:03 EST 2007 x86_64 x86_64
x86_64 GNU/Linux
and the same OCFS2 modules:
OCFS2 Node Manager 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build
9e5f332181e8ebfad464946bcc4888af)
OCFS2 DLM 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build
e2556a71429f31033b275dff4b5594aa)
OCFS2 DLMFS 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build
e2556a71429f31033b275dff4b5594aa)
OCFS2 User DLM kernel interface loaded
From one moment to the next, the ocfs2 filesystems froze:
/home/user
/usr
I rebooted one node (the one with the highest load) and it kept
rebooting over and over again with the following error:
(1768,0):dlm_convert_lock_handler:443 ERROR: Domain
CACE9ABE4D474B04A3C06C944B7D616D not fully joined!
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at dlmconvert:443
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: ocfs2(U) debugfs(U) ocfs2_dlmfs(U) ocfs2_dlm(U)
ocfs2_nodemanager(U) configfs(U) sunrpc dm_mod xennet ext3 jbd xenblk
Pid: 1768, comm: o2net Not tainted 2.6.9-48.ELxenU
RIP: e030:[<ffffffffa00dcb8b>]
<ffffffffa00dcb8b>{:ocfs2_dlm:dlm_convert_lock_handler+376}
RSP: e02b:ffffff807d419d88 EFLAGS: 00010292
RAX: 000000000000006a RBX: ffffff807e6bdf00 RCX: 00000000000013ba
RDX: 00000000000013ba RSI: 0000000000000000 RDI: ffffffff8032b9a0
RBP: ffffff8009669400 R08: 00000000000927bf R09: ffffff807e6bdf00
R10: ffffffff801eb0a8 R11: 0000ffff80346560 R12: ffffff807ed48000
R13: ffffff807e6bdf00 R14: 0000000000000000 R15: ffffff807ed48018
FS: 0000002a95563da0(0000) GS:ffffffff8041d700(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process o2net (pid: 1768, threadinfo ffffff807d418000, task ffffff807e562030)
Stack: ffffffffff5fd000 0000000000000000 0000000000000000 ffffff8009786c00
0000000000000000 ffffff807e6bdf00 ffffff8009669400 ffffff807ed48000
ffffff807e6bdf00 0000000000000000
Call Trace:<ffffffffa009dac6>{:ocfs2_nodemanager:o2net_process_message+1567}
<ffffffffa009dd03>{:ocfs2_nodemanager:o2net_rx_until_empty+0}
<ffffffffa009e5b6>{:ocfs2_nodemanager:o2net_rx_until_empty+2227}
<ffffffff8014092e>{worker_thread+419}
<ffffffff8012b177>{default_wake_function+0}
<ffffffff8012b1c8>{__wake_up_common+67}
<ffffffff8012b177>{default_wake_function+0}
<ffffffff80144bd4>{keventd_create_kthread+0}
<ffffffff8014078b>{worker_thread+0}
<ffffffff80144bd4>{keventd_create_kthread+0}
<ffffffff80144bab>{kthread+200}
<ffffffff8010e092>{child_rip+8}
<ffffffff80144bd4>{keventd_create_kthread+0}
<ffffffff80144ae3>{kthread+0} <ffffffff8010e08a>{child_rip+0}
Code: 0f 0b 12 06 0f a0 ff ff ff ff bb 01 41 80 7f 0f 20 76 5c 48
RIP <ffffffffa00dcb8b>{:ocfs2_dlm:dlm_convert_lock_handler+376} RSP <ffffff807d419d88>
<0>Kernel panic - not syncing: Oops
Connection to xen3 closed.
I had to shut down all 4 nodes and start them one by one. I even checked the
filesystems with fsck.ocfs2 and it didn't report any errors.
Any clues?
Thanks
Nuno Fernandes