Nathan Ehresman
2007-Feb-15 07:06 UTC
[Ocfs2-users] 2 OCFS2 clusters that affect each other
I have a strange OCFS2 problem that has been plaguing me. I have 2 separate OCFS2 clusters, each consisting of 3 machines. One is an Oracle RAC, the other is used as a shared DocumentRoot for a web cluster. All 6 machines are in an IBM BladeCenter and thus have nearly identical hardware and use the same Ethernet switch and FC switch. All 6 machines connect to the same SAN but mount completely different partitions (LVMed). The 3 RAC nodes are running RHEL kernel 2.6.9-34.0.2.ELsmp and the 3 web heads are running kernel 2.6.9-42.0.3. All 6 machines are running OCFS2 1.2.4. Also, all 6 nodes have their O2CB_HEARTBEAT_THRESHOLD set to 31, since the timeout on my HBAs appears to be set at 60 seconds.

Every once in a while, if two of the web heads are powered on at the same time and begin to mount the shared OCFS2 partition, one of my Oracle nodes will complain that OCFS2 is self-fencing and then reboot itself (thanks to the hangcheck timer). It is always the 2nd node in the RAC cluster that does this, while nodes 1 and 3 stay up just fine. I have the following stack trace, taken from a netdump of the kernel on RAC node 2 when it goes down, but I am not familiar enough with OCFS2 internals to read it. Can anybody read this and give me any insight into what might be causing this problem?

 [<c0129a20>] check_timer_failed+0x3c/0x58
 [<c0129c7d>] del_timer+0x12/0x65
 [<f88f326b>] qla2x00_done+0x2c6/0x37a [qla2xxx]
 [<f88fe7f6>] qla2300_intr_handler+0x25a/0x267 [qla2xxx]
 [<c0107472>] handle_IRQ_event+0x25/0x4f
 [<c01079d2>] do_IRQ+0x11c/0x1ae
 ======================
 [<c02d304c>] common_interrupt+0x18/0x20
 [<f8c9007b>] ocfs2_do_truncate+0x37a/0xb84 [ocfs2]
 [<c02d122b>] _spin_lock+0x27/0x34
 [<f8c9700c>] ocfs2_cluster_lock+0xf2/0x894 [ocfs2]
 [<f8c96ea1>] ocfs2_status_completion_cb+0x0/0xa [ocfs2]
 [<f8c99444>] ocfs2_meta_lock_full+0x1e7/0x57e [ocfs2]
 [<c016e4c0>] dput+0x34/0x1a7
 [<c01668c8>] link_path_walk+0x94/0xbe
 [<c01672e3>] open_namei+0x99/0x579
 [<f8ca7625>] ocfs2_inode_revalidate+0x11a/0x1f9 [ocfs2]
 [<f8ca3808>] ocfs2_getattr+0x0/0x14d [ocfs2]
 [<f8ca386b>] ocfs2_getattr+0x63/0x14d [ocfs2]
 [<f8ca3808>] ocfs2_getattr+0x0/0x14d [ocfs2]
 [<c0161fa2>] vfs_getattr+0x35/0x88
 [<c016201d>] vfs_stat+0x28/0x3a
 [<c01672e3>] open_namei+0x99/0x579
 [<c015990b>] filp_open+0x66/0x70
 [<c0162612>] sys_stat64+0xf/0x23
 [<c02d0ca2>] __cond_resched+0x14/0x39
 [<c01c23c2>] direct_strncpy_from_user+0x3e/0x5d
 [<c0159c7f>] sys_open+0x6a/0x7d
 [<c02d268f>] syscall_call+0x7/0xb

Thanks,

Nathan

--
nre :wq
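For context on the O2CB_HEARTBEAT_THRESHOLD value mentioned above: as documented for OCFS2 1.2, the disk heartbeat is declared dead after (threshold - 1) * 2 seconds, and the FAQ's rule of thumb is threshold = (timeout in seconds) / 2 + 1, so 31 corresponds to a 60-second window matching the HBA timeout. The small sketch below only illustrates that arithmetic; the function names are made up for this example and are not part of any OCFS2 tooling.

    # Sketch of the O2CB heartbeat threshold arithmetic (OCFS2 1.2 mapping:
    # timeout_seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2). Names are
    # illustrative only.

    def heartbeat_timeout_seconds(threshold: int) -> int:
        """Seconds a node may miss disk heartbeats before it self-fences."""
        return (threshold - 1) * 2

    def threshold_for_hba_timeout(hba_timeout_seconds: int) -> int:
        """Threshold suggested by the documented formula: timeout / 2 + 1."""
        return hba_timeout_seconds // 2 + 1

    assert threshold_for_hba_timeout(60) == 31    # 60 s HBA timeout -> 31
    assert heartbeat_timeout_seconds(31) == 60    # 31 -> 60 s fencing window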
Do you have the full oops trace?

Nathan Ehresman wrote:
> I have a strange OCFS2 problem that has been plaguing me. I have 2
> separate OCFS2 clusters, each consisting of 3 machines.
> [...]