Nathan Ehresman
2007-Feb-15 07:06 UTC
[Ocfs2-users] 2 OCFS2 clusters that affect each other
I have a strange OCFS2 problem that has been plaguing me. I have 2 separate OCFS2 clusters, each consisting of 3 machines. One is an Oracle RAC, the other is used as a shared DocumentRoot for a web cluster. All 6 machines are in an IBM BladeCenter and thus have nearly identical hardware and use the same Ethernet switch and FC switch. All 6 machines connect to the same SAN but mount completely different partitions (LVMed). The 3 RAC nodes are running RHEL kernel 2.6.9-34.0.2.ELsmp and the 3 web heads are running kernel 2.6.9-42.0.3. All 6 machines are running OCFS2 1.2.4. Also, all 6 nodes have their O2CB_HEARTBEAT_THRESHOLD set to 31, since the timeout on my HBAs appears to be set at 60 seconds.

Every once in a while, if two of the web heads are powered on at the same time and begin to mount the shared OCFS2 partition, one of my Oracle nodes will complain that OCFS2 is self-fencing and then reboot itself (thanks to the hangcheck timer). It is always the 2nd node in the RAC cluster that does this, while nodes 1 and 3 stay up just fine. I have the following stack trace, taken from a netdump of the kernel on RAC node 2 when it goes down, but I am not familiar enough with OCFS2 internals to read it. Can anybody read this and give me any insight into what might be causing this problem?

 [<c0129a20>] check_timer_failed+0x3c/0x58
 [<c0129c7d>] del_timer+0x12/0x65
 [<f88f326b>] qla2x00_done+0x2c6/0x37a [qla2xxx]
 [<f88fe7f6>] qla2300_intr_handler+0x25a/0x267 [qla2xxx]
 [<c0107472>] handle_IRQ_event+0x25/0x4f
 [<c01079d2>] do_IRQ+0x11c/0x1ae
 ======================
 [<c02d304c>] common_interrupt+0x18/0x20
 [<f8c9007b>] ocfs2_do_truncate+0x37a/0xb84 [ocfs2]
 [<c02d122b>] _spin_lock+0x27/0x34
 [<f8c9700c>] ocfs2_cluster_lock+0xf2/0x894 [ocfs2]
 [<f8c96ea1>] ocfs2_status_completion_cb+0x0/0xa [ocfs2]
 [<f8c99444>] ocfs2_meta_lock_full+0x1e7/0x57e [ocfs2]
 [<c016e4c0>] dput+0x34/0x1a7
 [<c01668c8>] link_path_walk+0x94/0xbe
 [<c01672e3>] open_namei+0x99/0x579
 [<f8ca7625>] ocfs2_inode_revalidate+0x11a/0x1f9 [ocfs2]
 [<f8ca3808>] ocfs2_getattr+0x0/0x14d [ocfs2]
 [<f8ca386b>] ocfs2_getattr+0x63/0x14d [ocfs2]
 [<f8ca3808>] ocfs2_getattr+0x0/0x14d [ocfs2]
 [<c0161fa2>] vfs_getattr+0x35/0x88
 [<c016201d>] vfs_stat+0x28/0x3a
 [<c01672e3>] open_namei+0x99/0x579
 [<c015990b>] filp_open+0x66/0x70
 [<c0162612>] sys_stat64+0xf/0x23
 [<c02d0ca2>] __cond_resched+0x14/0x39
 [<c01c23c2>] direct_strncpy_from_user+0x3e/0x5d
 [<c0159c7f>] sys_open+0x6a/0x7d
 [<c02d268f>] syscall_call+0x7/0xb

Thanks,

Nathan

--
nre :wq
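For context on the O2CB_HEARTBEAT_THRESHOLD value mentioned above: as documented for OCFS2 1.2, the disk heartbeat is declared dead after (threshold - 1) * 2 seconds, and the FAQ's rule of thumb is threshold = (timeout in seconds) / 2 + 1, so 31 corresponds to a 60-second window matching the HBA timeout. The small sketch below only illustrates that arithmetic; the function names are made up for this example and are not part of any OCFS2 tooling.

    # Sketch of the O2CB heartbeat threshold arithmetic (OCFS2 1.2 mapping:
    # timeout_seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2). Names are
    # illustrative only.

    def heartbeat_timeout_seconds(threshold: int) -> int:
        """Seconds a node may miss disk heartbeats before it self-fences."""
        return (threshold - 1) * 2

    def threshold_for_hba_timeout(hba_timeout_seconds: int) -> int:
        """Threshold suggested by the documented formula: timeout / 2 + 1."""
        return hba_timeout_seconds // 2 + 1

    assert threshold_for_hba_timeout(60) == 31    # 60 s HBA timeout -> 31
    assert heartbeat_timeout_seconds(31) == 60    # 31 -> 60 s fencing window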
Do you have the full oops trace?

Nathan Ehresman wrote:
> I have a strange OCFS2 problem that has been plaguing me. I have 2
> separate OCFS2 clusters, each consisting of 3 machines.
> [...]