thr3ads.net - Ocfs2 users - [Ocfs2-users] o2net: connect to node has been idle for 10 secs [Aug 2006]


Andy Phillips

2006-Aug-03 04:41 UTC

[Ocfs2-users] o2net: connect to node has been idle for 10 secs

Hello,

   I have a two-node 10gR2 RAC cluster on a pair of Sun Opteron boxes:
Red Hat AS 4.3, kernel 2.6.9-34.0.1.ELsmp x86_64, OCFS2 1.2.2. RAC is
using ASM to talk to the data files, but we have three OCFS2 filesystems
up to share DBA files and the usual bits and bobs.

   Things were fine until, on a mostly idle system, this happened out
of the blue:

Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1154545576.798263 now
1154545586.796978 dr 1154545576.798238 adv
1154545576.798291:1154545576.798293 func (06aac8a1:1)
1154545566.800782:1154545566.800787)
Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
(num 0) at 172.16.6.10:7777
Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
fencing this node because it is connected to
a half-quorum of 1 out of 2 nodes which doesn't include the lowest
active node 0
Aug  2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
stopping heartbeat on all active regions.

   And the node then halted. 
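
The o2quo_make_decision message above suggests a tie-break rule for clusters with an even number of nodes: a node that can reach only half of the cluster survives only if that half contains the lowest-numbered active node. The following is a minimal sketch of that reading, not the o2quo source. The function name, the parameters, and the odd-cluster branch (a plain majority) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified quorum check, NOT the o2quo source.
 * heartbeating:     nodes seen heartbeating, including this one
 * connected:        nodes reachable over the network, including this one
 * lowest_reachable: true if the lowest-numbered heartbeating node is
 *                   reachable (or is this node itself)
 * Returns true if this node should fence itself.
 */
static bool should_fence(int heartbeating, int connected, bool lowest_reachable)
{
    assert(heartbeating >= 1 && connected >= 1 && connected <= heartbeating);

    if (heartbeating % 2 == 1) {
        /* Odd cluster (assumed rule): a strict majority is required. */
        int quorum = (heartbeating + 1) / 2;
        return connected < quorum;
    }

    /* Even cluster: exactly half is enough only for the half that
     * contains the lowest-numbered node. */
    int half = heartbeating / 2;
    if (connected < half)
        return true;
    return connected == half && !lowest_reachable;
}
```

With the values from the log, should_fence(2, 1, false) is true for fred, which sees only itself and cannot reach node 0. Under the same reading barney would evaluate should_fence(2, 1, true), which is false, so barney stays up.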

   Barney is node 0. The systems were idle. We've hammered the ocfs2 
file systems, and set o2cb_heartbeat_threshold to 61. All is good and
stable under heavy i/o.
   
   The interconnect is a bonded interface, with two gig cards, each
connected (with flow control on) to two separate FESX424 switches.
The switches don't register any problems at this time, nor does Linux
register any interface issues.

   I'm looking at the source code at the moment, but nothing is leaping
out at me. Any ideas? Do the timer debug lines above mean anything to
anyone?
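
As a starting point for reading the o2net_idle_timer line, the gaps between the logged timestamps can be worked out directly. The sketch below assumes each value is seconds followed by a six-digit microsecond part, that "tmr" is when the idle timer was last armed, "dr" is when data last arrived, and the second "func" value is when the last message handler finished. Those label meanings are a reading of the log, not something the log itself confirms. The struct and function names are hypothetical.

```c
#include <assert.h>

/* A timestamp as printed in the log: seconds and microseconds. */
struct log_time {
    long long sec;
    long long usec;
};

/* Microseconds elapsed from 'earlier' to 'later'. */
static long long elapsed_usec(struct log_time earlier, struct log_time later)
{
    long long d = (later.sec - earlier.sec) * 1000000LL
                + (later.usec - earlier.usec);
    assert(d >= 0); /* the arguments must be in chronological order */
    return d;
}

/* Values copied from the o2net_idle_timer line in the log. */
static const struct log_time t_tmr       = { 1154545576LL, 798263LL };
static const struct log_time t_now       = { 1154545586LL, 796978LL };
static const struct log_time t_dr        = { 1154545576LL, 798238LL };
static const struct log_time t_func_stop = { 1154545566LL, 800787LL };
```

Here elapsed_usec(t_tmr, t_now) is 9998715 and elapsed_usec(t_dr, t_now) is 9998740, so on this reading nothing arrived from barney for just under 10 seconds after the timer was last armed. elapsed_usec(t_func_stop, t_now) is 19996191, so the last handler ran roughly 20 seconds before the shutdown.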

  Thanks
   Andy 

    

   
  

-- 
Andy Phillips, FRAS
Systems Architect, Information Systems.
Betfair.com   

Direct Line: 0208 834 8436

Betfair Limited (Company No.5140986), Winslow Road, Hammersmith
Embankment, London W6 9HP, United Kingdom, +44 208 834 8000, +44 208 834
8501 (direct). The information in this e-mail and any attachment is
confidential, may contain legal advice protected by privilege and is
intended only for the named recipient(s). The e-mail may not be
disclosed or used by any person other than the addressee, nor may it be
copied in any way. If you are not a named recipient please notify the
sender immediately and delete any copies of this message. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden. Any view or opinions presented are solely
those of the author and do not necessarily represent those of the
company.

Andy Phillips

2006-Aug-03 05:24 UTC


[Ocfs2-users] o2net: connect to node has been idle for 10 secs

Hello,

   Apologies for following up on myself.

In ocfs2/cluster/tcp_internal.h:
#define O2NET_KEEPALIVE_DELAY_SECS      5
#define O2NET_IDLE_TIMEOUT_SECS         10

   Is this really sensible? Assuming that o2net_sc_send_keep_req is
the only thing keeping an idle connection alive, then given a small
variance in system clocks, the loss of a single keepalive packet could
cause a node to self-fence and reboot.

   Would
#define O2NET_KEEPALIVE_DELAY_SECS      5
#define O2NET_IDLE_TIMEOUT_SECS         20

   cause any problems?
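
One rough way to compare the two settings is to look at how much time remains for a keepalive to get through once it has been sent. The sketch below assumes, as the question above does, that a single keepalive goes out after O2NET_KEEPALIVE_DELAY_SECS of silence and must produce traffic before O2NET_IDLE_TIMEOUT_SECS expires. The retransmission model (a fixed first timeout that doubles on every retry) and the 200 ms figure used below are illustrative assumptions, not measurements from this cluster, and the function names are hypothetical.

```c
#include <assert.h>

/* Seconds left for the keepalive round trip after it has been sent. */
static int keepalive_slack_secs(int keepalive_delay_secs, int idle_timeout_secs)
{
    assert(keepalive_delay_secs > 0 && idle_timeout_secs > keepalive_delay_secs);
    return idle_timeout_secs - keepalive_delay_secs;
}

/*
 * Number of retransmissions that fit in slack_ms if the first one fires
 * after rto_ms and every later wait doubles (hypothetical backoff model).
 */
static int retransmits_within(long slack_ms, long rto_ms)
{
    assert(slack_ms >= 0 && rto_ms > 0);
    int count = 0;
    long elapsed = 0;
    long wait = rto_ms;
    while (elapsed + wait <= slack_ms) {
        elapsed += wait;  /* this retransmission goes out in time */
        wait *= 2;        /* back off before the next attempt */
        count++;
    }
    return count;
}
```

With the current values, keepalive_slack_secs(5, 10) is 5, and retransmits_within(5000, 200) is 4. With the proposed values, keepalive_slack_secs(5, 20) is 15, and retransmits_within(15000, 200) is 6. Under this model the larger timeout triples the slack but buys only two extra retries, because each retry waits twice as long as the one before.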

   Andy

