Mailing List SVR wrote:
> Hi,
>
> periodically one of on my two nodes cluster is fenced here are the logs:
>
> Jan 14 07:01:44 nvr1-rc kernel: o2net: no longer connected to node nvr2-
> rc.minint.it (num 0) at 1.1.1.6:7777
> Jan 14 07:01:44 nvr1-rc kernel: (21534,1):dlm_do_master_request:1334 ERROR:
> link to 0 went down!
> Jan 14 07:01:44 nvr1-rc kernel: (4007,4):dlm_send_proxy_ast_msg:458 ERROR:
> status = -112
> Jan 14 07:01:44 nvr1-rc kernel: (4007,4):dlm_flush_asts:600 ERROR: status =
> -112
> Jan 14 07:01:44 nvr1-rc kernel: (21534,1):dlm_get_lock_resource:917 ERROR:
> status = -112
> Jan 14 07:02:19 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no
> connection established with node 0 after 35.0 seconds, giving up and returning
> errors.
> Jan 14 07:02:54 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no
> connection established with node 0 after 35.0 seconds, giving up and returning
> errors.
> Jan 14 07:03:10 nvr1-rc kernel: (4007,4):dlm_send_proxy_ast_msg:458 ERROR:
> status = -107
> Jan 14 07:03:10 nvr1-rc kernel: (4007,4):dlm_flush_asts:600 ERROR: status =
> -107
> Jan 14 07:03:29 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no
> connection established with node 0 after 35.0 seconds, giving up and returning
> errors.
> Jan 14 07:03:50 nvr1-rc kernel: (31,5):o2quo_make_decision:146 ERROR: fencing
> this node because it is connected to a half-quorum of 1 out of 2 nodes which
> doesn't include the lowest active node 0
> Jan 14 07:03:50 nvr1-rc kernel: (31,5):o2hb_stop_all_regions:1967 ERROR:
> stopping heartbeat on all active regions.
>
> I'm sure there are no network connectivity problems, but it is possible that
> there are heavy IO loads. Is this the intended behaviour? Why is the loaded
> node fenced under heavy load?
>
> I'm using ocfs2-1.4.4 on rhel5 kernel-2.6.18-164.6.1.el5
So the network connection snapped. What that means is that the nodes could
not ping each other for 35 seconds. In fact, node 1 (this one) kept trying to
reconnect to node 0 but got no reply back, so the network issue lasted for
over 2 minutes.
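
For reference, the 35-second window in those connect_expired messages is the
configured O2CB network idle timeout. Assuming the stock ocfs2-tools 1.4
layout on EL5, you can check what both nodes are using with something like
this (the values shown are only illustrative; they must match on both nodes):

    # service o2cb status
    # grep -E 'THRESHOLD|TIMEOUT|DELAY' /etc/sysconfig/o2cb
    O2CB_HEARTBEAT_THRESHOLD=31
    O2CB_IDLE_TIMEOUT_MS=35000
    O2CB_KEEPALIVE_DELAY_MS=2000
    O2CB_RECONNECT_DELAY_MS=2000

Raising the idle timeout only papers over whatever is stalling the node or
the network, so treat it as a workaround, not a fix.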
The switch could be one culprit; see if the switch logs say anything. Another
possibility is that node 0 was paging heavily, or that kswapd was pegged at
100%. That is hard to determine after the fact, but it is something to keep
in mind the next time you see the same issue. If that turns out to be the
case, it needs to be fixed: maybe add more memory, or, if you are running a
database, ensure you are using hugepages, etc.
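
If you want to catch it in the act next time, one low-tech sketch is to leave
a couple of collectors running on both nodes and correlate them with the
fence time afterwards (paths and intervals below are just examples; sar
assumes the sysstat package is installed):

    # vmstat 5 >> /var/log/vmstat.log &
    # sar -B 5 >> /var/log/sar-paging.log &
    # top -b -d 5 | grep -E '^top|kswapd' >> /var/log/kswapd.log &

Sustained swap-in/swap-out in vmstat, heavy page scanning in sar -B, or
kswapd near 100% CPU around 07:01-07:03 would point at memory pressure
rather than the network.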