thr3ads.net - Ocfs2 users - [Ocfs2-users] Cluster lockup when one node fails [May 2009]

Kees Hoekzema

2009-May-27 16:20 UTC

[Ocfs2-users] Cluster lockup when one node fails

Hello List,

At the moment I'm running a 7-node ocfs2 cluster on a Dell MD3000i (iSCSI)
NAS. This cluster has run fine for well over a year, but recently one of
the older, less stable servers in the cluster has started to fail from
time to time.

It is not a big problem in itself that this particular server reboots. The
problem is that when it fails, the whole cluster becomes unusable until
that node has rebooted and rejoined.

Today that server crashed again. The other nodes showed the following in
their dmesg output:

May 27 16:45:03 aphaea kernel: o2net: connection to node achelois (num 5) at 10.0.1.24:7777 has been idle for 10.0 seconds, shutting it down.
(0,3):o2net_idle_timer:1468 here are some times that might help debug the situation: (tmr 1243435493.522086 now 1243435503.520354 dr 1243435493.522080 adv 1243435493.522090:1243435493.522091 func (6169a8d1:502) 1243435148.2972:1243435148.2999)
o2net: no longer connected to node achelois (num 5) at 10.0.1.24:7777
(3762,1):dlm_do_master_request:1335 ERROR: link to 5 went down!
(3762,1):dlm_get_lock_resource:912 ERROR: status = -112
(5196,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(5196,3):dlm_get_lock_resource:912 ERROR: status = -107
(735,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(735,3):dlm_get_lock_resource:912 ERROR: status = -107
(21573,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(21573,3):dlm_get_lock_resource:912 ERROR: status = -107
(2825,3):o2net_connect_expired:1629 ERROR: no connection established with node 5 after 10.0 seconds, giving up and returning errors.
(1916,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(1916,3):dlm_get_lock_resource:912 ERROR: status = -107
..
[and a lot more similar errors]
..
May 27 17:14:45 aphaea kernel: (2825,3):o2dlm_eviction_cb:258 o2dlm has evicted node 5 from group 20AB0E216A25479A986F8FDFE574C640
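
For anyone reading the numbers in the log above, here is a minimal Python sketch that decodes the "status = -N" values and the gap between the "tmr" and "now" stamps. The errno table is hard-coded from the usual Linux values (107 = ENOTCONN, 112 = EHOSTDOWN), which assumes a common Linux architecture. The helper names decode_status and idle_seconds are made up for illustration and are not part of ocfs2. What "tmr" means exactly is not stated in the log; the sketch only subtracts the two stamps.

```python
import re

# Linux errno values, hard-coded (assumption: common Linux architectures)
# so the result does not depend on the platform running this script.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(line):
    """Return (errno name, description) for a 'status = -N' line, else None."""
    m = re.search(r"status = -(\d+)", line)
    if m is None:
        return None
    return LINUX_ERRNO.get(int(m.group(1)), ("UNKNOWN", "not in this table"))


def idle_seconds(line):
    """Return 'now' minus 'tmr' (in seconds) from an o2net_idle_timer line."""
    m = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        return None
    return float(m.group(2)) - float(m.group(1))


if __name__ == "__main__":
    print(decode_status("dlm_get_lock_resource:912 ERROR: status = -112"))
    print(decode_status("dlm_get_lock_resource:912 ERROR: status = -107"))
    gap = idle_seconds("(tmr 1243435493.522086 now 1243435503.520354 dr")
    print(round(gap, 3))  # just under the 10.0 seconds reported by o2net
```

Both status values point at the network link to node 5 being gone rather than at a problem with the shared storage, which is consistent with the "link to 5 went down!" lines.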

The faulty node was completely frozen, so most likely ocfs2 never even got
the chance to trigger a kernel panic that would have made it reboot.

After we rebooted the node, the cluster became available again. However, it
still prevented the other 6 servers from accessing the shared storage for
almost 30 minutes.
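
The "almost 30 minutes" can be checked against the two syslog timestamps in the log excerpt (connection shut down at 16:45:03, node evicted at 17:14:45). A minimal Python sketch, assuming both stamps are from the same day and year (2009, taken from the date of this post); lockout_duration is a hypothetical helper name:

```python
from datetime import datetime


def lockout_duration(first_stamp, last_stamp, year=2009):
    """Return the time between two syslog-style stamps as a timedelta.

    Syslog stamps carry no year, so one is supplied (assumption: 2009).
    The %b month name assumes an English/C locale.
    """
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {first_stamp}", fmt)
    end = datetime.strptime(f"{year} {last_stamp}", fmt)
    return end - start


if __name__ == "__main__":
    gap = lockout_duration("May 27 16:45:03", "May 27 17:14:45")
    print(gap)  # 0:29:42
```

So roughly 29 minutes and 42 seconds passed between o2net dropping the connection and o2dlm evicting node 5.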

Is there a way to 'evict' a node faster, so that normal read/write
operations can continue without it?
Or is it at least possible to let read operations continue instead of
being locked out as well?

Tia,
Kees Hoekzema

Sunil Mushran

2009-May-27 18:02 UTC

[Ocfs2-users] Cluster lockup when one node fails

Which kernel version and ocfs2 version are you running?

$ uname -a
$ modinfo ocfs2
$ rpm -qa | grep ocfs2

