thr3ads.net - Ocfs2 users - [Ocfs2-users] cluster rebooting [Mar 2009]

andrew at temporalspaces.com

2009-Mar-13 19:01 UTC

[Ocfs2-users] cluster rebooting

Hi-

I have a 16-node cluster in which all of the nodes have been rebooting.
I received a segfault from multipathd on one node, and then every node in
the cluster rebooted. Here are the messages that appeared on all nodes:

Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B24F4E67EBB34CAA99690B112FA6D50E
Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Nodes in domain ("B24F4E67EBB34CAA99690B112FA6D50E"): 0 1 2 3 5 6 7 9 10 13 15 17
Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Node 8 leaves domain F575B164F63E4E888004C70D9F84D779
Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Nodes in domain ("F575B164F63E4E888004C70D9F84D779"): 0 1 2 3 5 6 7 9 10 13 15 16 17
Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Node 8 leaves domain A70D0DC186724FF388CDE65EC540C444
Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Nodes in domain ("A70D0DC186724FF388CDE65EC540C444"): 0 1 2 3 5 6 7 9 10
Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B31B07823153433C948F63199CE4A31C
Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Nodes in domain ("B31B07823153433C948F63199CE4A31C"): 0 1 2 3 5 6 7 9 10
Mar 13 13:31:11 bws01 xinetd[4934]: START: nrpe pid=1065 from=10.10.8.20
Mar 13 13:31:11 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1065 duration=0(sec)
Mar 13 13:32:27 bws01 kernel: o2net: connection to node bapp05 (num 8) at 10.10.16.15:7777 has been idle for 30.0 seconds, shutting it down.
Mar 13 13:32:27 bws01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help debug the situation: (tmr 1236965517.208305 now 1236965547.207461 dr 1236965517.208295 adv 1236965517.208311:1236965517.208312 func (ee9d109e:513) 1236965445.298207:1236965445.298219)
Mar 13 13:32:27 bws01 kernel: o2net: no longer connected to node bapp05 (num 8) at 10.10.16.15:7777
Mar 13 13:32:55 bws01 xinetd[4934]: START: nrpe pid=1068 from=10.10.8.20
Mar 13 13:32:55 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1068 duration=0(sec)
Mar 13 13:32:57 bws01 kernel: (4586,0):o2net_connect_expired:1637 ERROR: no connection established with node 8 after 30.0 seconds, giving up and returning errors.
Mar 13 13:33:00 bws01 kernel: (4586,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 8
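
As a side check, the epoch values in the o2net_idle_timer line can be
compared against the reported 30-second idle period. This is a minimal
sketch, assuming that `tmr` is when the idle timer was last armed and `now`
is when it fired; that reading of the field names is an assumption and is
not confirmed anywhere in this thread. The helper name `idle_seconds` is
hypothetical.

```python
import re
from decimal import Decimal

# The o2net_idle_timer message from the log above, joined onto one line.
LINE = (
    "Mar 13 13:32:27 bws01 kernel: (0,0):o2net_idle_timer:1476 here are some "
    "times that might help debug the situation: (tmr 1236965517.208305 now "
    "1236965547.207461 dr 1236965517.208295 adv "
    "1236965517.208311:1236965517.208312 func (ee9d109e:513) "
    "1236965445.298207:1236965445.298219)"
)


def idle_seconds(line):
    """Return now - tmr from an o2net_idle_timer line, or None if absent.

    Assumption: tmr and now are epoch seconds with microsecond precision.
    Decimal is used so the subtraction is exact.
    """
    m = re.search(r"\(tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        return None
    tmr, now = (Decimal(g) for g in m.groups())
    return now - tmr


print(idle_seconds(LINE))
```

The difference comes out just under 30 seconds, which agrees with the
"has been idle for 30.0 seconds" message logged at the same time.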

Why would this cause all nodes in the cluster to reboot? Seems to me that 
it should have kicked out node 8 only...

thanks
Andrew

Sunil Mushran

2009-Mar-14 05:43 UTC

[Ocfs2-users] cluster rebooting

Impossible to determine the cause with what you have provided. File a
bugzilla and attach messages from all nodes. No exceptions. If you have
netconsole set up (you should), attach those logs as well. That way we'll
know whether the nodes oopsed and, if so, what the stack was.
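
When the messages from all nodes are collected, it can help to interleave
them by timestamp before reading them. This is a minimal sketch, assuming
each node's syslog is available as a list of lines in the classic
`Mon DD HH:MM:SS host ...` format, that every entry sits on a single line,
and that all entries fall in the same year. The names `parse_stamp` and
`merge_logs` and the `year` parameter are hypothetical, and the second
node's line in the usage example is a made-up placeholder, not real output.

```python
from datetime import datetime


def parse_stamp(line, year=2009):
    """Parse the leading 'Mar 13 13:30:27' syslog timestamp.

    Syslog omits the year, so it is supplied as an assumption.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def merge_logs(*node_logs, year=2009):
    """Interleave lines from several nodes in timestamp order.

    sorted() is stable, so lines with equal timestamps keep the order
    in which the node logs were passed in.
    """
    lines = [ln for log in node_logs for ln in log if ln.strip()]
    return sorted(lines, key=lambda ln: parse_stamp(ln, year))


# Usage: two lines from bws01 (taken from this thread) and one
# hypothetical placeholder line standing in for another node.
bws01 = [
    "Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B24F4E67EBB34CAA99690B112FA6D50E",
    "Mar 13 13:32:27 bws01 kernel: o2net: no longer connected to node bapp05 (num 8) at 10.10.16.15:7777",
]
other = [
    "Mar 13 13:31:00 nodeX placeholder: hypothetical line from another node",
]
merged = merge_logs(bws01, other)
for entry in merged:
    print(entry)
```

Sorting on the parsed timestamp rather than on the raw text avoids
misordering across month boundaries, at the cost of having to assume the
year.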

Sunil

On Fri, Mar 13, 2009 at 03:01:57PM -0400, andrew at temporalspaces.com wrote:
> Hi-
> 
> I have a 16 node cluster that has been rebooting all nodes in the cluster. 
> I
> recevied a seg-fault from multipathd on one node and then all nodes in the 
> cluster
> rebooted. Here is the error message that appeared on all nodes:
> 
> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Node 8 leaves domain
> B24F4E67EBB34CAA99690B112FA6D50E
> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Nodes in domain
> ("B24F4E67EBB34CAA99690B112FA6D50E"): 0 1 2 3 5 6 7 9 10 13 15 17
> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Node 8 leaves domain
> F575B164F63E4E888004C70D9F84D779
> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Nodes in domain
> ("F575B164F63E4E888004C70D9F84D779"): 0 1 2 3 5 6 7 9 10 13 15 16 17
> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Node 8 leaves domain
> A70D0DC186724FF388CDE65EC540C444
> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Nodes in domain
> ("A70D0DC186724FF388CDE65EC540C444"): 0 1 2 3 5 6 7 9 10
> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Node 8 leaves domain
> B31B07823153433C948F63199CE4A31C
> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Nodes in domain
> ("B31B07823153433C948F63199CE4A31C"): 0 1 2 3 5 6 7 9 10
> Mar 13 13:31:11 bws01 xinetd[4934]: START: nrpe pid=1065 from=10.10.8.20
> Mar 13 13:31:11 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1065 
> duration=0(sec)
> Mar 13 13:32:27 bws01 kernel: o2net: connection to node bapp05 (num 8) at
> 10.10.16.15:7777 has been idle for 30.0 seconds, shutting it down.
> Mar 13 13:32:27 bws01 kernel: (0,0):o2net_idle_timer:1476 here are some 
> times that
> might help debug the situation: (tmr 1236965517.208305 now 
> 1236965547.207461 dr
> 1236965517.208295 adv 1236965517.208311:1236965517.208312 func 
> (ee9d109e:513)
> 1236965445.298207:1236965445.298219)
> Mar 13 13:32:27 bws01 kernel: o2net: no longer connected to node bapp05 
> (num 8) at
> 10.10.16.15:7777
> Mar 13 13:32:55 bws01 xinetd[4934]: START: nrpe pid=1068 from=10.10.8.20
> Mar 13 13:32:55 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1068 
> duration=0(sec)
> Mar 13 13:32:57 bws01 kernel: (4586,0):o2net_connect_expired:1637 ERROR: 
> no
> connection established with node 8 after 30.0 seconds, giving up and 
> returning
> errors.
> Mar 13 13:33:00 bws01 kernel: (4586,0):ocfs2_dlm_eviction_cb:98 device 
> (253,0): dlm
> has evicted node 8
> 
> Why would this cause all nodes in the cluster to reboot? Seems to me that 
> it should have kicked out node 8 only...
> 
> thanks
> Andrew
