It is impossible to determine the cause from what you have provided. File a
bugzilla and attach the messages files from all nodes. No exceptions. If you
have netconsole set up (you should), attach those logs too. That way we'll
know whether the nodes oopsed and, if so, what the stack was.
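
For a quick look before filing, something like the following may help pick the
cluster-related kernel lines out of each node's messages file. This is only a
sketch: the function name `cluster_lines` and the choice of substrings are
assumptions made for illustration, and it is no substitute for attaching the
complete, unfiltered files.

```python
def cluster_lines(lines):
    """Keep kernel log lines that mention the OCFS2 cluster stack."""
    # Assumed set of interesting substrings, taken from the quoted log below.
    tags = ("ocfs2", "o2net")
    return [
        line.rstrip("\n")
        for line in lines
        if "kernel:" in line and any(tag in line for tag in tags)
    ]
```

It could be run over each node's log, e.g.
`cluster_lines(open("/var/log/messages"))`, where the path is the usual syslog
default and may differ on your systems.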
Sunil
On Fri, Mar 13, 2009 at 03:01:57PM -0400, andrew at temporalspaces.com wrote:
> Hi-
>
> I have a 16-node cluster in which all nodes have been rebooting. I
> received a segfault from multipathd on one node, and then all nodes in the
> cluster rebooted. Here is the error message that appeared on all nodes:
>
> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B24F4E67EBB34CAA99690B112FA6D50E
> Mar 13 13:30:27 bws01 kernel: ocfs2_dlm: Nodes in domain ("B24F4E67EBB34CAA99690B112FA6D50E"): 0 1 2 3 5 6 7 9 10 13 15 17
> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Node 8 leaves domain F575B164F63E4E888004C70D9F84D779
> Mar 13 13:30:33 bws01 kernel: ocfs2_dlm: Nodes in domain ("F575B164F63E4E888004C70D9F84D779"): 0 1 2 3 5 6 7 9 10 13 15 16 17
> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Node 8 leaves domain A70D0DC186724FF388CDE65EC540C444
> Mar 13 13:30:39 bws01 kernel: ocfs2_dlm: Nodes in domain ("A70D0DC186724FF388CDE65EC540C444"): 0 1 2 3 5 6 7 9 10
> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Node 8 leaves domain B31B07823153433C948F63199CE4A31C
> Mar 13 13:30:45 bws01 kernel: ocfs2_dlm: Nodes in domain ("B31B07823153433C948F63199CE4A31C"): 0 1 2 3 5 6 7 9 10
> Mar 13 13:31:11 bws01 xinetd[4934]: START: nrpe pid=1065 from=10.10.8.20
> Mar 13 13:31:11 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1065 duration=0(sec)
> Mar 13 13:32:27 bws01 kernel: o2net: connection to node bapp05 (num 8) at 10.10.16.15:7777 has been idle for 30.0 seconds, shutting it down.
> Mar 13 13:32:27 bws01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help debug the situation: (tmr 1236965517.208305 now 1236965547.207461 dr 1236965517.208295 adv 1236965517.208311:1236965517.208312 func (ee9d109e:513) 1236965445.298207:1236965445.298219)
> Mar 13 13:32:27 bws01 kernel: o2net: no longer connected to node bapp05 (num 8) at 10.10.16.15:7777
> Mar 13 13:32:55 bws01 xinetd[4934]: START: nrpe pid=1068 from=10.10.8.20
> Mar 13 13:32:55 bws01 xinetd[4934]: EXIT: nrpe status=0 pid=1068 duration=0(sec)
> Mar 13 13:32:57 bws01 kernel: (4586,0):o2net_connect_expired:1637 ERROR: no connection established with node 8 after 30.0 seconds, giving up and returning errors.
> Mar 13 13:33:00 bws01 kernel: (4586,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 8
>
> Why would this cause all nodes in the cluster to reboot? Seems to me that
> it should have kicked out node 8 only...
>
> thanks
> Andrew