Ulf Zimmermann
2007-Jul-29 14:16 UTC
[Ocfs2-users] 6 node cluster with unexplained reboots
We just installed a new cluster with 6 HP DL380g5, dual single port Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a 3Par S400. We are using the 3Par recommended config for the Qlogic driver and device-mapper-multipath, giving us 4 paths to the SAN. We do see some SCSI errors where DM-MP is failing a path after getting a 0x2000 error from the SAN controller, but the path gets put back in service in less than 10 seconds.

This needs to be fixed, but I don't think it is what is causing our reboots. Two of the nodes rebooted once while idle (ocfs2 and clusterware were running, no db); one node rebooted once while idle (while another node was copying our 9i db from ocfs1 to the ocfs2 data volume using fscat) and once while some load was put on it via the upgraded 10g database. In all cases it is as if someone pressed a hardware reset button. There is no kernel panic (at least not one leading to a stop with a visible message), and we do see a dirty write cache on the internal cciss controller.

The only messages we get on the other nodes come once the crashed node is already in reset and has missed its ocfs2 heartbeat (threshold set to the default of 7), followed later by CRS moving the VIP.

Any hints on troubleshooting this would be appreciated.

Regards, Ulf.

--------------------------
Sent from my BlackBerry Wireless Handheld
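For reference, the path state and the heartbeat setting mentioned above can be checked with something like the following. This is only a sketch, assuming a RHEL-style layout with dm-multipath and the stock o2cb init script; file locations may differ on other distributions.

    # show the multipath topology and per-path state (active/failed)
    multipath -ll

    # o2cb heartbeat dead threshold (the "default of 7" referred to above)
    grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb
    service o2cb status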
Do you have a netconsole setup? If not, set it up. That will capture the real reason for the reset. Well, it typically does.

Ulf Zimmermann wrote:
> We just installed a new cluster with 6 HP DL380g5, dual single port Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a 3Par S400. [...]
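A minimal netconsole setup, using placeholder values (sending node 192.168.1.10 on eth0, log receiver 192.168.1.20 with MAC 00:11:22:33:44:55, UDP port 6666), looks something like this:

    # on each cluster node: stream kernel console messages over UDP to the receiver
    modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55

    # on the receiving host: capture whatever the node prints as it goes down
    nc -u -l -p 6666 | tee netconsole.log    # or "nc -u -l 6666", depending on the netcat variant

    # on the sending node: make sure oops/panic text is actually sent to the console
    dmesg -n 8

With that in place, the last kernel messages before the reset end up on the receiver even when nothing survives on the crashed node itself.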