John McNulty
2009-Aug-13 13:44 UTC
[Ocfs2-users] Shutdown to single user mode causes SysRq Reset
Hello,

I've got a 2 node HP DL580 cluster supported by a Fibrechannel SAN with dual FC cards, dual switches and an HP EVA on the back end. All SAN disks are multipathed. Installed software is:

Redhat 5.3
ocfs2-2.6.18-128.1.14.el5-1.4.2-1.el5
ocfs2-tools-1.4.2-1.el5
ocfs2console-1.4.2-1.el5
Oracle RAC 11g ASM
Oracle RAC 11g Clusterware
Oracle RAC 10g databases

OCFS2 isn't being used by RAC, we're using ASM for that, but OCFS2 is used to provide a shared /usr/local, /home and /apps.

Yesterday I discovered something very unexpected. I shut down node B to single user mode, and immediately node A crashed. The only message on the console was SysRq Resetting. Node A then rebooted normally. When I exited single user mode on node B to jump back up to run level 3, the system started up OK, but no sooner had I got the login prompt on the console than it too crashed with SysRq Resetting.

I repeated the steps a second time and it did exactly the same thing all over again. It appears to be repeatable.

The only thing that jumped out at me watching the consoles while this was going on was that node B fails to stop the OCFS2 service on shutdown, even going so far as to tell me after the fact with an "eeeeeee" message. I assume that's bad!

There were no other console messages to give me a clue, so this is my starting point. Anyone got any ideas?

Oh, there's one other thing that may or may not be relevant. On this cluster, and another identical cluster, mounted.ocfs2 -f always shows the node B cluster member as "Unknown" instead of the system name. As far as I'm aware I've followed the OCFS2 setup to the letter (it's not complicated), and "o2cb_ctl -It node" on either node shows both systems with all the correct details. Both nodes mount the cluster filesystems OK and work just fine.

I've not had a chance to try my single user test on the other identical cluster yet as I've not been able to get a downtime window for it. If I do, then I will.

Rgds,

John
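[Editor's sketch: one way to cross-check the "Unknown" node name is to compare what the cluster stack knows about the nodes with what mounted.ocfs2 reads from the on-disk slot map. The device path below is a placeholder for one of the multipathed OCFS2 LUNs; the rest are standard o2cb/ocfs2-tools commands.]

    # Cluster layout as o2cb sees it on this node
    o2cb_ctl -It node
    cat /etc/ocfs2/cluster.conf

    # Node numbers registered with the running cluster stack (configfs)
    ls /sys/kernel/config/cluster/*/node/

    # Which nodes currently hold slots on the shared volume
    mounted.ocfs2 -f /dev/mapper/ocfs2lv    # placeholder device

If the node numbers or names differ between the two nodes' cluster.conf files, that would explain mounted.ocfs2 reporting "Unknown" for one member.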
Sunil Mushran
2009-Aug-13 17:52 UTC
[Ocfs2-users] Shutdown to single user mode causes SysRq Reset
This is a feature. ;)

If you have mounted a volume on two or more nodes, the expectation is that the private interconnect will always remain up. If you shut down the network on a node, the cluster stack will have to kill a node. It does so in order to prevent hangs in cluster operations.

In a 2 node setup, the higher node number will fence. I would imagine Node A is the higher number. But I am not sure why Node B fenced on restart. The "eeeeeee" message does not ring a bell.

If you want to get to the bottom of this, set up a netconsole server to capture the logs. Or, remember to shut down the cluster before switching to single user mode.

Sunil
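[Editor's sketch of both suggestions. The mount points match the poster's setup; the IP address, MAC address, interface name and port are placeholders, and netcat flag syntax varies between builds.]

    # On the node being taken down: stop the cluster stack cleanly first
    umount /usr/local /home /apps   # release the shared OCFS2 mounts
    service ocfs2 stop              # unmounts any remaining OCFS2 volumes
    service o2cb stop               # takes the o2cb cluster stack offline
    init 1                          # now drop to single user mode

    # Netconsole: stream kernel messages to another host before testing
    # (format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac; values are placeholders)
    modprobe netconsole netconsole=@/eth0,6666@192.168.1.50/00:16:3e:aa:bb:cc

    # On the receiving host, capture the UDP stream
    nc -u -l 6666                   # some netcat builds want: nc -u -l -p 6666

With netconsole in place, the fencing node's final kernel messages (including the o2cb/o2net timeout lines) should reach the remote host even though nothing survives the reset locally.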