Ivan Savčić | Epix
2011-Dec-20 12:18 UTC
[Ocfs2-users] OCFS2 problems when connectivity lost
Hello,

We are having a problem with a 3-node cluster based on Pacemaker/Corosync with 2 primary DRBD+OCFS2 nodes and a quorum node.

Nodes run on Debian Squeeze; all packages are from the stable branch except for Corosync (which is from backports for udpu functionality). Each node has a single network card.

When the network is up, everything works without any problems: graceful shutdown of any node works as intended and doesn't affect the remaining cluster partition.

When the network is down on one OCFS2 node, Pacemaker (no-quorum-policy="stop") tries to shut the resources down on that node, but fails to stop the OCFS2 filesystem resource, stating that it is "in use".

*Both* OCFS2 nodes (i.e. the one with the network down and the one which is still up in the partition with quorum) hang, with dmesg reporting that events, ocfs2rec and ocfs2_wq on *both* nodes are "blocked for more than 120 seconds".

When the network is operational, umount by hand works without any problems, because in the testing scenario there are no services running that keep the mountpoint busy.

The configuration we used is pretty much from the "ClusterStack/LucidTesting" document, with clone-max="2" added where needed because of the additional quorum node in comparison to that document.

NB: we have successfully reproduced this problem on three Ubuntu 11.10 Server nodes as well.

Any ideas?

Thanks,
Ivan
Ivan Savčić | Epix wrote:
> *Both* OCFS2 nodes (i.e. the one with the network down and the one which
> is still up in the partition with quorum) hang with dmesg reporting that
> events, ocfs2rec and ocfs2_wq on *both* nodes are "blocked for more than
> 120 seconds".
>

What are the timeout values? Check with:

/etc/init.d/o2cb status
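In case it helps with collecting those values, here is a minimal sketch of a helper that prints the configured O2CB timeouts. It assumes the Debian-style defaults file at /etc/default/o2cb and the classic o2cb cluster stack. With the Pacemaker user-space stack these settings may not be the ones in effect, so treat the output as a starting point only. The function name show_o2cb_timeouts is made up for this example.

```shell
# Print the O2CB timeout settings found in a defaults file.
# Assumption: Debian-style location /etc/default/o2cb; pass another path as $1.
# show_o2cb_timeouts is a hypothetical helper name, not part of ocfs2-tools.
show_o2cb_timeouts() {
    conf="${1:-/etc/default/o2cb}"
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    # Keep only the four timeout-related variables, in file order.
    grep -E '^O2CB_(HEARTBEAT_THRESHOLD|IDLE_TIMEOUT_MS|KEEPALIVE_DELAY_MS|RECONNECT_DELAY_MS)=' "$conf"
}
```

Running show_o2cb_timeouts on each OCFS2 node, together with the output of /etc/init.d/o2cb status, should show whether both nodes are configured with the same values.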