Jan Wielemaker
2010-Dec-07 15:45 UTC
[Ocfs2-users] Two-node cluster often hanging in o2hb/jdb2
Hi,

I'm pretty new to ocfs2 and a bit stuck. I have two Debian/Squeeze
(testing) machines accessing an ocfs2 filesystem over AoE. The
filesystem sits on an lvm2 volume, but I guess that is irrelevant.

Even when mostly idle, everything accessing the cluster sometimes hangs
for about 20 seconds. This happens rather frequently, say every 5
minutes; the interval seems irregular, but the duration of the hang is
quite consistent. This behaviour seems pretty much independent of the
(IO) load on the nodes (as long as it is not really high).

I ran ps, grepping for processes in D state, repeated every second on
both nodes. When hanging, both show this:

    1649 D<  o2hb-02BC250CDB  ?
    3507 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3511 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3515 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3519 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3523 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3527 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3531 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    1670 D   jbd2/dm-4-18     ?
    3535 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    1670 D   jbd2/dm-4-18     ?
    3539 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    1670 D   jbd2/dm-4-18     ?
    3543 R+  ps               -

ocfs2-tools is at version 1.4.4-3. The kernel is version
2.6.32-5-amd64. The kernel log of the mount at boot is here:

    [   18.911452] aoe: AoE v47 initialised.
    [   19.686358] fuse init (API version 7.13)
    [   29.000017] eth2: no IPv6 routers present
    [   36.212109] aoe: 003048f28d36 e1.1 vace0 has 9767625597 sectors
    [   36.212218] etherd/e1.1: unknown partition table
    [   59.715506] OCFS2 Node Manager 1.5.0
    [   59.732002] OCFS2 DLM 1.5.0
    [   59.733343] ocfs2: Registered cluster interface o2cb
    [   59.749185] OCFS2 DLMFS 1.5.0
    [   59.749304] OCFS2 User DLM kernel interface loaded
    [   65.347517] o2net: accepted connection from node eculture (num 1) at 130.37.193.11:7777
    [   67.884256] OCFS2 1.5.0
    [   67.886984] ocfs2_dlm: Nodes in domain ("02BC250CDB0A4B468F845C68BE99B90E"): 0 1
    [   67.890075] ocfs2: Mounting device (254,4) on (node 0, slot 0) with ordered data mode.

Installation and formatting are totally standard.
I've been spending quite a bit of time trying to get a clue about what
might be wrong, but so far I have failed. Today I played a fair bit with
debugfs, but I do not have enough experience to see what is odd. Dumping
all the locks showed just over 100,000 of them, which I thought might be
a lot, but posts suggest it isn't. No busy (-B) locks, or very few.

I checked cabling and low-level network activity. Seems ok.

Does anyone have similar experiences and/or an idea where to look?

Thanks --- Jan
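[Editor's note: the once-per-second ps that Jan describes can be scripted as below. This is a minimal sketch, not Jan's actual command line; the iteration count is arbitrary (in practice you would run it in a loop until the hang occurs).]

```shell
#!/bin/sh
# Poll for processes in uninterruptible sleep (D state) once per second.
# A STAT field beginning with 'D' means the task is blocked inside the
# kernel, typically waiting on IO -- exactly what o2hb and jbd2 show
# during the hangs described above.
for i in 1 2 3; do          # bump the count (or loop forever) in practice
    ps -eo pid,stat,comm | awk '$2 ~ /^D/'
    sleep 1
done
```

Running this on both nodes simultaneously makes it easy to see whether the o2hb heartbeat thread and the jbd2 journal thread block at the same moments on both sides, which would point at the shared AoE target rather than either node.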
Sunil Mushran
2010-Dec-07 17:07 UTC
[Ocfs2-users] Two-node cluster often hanging in o2hb/jdb2
Check the kernel stack of the D state processes:

    cat /proc/PID/stack

The kernel stack will tell us where it is waiting. My guess is that the
io stack is slow. Slow ios appear as temporary hangs to the users.

On 12/07/2010 07:45 AM, Jan Wielemaker wrote:
> [snip]
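[Editor's note: Sunil's suggestion can be automated for every D-state task at once. A minimal sketch, assuming root privileges and a kernel with /proc/PID/stack support (CONFIG_STACKTRACE, available since 2.6.29, so it applies to the 2.6.32 kernel in this thread):]

```shell
#!/bin/sh
# Dump the kernel stack of every task currently in D state.
# Reading /proc/PID/stack requires root; errors for tasks that exit
# between the ps and the cat are silenced.
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```

Captured during a hang, the stacks show whether o2hb and jbd2 are waiting in the block layer (pointing at the AoE/network path) or in the DLM (pointing at cluster locking).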