Jan Wielemaker
2010-Dec-07 15:45 UTC
[Ocfs2-users] Two-node cluster often hanging in o2hb/jdb2
Hi,

I'm pretty new to ocfs2 and a bit stuck. I have two Debian/Squeeze
(testing) machines accessing an ocfs2 filesystem over AoE. The
filesystem sits on an lvm2 volume, but I guess that is irrelevant.

Even when mostly idle, everything accessing the cluster sometimes hangs
for about 20 seconds. This happens rather frequently, say every 5
minutes; the interval seems irregular, but the duration of the hang is
quite consistent. This behaviour seems pretty much independent of the
(IO) load on the nodes (as long as it is not really high).

I ran ps, grepping for processes in D state, repeated every second on
both nodes. When hanging, both show this:

    1649 D<  o2hb-02BC250CDB  ?
    3507 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3511 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3515 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3519 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3523 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3527 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    3531 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    1670 D   jbd2/dm-4-18     ?
    3535 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    1670 D   jbd2/dm-4-18     ?
    3539 R+  ps               -
    1649 D<  o2hb-02BC250CDB  ?
    1670 D   jbd2/dm-4-18     ?
    3543 R+  ps               -

ocfs2-tools is at version 1.4.4-3. The kernel is version
2.6.32-5-amd64. The kernel log of the mount at boot is here:

    [   18.911452] aoe: AoE v47 initialised.
    [   19.686358] fuse init (API version 7.13)
    [   29.000017] eth2: no IPv6 routers present
    [   36.212109] aoe: 003048f28d36 e1.1 vace0 has 9767625597 sectors
    [   36.212218] etherd/e1.1: unknown partition table
    [   59.715506] OCFS2 Node Manager 1.5.0
    [   59.732002] OCFS2 DLM 1.5.0
    [   59.733343] ocfs2: Registered cluster interface o2cb
    [   59.749185] OCFS2 DLMFS 1.5.0
    [   59.749304] OCFS2 User DLM kernel interface loaded
    [   65.347517] o2net: accepted connection from node eculture (num 1) at 130.37.193.11:7777
    [   67.884256] OCFS2 1.5.0
    [   67.886984] ocfs2_dlm: Nodes in domain ("02BC250CDB0A4B468F845C68BE99B90E"): 0 1
    [   67.890075] ocfs2: Mounting device (254,4) on (node 0, slot 0) with ordered data mode.

Installation and formatting are totally standard.
I've been spending quite a bit of time trying to get a clue about what
might be wrong, but so far I have failed. Today I played a fair bit with
debugfs, but I do not have enough experience to see what is odd. Dumping
all the locks showed just over 100,000 of them, which I thought might be
a lot, but posts suggest it isn't. No busy (-B) locks, or very few.

I checked cabling and low-level network activity. Seems ok.

Does anyone have similar experiences and/or an idea where to look?

Thanks --- Jan
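[Editor's note: the once-per-second ps that Jan describes can be scripted as below. This is a minimal sketch, not Jan's actual command line; the iteration count is arbitrary (in practice you would run it in a loop until the hang occurs).]

```shell
#!/bin/sh
# Poll for processes in uninterruptible sleep (D state) once per second.
# A STAT field beginning with 'D' means the task is blocked inside the
# kernel, typically waiting on IO -- exactly what o2hb and jbd2 show
# during the hangs described above.
for i in 1 2 3; do          # bump the count (or loop forever) in practice
    ps -eo pid,stat,comm | awk '$2 ~ /^D/'
    sleep 1
done
```

Running this on both nodes simultaneously makes it easy to see whether the o2hb heartbeat thread and the jbd2 journal thread block at the same moments on both sides, which would point at the shared AoE target rather than either node.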
Sunil Mushran
2010-Dec-07 17:07 UTC
[Ocfs2-users] Two-node cluster often hanging in o2hb/jdb2
Check the kernel stack of the D state processes:

    cat /proc/PID/stack

The kernel stack will tell us where it is waiting. My guess is that the
io stack is slow. Slow ios appear as temporary hangs to the users.

On 12/07/2010 07:45 AM, Jan Wielemaker wrote:
> [snip]
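[Editor's note: Sunil's suggestion can be automated for every D-state task at once. A minimal sketch, assuming root privileges and a kernel with /proc/PID/stack support (CONFIG_STACKTRACE, available since 2.6.29, so it applies to the 2.6.32 kernel in this thread):]

```shell
#!/bin/sh
# Dump the kernel stack of every task currently in D state.
# Reading /proc/PID/stack requires root; errors for tasks that exit
# between the ps and the cat are silenced.
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```

Captured during a hang, the stacks show whether o2hb and jbd2 are waiting in the block layer (pointing at the AoE/network path) or in the DLM (pointing at cluster locking).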