This issue was fixed quite some time ago. Update to the latest SLES10 SP2
kernel; it should have the fix.
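To check what each node is actually running before and after updating (the exact maintenance-kernel version carrying the fix isn't stated in this thread, so the package changelog is the thing to grep):

```shell
# Kernel actually booted on this node; compare across both nodes.
uname -r
# On SLES, the installed kernel package and its changelog show whether
# an ocfs2 flock fix is included:
#   rpm -q kernel-default
#   rpm -q --changelog kernel-default | grep -i ocfs2 | head
```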
Sunil
On Thu, Mar 12, 2009 at 07:40:13AM +0100, Bogdan Constantin wrote:
> Hello,
> I have an active, balanced webcluster with 2 SLES10SP2 nodes, both
> running ocfs2 with an iscsi target.
> The ocfs2 volume is mounted on both nodes.
> Everything works fine, except that sometimes the load on both systems
> climbs as high as 200 and both systems freeze; only a reboot regains
> control.
> I noticed that this behaviour does not occur when the cluster is NOT
> balanced: the problem appears only when the load on the two systems is
> even.
> Looking into the logs revealed strange lock problems from the DLM.
> See lines below:
> Feb 19 10:31:52 web1 kernel: (31564,0):dlm_send_remote_lock_request:315
> ERROR: status = -40
> Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock_remote:251 ERROR: dlm
> status = DLM_BADARGS
> Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock:729 ERROR: dlm status
> = DLM_BADARGS
> Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_lock_create:901 ERROR: Dlm
> error "DLM_BADARGS" while calling dlmlock on resource
> F000000000000000155f89eb780960c: bad api args
> Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_file_lock:1486 ERROR:
> status = -22
> Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_do_flock:79 ERROR: status
> = -22
> Feb 19 10:33:16 web1 kernel: (7071,5):dlm_send_remote_lock_request:315
> ERROR: status = -40
> Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock_remote:251 ERROR: dlm
> status = DLM_BADARGS
> Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock:729 ERROR: dlm status
> = DLM_BADARGS
> Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_lock_create:901 ERROR: Dlm
> error "DLM_BADARGS" while calling dlmlock on resource
> F000000000000000155f89eb780960c: bad api args
> Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_file_lock:1486 ERROR:
> status = -22
> Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_do_flock:79 ERROR: status
> = -22
> The heartbeat between nodes is active at all times, even when those
> errors occur.
> These are the only errors related to ocfs2 or the filesystem.
> Cluster config is as follows:
> node:
>         ip_port = 7777
>         ip_address = 10.0.0.1
>         number = 0
>         name = web1
>         cluster = ocfs2
> node:
>         ip_port = 7777
>         ip_address = 10.0.0.2
>         number = 1
>         name = web2
>         cluster = ocfs2
> cluster:
>         node_count = 2
>         name = ocfs2
> I suspect that the cluster nodes have problems creating and releasing
> the locks, thus knocking each other out.
> I have searched but cannot find anything about this problem; the only
> thing that turns up on the Oracle site is [1][Ocfs2-devel] [PATCH]
> ocfs2: fix DLM_BADARGS error in concurrent file locking, but I'm not
> skilled enough to follow what the developers are discussing.
> I can reproduce the freeze problem and the dlm errors at any time.
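For reference, the quoted patch is about concurrent flock() callers, and the status -22 above is -EINVAL, the value ocfs2_do_flock reports. A minimal sketch of that access pattern, with two processes repeatedly taking an exclusive flock on the same file (the ocfs2 mount path is hypothetical; on a local filesystem, or a fixed kernel, the loop completes cleanly):

```python
import fcntl
import os
import sys

def flock_loop(path, rounds=100):
    # Repeatedly take and drop an exclusive flock() on the same file.
    # On an affected ocfs2 kernel this path reportedly fails with
    # EINVAL (errno 22), matching the ocfs2_do_flock errors in the log;
    # elsewhere it completes cleanly.
    for _ in range(rounds):
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            fcntl.flock(fd, fcntl.LOCK_UN)
        finally:
            os.close(fd)
    return True

def contend(path):
    # Two processes contend for the lock, as the two balanced web
    # nodes would when both serve the same files.
    pid = os.fork()
    ok = flock_loop(path)
    if pid == 0:
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return ok and os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

if __name__ == "__main__":
    # Hypothetical file on the shared ocfs2 mount.
    sys.exit(0 if contend("/mnt/ocfs2/flock-test") else 1)
```

Running this simultaneously on both nodes against the shared mount should show whether the flock path still errors after a kernel update.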
> Has anyone encountered the same problem? Does anyone know whether
> Novell offers support for this kind of situation? I have a Standard
> subscription on both systems.
> Thanks a lot, every hint is welcome.
>
> References
>
> 1. http://oss.oracle.com/pipermail/ocfs2-devel/2008-December/003464.html
> _______________________________________________
> Ocfs-users mailing list
> Ocfs-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs-users