This issue was fixed quite some time ago. Update to the latest SLES10 SP2
kernel; it should have the fix.
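To check what each node is actually running before and after updating (the exact maintenance-kernel version carrying the fix isn't stated in this thread, so the package changelog is the thing to grep):

```shell
# Kernel actually booted on this node; compare across both nodes.
uname -r
# On SLES, the installed kernel package and its changelog show whether
# an ocfs2 flock fix is included:
#   rpm -q kernel-default
#   rpm -q --changelog kernel-default | grep -i ocfs2 | head
```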
Sunil
On Thu, Mar 12, 2009 at 07:40:13AM +0100, Bogdan Constantin wrote:
> Hello,
> I have an active, balanced webcluster with 2 SLES10SP2 nodes, both
> running ocfs2 with an iscsi target.
> The ocfs2 volume is mounted on both nodes.
> Everything works fine, except that sometimes the load on both systems
> climbs as high as 200 and both systems freeze; only a reboot regains
> control.
> I noticed that this behaviour does not occur when the cluster is NOT
> balanced: the problem appears only when the load on the two systems is
> even.
> Looking into the logs revealed strange lock problems from the DLM.
> See lines below:
> Feb 19 10:31:52 web1 kernel: (31564,0):dlm_send_remote_lock_request:315
> ERROR: status = -40
> Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock_remote:251 ERROR: dlm
> status = DLM_BADARGS
> Feb 19 10:31:52 web1 kernel: (31564,0):dlmlock:729 ERROR: dlm status
> = DLM_BADARGS
> Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_lock_create:901 ERROR: Dlm
> error "DLM_BADARGS" while calling dlmlock on resource
> F000000000000000155f89eb780960c: bad api args
> Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_file_lock:1486 ERROR:
> status = -22
> Feb 19 10:31:52 web1 kernel: (31564,0):ocfs2_do_flock:79 ERROR: status
> = -22
> Feb 19 10:33:16 web1 kernel: (7071,5):dlm_send_remote_lock_request:315
> ERROR: status = -40
> Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock_remote:251 ERROR: dlm
> status = DLM_BADARGS
> Feb 19 10:33:16 web1 kernel: (7071,5):dlmlock:729 ERROR: dlm status
> = DLM_BADARGS
> Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_lock_create:901 ERROR: Dlm
> error "DLM_BADARGS" while calling dlmlock on resource
> F000000000000000155f89eb780960c: bad api args
> Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_file_lock:1486 ERROR:
> status = -22
> Feb 19 10:33:16 web1 kernel: (7071,5):ocfs2_do_flock:79 ERROR: status
> = -22
> The heartbeat between nodes is active at all times, even when those
> errors occur.
> These are the only errors related to ocfs2 or the filesystem.
> Cluster config is as follows:
> node:
>         ip_port = 7777
>         ip_address = 10.0.0.1
>         number = 0
>         name = web1
>         cluster = ocfs2
> node:
>         ip_port = 7777
>         ip_address = 10.0.0.2
>         number = 1
>         name = web2
>         cluster = ocfs2
> cluster:
>         node_count = 2
>         name = ocfs2
> I suspect that the cluster nodes have problems creating and releasing
> the locks, thus knocking each other out.
> I have searched but cannot find anything about this problem; the only
> thing that turns up on the Oracle site is [1][Ocfs2-devel] [PATCH]
> ocfs2: fix DLM_BADARGS error in concurrent file locking, but I'm not
> skilled enough to follow what the developers are discussing.
> I can reproduce the freeze problem and the dlm errors at any time.
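For reference, the quoted patch is about concurrent flock() callers, and the status -22 above is -EINVAL, the value ocfs2_do_flock reports. A minimal sketch of that access pattern, with two processes repeatedly taking an exclusive flock on the same file (the ocfs2 mount path is hypothetical; on a local filesystem, or a fixed kernel, the loop completes cleanly):

```python
import fcntl
import os
import sys

def flock_loop(path, rounds=100):
    # Repeatedly take and drop an exclusive flock() on the same file.
    # On an affected ocfs2 kernel this path reportedly fails with
    # EINVAL (errno 22), matching the ocfs2_do_flock errors in the log;
    # elsewhere it completes cleanly.
    for _ in range(rounds):
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            fcntl.flock(fd, fcntl.LOCK_UN)
        finally:
            os.close(fd)
    return True

def contend(path):
    # Two processes contend for the lock, as the two balanced web
    # nodes would when both serve the same files.
    pid = os.fork()
    ok = flock_loop(path)
    if pid == 0:
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return ok and os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

if __name__ == "__main__":
    # Hypothetical file on the shared ocfs2 mount.
    sys.exit(0 if contend("/mnt/ocfs2/flock-test") else 1)
```

Running this simultaneously on both nodes against the shared mount should show whether the flock path still errors after a kernel update.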
> Has anyone encountered the same problem? Does anyone know whether
> Novell offers support for this kind of situation? I have a Standard
> subscription on both systems.
> Thanks a lot, every hint is welcome.
>
> References
>
> 1. http://oss.oracle.com/pipermail/ocfs2-devel/2008-December/003464.html
> _______________________________________________
> Ocfs-users mailing list
> Ocfs-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs-users