If the hang you are seeing happens after a node (with a mounted ocfs2 volume)
dies, then it is a known issue. This specific recovery bug was introduced in
1.2.7 and fixed in 1.2.8-2. 1.2.8-SLES-r3074 maps to 1.2.8-1; the fixed build
should be r3080 or later.
If so, upgrade to the latest SLES10 SP1 kernel. This was detected and
fixed a few months ago.
http://oss.oracle.com/pipermail/ocfs2-commits/2008-January/002350.html
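Once upgraded, something along these lines should confirm which build is
actually loaded (just a rough check; the exact version strings vary by
kernel):

  # modinfo ocfs2 | grep -i version
  # dmesg | grep "OCFS2 1."

If it still reports r3074, the old modules are still in use.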
But is that the real issue? You don't mention a server going down in your
original problem description, only during the test. Does a server go down
during regular operation too?
One change I would recommend: your network idle timeout is too low. We've
since increased the default to 30 secs.
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
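As a rough sketch (assuming your ocfs2-tools release reads the timeouts from
/etc/sysconfig/o2cb and takes the value in milliseconds -- check the FAQ above
for the exact knob on your version), bumping it would look something like:

  # /etc/sysconfig/o2cb -- must be identical on all nodes
  O2CB_IDLE_TIMEOUT_MS=30000

Then take the cluster offline/online (or reboot) on both nodes so the new
value is picked up everywhere.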
Sunil
Sérgio Surkamp wrote:
> Hi all,
>
> We set up an OCFS2 cluster on our storage and exported it over NFS to
> other network servers. It was working fine, but suddenly it locked up
> all NFS clients and only recovered after rebooting all servers (including
> the OCFS2 servers). It seems that under heavy load, the OCFS2+NFS
> solution is deadlocking.
>
> Setup:
> * 1 Dell Storage AX100
> * 2 Dell servers, running SuSE 10-sp1 x86_64 and attached to storage
> using fibre channel qlogic HBA
> * 4 Dell servers, running FreeBSD and accessing the shared storage by NFS
>
> The FreeBSD servers are connected in 2 groups: 2 of them mount from the
> suse #1 nfsd and 2 mount from the suse #2 nfsd, to split the load. The
> network interfaces are connected to a gigabit network with a dedicated
> switch for NFS and OCFS2 (heartbeat/sync messages) traffic.
>
> Without NFS it seems to work fine. We stressed the filesystem using
> 'iozone' many times on both servers at the same time and it worked as
> expected.
>
> During deadlock recovery, we rebooted the slave OCFS2 server (suse01)
> first and checked the 'dmesg' on master:
>
> o2net: connection to node suse01 (num 1) at 192.168.0.1:7777 has been
> idle for 10.0 seconds, shutting it down.
> (0,0):o2net_idle_timer:1434 here are some times that might help debug
> the situation: (tmr 1211375306.9290 now 1211375316.11998 dr
> 1211375306.9272 adv 1211375306.9313:1211375306.9314 func (300d6acb:502)
> 1211374816.37752:1211374816.37756)
> o2net: no longer connected to node suse01 (num 1) at 192.168.0.1:7777
> (15331,4):dlm_get_lock_resource:932
> F59B45831EEA41F384BADE6C4B7A932B:M000000000000000000001ba9d5b7e0: at
> least one node (1) to recover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:932
> F59B45831EEA41F384BADE6C4B7A932B:$RECOVERY: at least one node (1)
> to recover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:966 F59B45831EEA41F384BADE6C4B7A932B:
> recovery map is not empty, but must master $RECOVERY lock now
> (15331,4):ocfs2_replay_journal:1173 Recovering node 1 from slot 1 on
> device (8,17)
> kjournald starting. Commit interval 5 seconds
> o2net: accepted connection from node suse01 (num 1) at 192.168.0.1:7777
> ocfs2_dlm: Node 1 joins domain F59B45831EEA41F384BADE6C4B7A932B
> ocfs2_dlm: Nodes in domain ("F59B45831EEA41F384BADE6C4B7A932B"): 0 1
>
> It seems to me that something is deadlocking in the DLM resource manager.
> I used debugfs.ocfs2 to show me the active locks and many of them have
> "Blocking Mode" and/or "Requested Mode" marked as "Invalid". Can that be
> one of the problems? Why is there an Invalid Blocking Mode for DLM locks?
> Is it just a pre-allocated empty lock?
>
> System configuration:
> --> o2cb:
> # O2CB_ENABLED: 'true' means to load the driver on boot.
> O2CB_ENABLED=true
>
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=ocfs2
>
> # TIMEOUT - 600s
> O2CB_HEARTBEAT_THRESHOLD=301
>
> --> cluster.conf:
> node:
> ip_port = 7777
> ip_address = 192.168.0.10
> number = 0
> name = suse02
> cluster = ocfs2
>
> node:
> ip_port = 7777
> ip_address = 192.168.0.1
> number = 1
> name = suse01
> cluster = ocfs2
>
> cluster:
> node_count = 2
> name = ocfs2
>
> FreeBSD setup:
> * Default NFS Client configuration.
> * nfslocking daemon disabled.
> * NFS not soft mounted.
>
> SuSE package versions:
> ocfs2-tools-1.2.3-0.7
> ocfs2console-1.2.3-0.7
> nfs-utils-1.0.7-36.26
> nfsidmap-0.12-16.17
>
> OCFS2 kernel driver version:
> OCFS2 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
> OCFS2 Node Manager 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build
> sles)
> OCFS2 DLM 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
> OCFS2 DLMFS 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
>
> Any tips on what is going on?
>
> Thanks for any help.
>
> Regards,
>