If the hang you are seeing happens after a node (with a mounted ocfs2 volume)
dies, then it is a known issue. This specific recovery bug was introduced in
1.2.7 and fixed in 1.2.8-2. 1.2.8-SLES-r3074 maps to 1.2.8-1; the fixed build
should be r3080 or later.
If so, upgrade to the latest SLES10 SP1 kernel. This was detected and
fixed a few months ago.
http://oss.oracle.com/pipermail/ocfs2-commits/2008-January/002350.html
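Once upgraded, something along these lines should confirm which build is
actually loaded (just a rough check; the exact version strings vary by
kernel):

  # modinfo ocfs2 | grep -i version
  # dmesg | grep "OCFS2 1."

If it still reports r3074, the old modules are still in use.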
But is that the real issue? You don't mention a server going down in your
original problem description, only during the test. Does a server go down
during regular operation too?
One change I would recommend: your network idle timeout is too low. We've
since increased the default to 30 secs.
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
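As a rough sketch (assuming your ocfs2-tools release reads the timeouts from
/etc/sysconfig/o2cb and takes the value in milliseconds -- check the FAQ above
for the exact knob on your version), bumping it would look something like:

  # /etc/sysconfig/o2cb -- must be identical on all nodes
  O2CB_IDLE_TIMEOUT_MS=30000

Then take the cluster offline/online (or reboot) on both nodes so the new
value is picked up everywhere.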
Sunil
Sérgio Surkamp wrote:
> Hi all,
>
> We set up an OCFS2 cluster on our storage and exported it over NFS to
> other network servers. It was working fine, but suddenly it locked up
> all NFS clients and only recovered after rebooting all servers (including
> the OCFS2 servers). It seems that under heavy load, the OCFS2+NFS
> solution is deadlocking.
>
> Setup:
> * 1 Dell Storage AX100
> * 2 Dell servers, running SuSE 10-sp1 x86_64 and attached to storage
> using fibre channel qlogic HBA
> * 4 Dell servers, running FreeBSD and accessing the shared storage by NFS
>
> The FreeBSD servers are connected in 2 groups: 2 of them mount from the
> suse #1 nfsd and 2 mount from the suse #2 nfsd, to split the load. The
> network interfaces are connected to a gigabit network with a dedicated
> switch for NFS and OCFS2 (heartbeat/sync messages) traffic.
>
> Without NFS it seems to work fine. We stressed the filesystem using
> 'iozone' many times on both servers at the same time and it worked as
> expected.
>
> During deadlock recovery, we rebooted the slave OCFS2 server (suse01)
> first and checked the 'dmesg' on master:
>
> o2net: connection to node suse01 (num 1) at 192.168.0.1:7777 has been
> idle for 10.0 seconds, shutting it down.
> (0,0):o2net_idle_timer:1434 here are some times that might help debug
> the situation: (tmr 1211375306.9290 now 1211375316.11998 dr
> 1211375306.9272 adv 1211375306.9313:1211375306.9314 func (300d6acb:502)
> 1211374816.37752:1211374816.37756)
> o2net: no longer connected to node suse01 (num 1) at 192.168.0.1:7777
> (15331,4):dlm_get_lock_resource:932
> F59B45831EEA41F384BADE6C4B7A932B:M000000000000000000001ba9d5b7e0: at
> least one node (1) to recover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:932
> F59B45831EEA41F384BADE6C4B7A932B:$RECOVERY: at least one node (1)
> to recover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:966 F59B45831EEA41F384BADE6C4B7A932B:
> recovery map is not empty, but must master $RECOVERY lock now
> (15331,4):ocfs2_replay_journal:1173 Recovering node 1 from slot 1 on
> device (8,17)
> kjournald starting. Commit interval 5 seconds
> o2net: accepted connection from node suse01 (num 1) at 192.168.0.1:7777
> ocfs2_dlm: Node 1 joins domain F59B45831EEA41F384BADE6C4B7A932B
> ocfs2_dlm: Nodes in domain ("F59B45831EEA41F384BADE6C4B7A932B"): 0 1
>
> It seems to me that something is deadlocking in the DLM resource manager.
> I used debugfs.ocfs2 to show me the active locks and many of them have
> "Blocking Mode" and/or "Requested Mode" marked as "Invalid". Can that be
> one of the problems? Why is there an Invalid Blocking Mode for DLM locks?
> Is it just a pre-allocated empty lock?
>
> System configuration:
> --> o2cb:
> # O2CB_ENABLED: 'true' means to load the driver on boot.
> O2CB_ENABLED=true
>
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=ocfs2
>
> # TIMEOUT - 600s
> O2CB_HEARTBEAT_THRESHOLD=301
>
> --> cluster.conf:
> node:
> ip_port = 7777
> ip_address = 192.168.0.10
> number = 0
> name = suse02
> cluster = ocfs2
>
> node:
> ip_port = 7777
> ip_address = 192.168.0.1
> number = 1
> name = suse01
> cluster = ocfs2
>
> cluster:
> node_count = 2
> name = ocfs2
>
> FreeBSD setup:
> * Default NFS Client configuration.
> * nfslocking daemon disabled.
> * NFS not soft mounted.
>
> SuSE package versions:
> ocfs2-tools-1.2.3-0.7
> ocfs2console-1.2.3-0.7
> nfs-utils-1.0.7-36.26
> nfsidmap-0.12-16.17
>
> OCFS2 kernel driver version:
> OCFS2 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
> OCFS2 Node Manager 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build
> sles)
> OCFS2 DLM 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
> OCFS2 DLMFS 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
>
> Any tips on what is going on?
>
> Thanks for any help.
>
> Regards,
>