Hello,

I have several LUNs mounted in a 7-node cluster; one LUN is used on only 4 of the nodes.

It's impossible to umount this LUN - I always get "device is busy" on those nodes.

I'm using SuSE Linux Enterprise 10 SP2 with kernel 2.6.16.60-0.34-smp.

lsof doesn't show anything that could still be open on this volume.

host1:~ # mount
....
/dev/dm-10 on /srv/www/vhosts type ocfs2 (rw,_netdev,heartbeat=local)

host1:~ # umount /dev/dm-10 -vvv -f
Trying to umount /dev/dm-10
umount2: Device or resource busy
umount: /srv/www/vhosts: device is busy
umount2: Device or resource busy
umount: /srv/www/vhosts: device is busy

Here are the log files from the other node, which was blocking all requests until the node doing the umount got evicted:

Jul 30 09:31:12 host1-s-01 kernel: o2net: connection to node host1-s-03 (num 2) at 10.0.1.170:7777 has been idle for 60.0 seconds, shutting it down.
Jul 30 09:31:12 host1-s-01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help debug the situation: (tmr 1248939012.142271 now 1248939072.147815 dr 1248939012.142267 adv 1248939012.142271:1248939012.142272 func (1c9b2828:502) 1248939007.404799:1248939007.404802)
Jul 30 09:31:12 host1-s-01 kernel: o2net: no longer connected to node host1-s-03 (num 2) at 10.0.1.170:7777
Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_get_lock_resource:937 ERROR: status = -112
Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_get_lock_resource:937 ERROR: status = -112
Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_get_lock_resource:937 ERROR: status = -112
Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_get_lock_resource:937 ERROR: status = -107
Jul 30 09:32:12 host1-s-01 kernel: (5723,0):o2net_connect_expired:1637 ERROR: no connection established with node 2 after 60.0 seconds, giving up and returning errors.
Jul 30 09:32:14 host1-s-01 kernel: (5764,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has evicted node 2
Jul 30 09:32:17 host1-s-01 kernel: (5723,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has evicted node 2
Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:17 host1-s-01 kernel: (16637,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:17 host1-s-01 kernel: (16637,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:18 host1-s-01 kernel: (16685,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:M0000000000000008bfed0600000000: at least one node (2) to recover before lock mastery can begin
Jul 30 09:32:18 host1-s-01 kernel: (16680,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:N0000000003560f35: at least one node (2) to recover before lock mastery can begin
Jul 30 09:32:18 host1-s-01 kernel: (16637,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:N0000000008172be2: at least one node (2) to recover before lock mastery can begin
Jul 30 09:32:18 host1-s-01 kernel: (16755,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:18 host1-s-01 kernel: (16755,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:19 host1-s-01 kernel: (16755,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:O000000000000000a40040b00000000: at least one node (2) to recover before lock mastery can begin
Jul 30 09:33:01 host1-s-01 kernel: o2net: accepted connection from node host1-s-03 (num 2) at 10.0.1.170:7777
Jul 30 09:33:05 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain D7581877783A4174A498C97DDC573E52
Jul 30 09:33:05 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("D7581877783A4174A498C97DDC573E52"): 0 1 2 3
Jul 30 09:33:09 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain F7929E2FBDB0487DA142467EB725FC22
Jul 30 09:33:09 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("F7929E2FBDB0487DA142467EB725FC22"): 0 1 2 3
Jul 30 09:33:13 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 6831EB702AC04901A6D5BBE7EBE691AE
Jul 30 09:33:13 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("6831EB702AC04901A6D5BBE7EBE691AE"): 0 1 2 3
Jul 30 09:33:19 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 0E7D6EE19D644648919028729AE662A1
Jul 30 09:33:19 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("0E7D6EE19D644648919028729AE662A1"): 0 1 2 3 4 5 6
Jul 30 09:33:23 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 63865CE5EDE74713A6B3CECE2A3923C0
Jul 30 09:33:23 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("63865CE5EDE74713A6B3CECE2A3923C0"): 0 1 2 3 4 5 6
Jul 30 09:33:27 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain D5754A078396403FB841A798BE945A26
Jul 30 09:33:27 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("D5754A078396403FB841A798BE945A26"): 0 1 2 3 4 5 6
Jul 30 09:33:31 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 976B1863D9114CFAA314354BB1235577
Jul 30 09:33:31 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("976B1863D9114CFAA314354BB1235577"): 0 1 2 3 4 5 6
Jul 30 09:33:35 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 933698C044EF46D9A175057C523C2D1E
Jul 30 09:33:35 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("933698C044EF46D9A175057C523C2D1E"): 0 1 2 3 4 5 6

The strange thing is that this only happens with one of the volumes - perhaps you could point me in the right direction on how to fix this volume?

--
Ing. Georg Höllrigl
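[Note: lsof can miss holders of a mount such as a second (bind) mount or a kernel-side NFS export, either of which also produces EBUSY. A minimal sketch of extra checks, assuming the mount point /srv/www/vhosts from the post above and the standard fuser from psmisc:

# Show any process holding open files, a cwd, or mmaps on the volume
fuser -vm /srv/www/vhosts

# Check for a second mount or bind mount of the same device
grep vhosts /proc/mounts

# Check whether the path is exported over NFS (nfsd holds a reference)
exportfs -v | grep vhosts
]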
"device busy" could be because you have a shell having that as the cwd. Check: ls -l /proc/[0-9]*/cwd Georg H?llrigl wrote:> Hello, > > I've several LUNs mounted in a 7 node cluster - one LUN which is only used on 4 of the nodes. > > It's impossible to umount this LUN - I'm always getting device is busy on my nodes. > > I'm using SuSE Linux Enterprise 10 SP 2 with Kernel 2.6.16.60-0.34-smp > > lsof doesn't show anything that could still be open on this volume. > > > host1:~ # mount > .... > /dev/dm-10 on /srv/www/vhosts type ocfs2 (rw,_netdev,heartbeat=local) > > host1:~ # umount /dev/dm-10 -vvv -f > Trying to umount /dev/dm-10 > umount2: Device or resource busy > umount: /srv/www/vhosts: device is busy > umount2: Device or resource busy > umount: /srv/www/vhosts: device is busy > > > Here are the Logfiles from the other node - which was blocking all requests until the node that > umounted got evicted. > > Jul 30 09:31:12 host1-s-01 kernel: o2net: connection to node host1-s-03 (num 2) at 10.0.1.170:7777 > has been idle for 60.0 seconds, shutting it down. > Jul 30 09:31:12 host1-s-01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help > debug the situation: (tmr 1248939012.142271 now 1248939072.147815 dr 1248939012.142267 adv > 1248939012.142271:1248939012.142272 func (1c9b2828 > :502) 1248939007.404799:1248939007.404802) > Jul 30 09:31:12 host1-s-01 kernel: o2net: no longer connected to node host1-s-03 (num 2) at > 10.0.1.170:7777 > Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_get_lock_resource:937 ERROR: status = -112 > Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_get_lock_resource:937 ERROR: status = -112 > Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_get_lock_resource:937 ERROR: status = -112 > Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_get_lock_resource:937 ERROR: status = -107 > Jul 30 09:32:12 host1-s-01 kernel: (5723,0):o2net_connect_expired:1637 ERROR: no connection > established with node 2 after 60.0 seconds, giving up and returning errors. > Jul 30 09:32:14 host1-s-01 kernel: (5764,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has > evicted node 2 > Jul 30 09:32:17 host1-s-01 kernel: (5723,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has > evicted node 2 > Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2 > Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11 > Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2 > Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11 > Jul 30 09:32:17 host1-s-01 kernel: (16637,0):dlm_restart_lock_mastery:1243 ERROR: node down! 