Hello,

I have several LUNs mounted in a 7-node cluster; one LUN is used on only 4 of the nodes.

It's impossible to umount this LUN - I always get "device is busy" on those nodes.

I'm using SuSE Linux Enterprise 10 SP2 with kernel 2.6.16.60-0.34-smp.

lsof doesn't show anything that could still be open on this volume.

host1:~ # mount
....
/dev/dm-10 on /srv/www/vhosts type ocfs2 (rw,_netdev,heartbeat=local)

host1:~ # umount /dev/dm-10 -vvv -f
Trying to umount /dev/dm-10
umount2: Device or resource busy
umount: /srv/www/vhosts: device is busy
umount2: Device or resource busy
umount: /srv/www/vhosts: device is busy

Here are the log files from the other node, which was blocking all requests until the node doing the umount got evicted:

Jul 30 09:31:12 host1-s-01 kernel: o2net: connection to node host1-s-03 (num 2) at 10.0.1.170:7777 has been idle for 60.0 seconds, shutting it down.
Jul 30 09:31:12 host1-s-01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help debug the situation: (tmr 1248939012.142271 now 1248939072.147815 dr 1248939012.142267 adv 1248939012.142271:1248939012.142272 func (1c9b2828:502) 1248939007.404799:1248939007.404802)
Jul 30 09:31:12 host1-s-01 kernel: o2net: no longer connected to node host1-s-03 (num 2) at 10.0.1.170:7777
Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_get_lock_resource:937 ERROR: status = -112
Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_get_lock_resource:937 ERROR: status = -112
Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_get_lock_resource:937 ERROR: status = -112
Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_do_master_request:1360 ERROR: link to 2 went down!
Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_get_lock_resource:937 ERROR: status = -107
Jul 30 09:32:12 host1-s-01 kernel: (5723,0):o2net_connect_expired:1637 ERROR: no connection established with node 2 after 60.0 seconds, giving up and returning errors.
Jul 30 09:32:14 host1-s-01 kernel: (5764,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has evicted node 2
Jul 30 09:32:17 host1-s-01 kernel: (5723,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has evicted node 2
Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:17 host1-s-01 kernel: (16637,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:17 host1-s-01 kernel: (16637,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:18 host1-s-01 kernel: (16685,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:M0000000000000008bfed0600000000: at least one node (2) to recover before lock mastery can begin
Jul 30 09:32:18 host1-s-01 kernel: (16680,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:N0000000003560f35: at least one node (2) to recover before lock mastery can begin
Jul 30 09:32:18 host1-s-01 kernel: (16637,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:N0000000008172be2: at least one node (2) to recover before lock mastery can begin
Jul 30 09:32:18 host1-s-01 kernel: (16755,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2
Jul 30 09:32:18 host1-s-01 kernel: (16755,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11
Jul 30 09:32:19 host1-s-01 kernel: (16755,0):dlm_get_lock_resource:918 D7581877783A4174A498C97DDC573E52:O000000000000000a40040b00000000: at least one node (2) to recover before lock mastery can begin
Jul 30 09:33:01 host1-s-01 kernel: o2net: accepted connection from node host1-s-03 (num 2) at 10.0.1.170:7777
Jul 30 09:33:05 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain D7581877783A4174A498C97DDC573E52
Jul 30 09:33:05 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("D7581877783A4174A498C97DDC573E52"): 0 1 2 3
Jul 30 09:33:09 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain F7929E2FBDB0487DA142467EB725FC22
Jul 30 09:33:09 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("F7929E2FBDB0487DA142467EB725FC22"): 0 1 2 3
Jul 30 09:33:13 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 6831EB702AC04901A6D5BBE7EBE691AE
Jul 30 09:33:13 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("6831EB702AC04901A6D5BBE7EBE691AE"): 0 1 2 3
Jul 30 09:33:19 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 0E7D6EE19D644648919028729AE662A1
Jul 30 09:33:19 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("0E7D6EE19D644648919028729AE662A1"): 0 1 2 3 4 5 6
Jul 30 09:33:23 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 63865CE5EDE74713A6B3CECE2A3923C0
Jul 30 09:33:23 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("63865CE5EDE74713A6B3CECE2A3923C0"): 0 1 2 3 4 5 6
Jul 30 09:33:27 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain D5754A078396403FB841A798BE945A26
Jul 30 09:33:27 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("D5754A078396403FB841A798BE945A26"): 0 1 2 3 4 5 6
Jul 30 09:33:31 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 976B1863D9114CFAA314354BB1235577
Jul 30 09:33:31 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("976B1863D9114CFAA314354BB1235577"): 0 1 2 3 4 5 6
Jul 30 09:33:35 host1-s-01 kernel: ocfs2_dlm: Node 2 joins domain 933698C044EF46D9A175057C523C2D1E
Jul 30 09:33:35 host1-s-01 kernel: ocfs2_dlm: Nodes in domain ("933698C044EF46D9A175057C523C2D1E"): 0 1 2 3 4 5 6

The strange thing is that this only happens with one of the volumes - perhaps you could point me in the right direction on how to fix this volume?

--
Ing. Georg Höllrigl
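[Note: lsof can miss holders of a mount such as a second (bind) mount or a kernel-side NFS export, either of which also produces EBUSY. A minimal sketch of extra checks, assuming the mount point /srv/www/vhosts from the post above and the standard fuser from psmisc:

# Show any process holding open files, a cwd, or mmaps on the volume
fuser -vm /srv/www/vhosts

# Check for a second mount or bind mount of the same device
grep vhosts /proc/mounts

# Check whether the path is exported over NFS (nfsd holds a reference)
exportfs -v | grep vhosts
]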
"device busy" could be because you have a shell having that as the cwd. Check: ls -l /proc/[0-9]*/cwd Georg H?llrigl wrote:> Hello, > > I've several LUNs mounted in a 7 node cluster - one LUN which is only used on 4 of the nodes. > > It's impossible to umount this LUN - I'm always getting device is busy on my nodes. > > I'm using SuSE Linux Enterprise 10 SP 2 with Kernel 2.6.16.60-0.34-smp > > lsof doesn't show anything that could still be open on this volume. > > > host1:~ # mount > .... > /dev/dm-10 on /srv/www/vhosts type ocfs2 (rw,_netdev,heartbeat=local) > > host1:~ # umount /dev/dm-10 -vvv -f > Trying to umount /dev/dm-10 > umount2: Device or resource busy > umount: /srv/www/vhosts: device is busy > umount2: Device or resource busy > umount: /srv/www/vhosts: device is busy > > > Here are the Logfiles from the other node - which was blocking all requests until the node that > umounted got evicted. > > Jul 30 09:31:12 host1-s-01 kernel: o2net: connection to node host1-s-03 (num 2) at 10.0.1.170:7777 > has been idle for 60.0 seconds, shutting it down. > Jul 30 09:31:12 host1-s-01 kernel: (0,0):o2net_idle_timer:1476 here are some times that might help > debug the situation: (tmr 1248939012.142271 now 1248939072.147815 dr 1248939012.142267 adv > 1248939012.142271:1248939012.142272 func (1c9b2828 > :502) 1248939007.404799:1248939007.404802) > Jul 30 09:31:12 host1-s-01 kernel: o2net: no longer connected to node host1-s-03 (num 2) at > 10.0.1.170:7777 > Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:12 host1-s-01 kernel: (16685,0):dlm_get_lock_resource:937 ERROR: status = -112 > Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:12 host1-s-01 kernel: (16680,0):dlm_get_lock_resource:937 ERROR: status = -112 > Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:12 host1-s-01 kernel: (16637,0):dlm_get_lock_resource:937 ERROR: status = -112 > Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_do_master_request:1360 ERROR: link to 2 went down! > Jul 30 09:31:13 host1-s-01 kernel: (16755,0):dlm_get_lock_resource:937 ERROR: status = -107 > Jul 30 09:32:12 host1-s-01 kernel: (5723,0):o2net_connect_expired:1637 ERROR: no connection > established with node 2 after 60.0 seconds, giving up and returning errors. > Jul 30 09:32:14 host1-s-01 kernel: (5764,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has > evicted node 2 > Jul 30 09:32:17 host1-s-01 kernel: (5723,0):ocfs2_dlm_eviction_cb:108 device (253,10): dlm has > evicted node 2 > Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2 > Jul 30 09:32:17 host1-s-01 kernel: (16685,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11 > Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_restart_lock_mastery:1243 ERROR: node down! 2 > Jul 30 09:32:17 host1-s-01 kernel: (16680,0):dlm_wait_for_lock_mastery:1060 ERROR: status = -11 > Jul 30 09:32:17 host1-s-01 kernel: (16637,0):dlm_restart_lock_mastery:1243 ERROR: node down! 