Frans Lawaetz
2008-Oct-27 18:52 UTC
[Ocfs2-users] Soft lockup preceding directory becoming inaccessible.
Hi,

I experienced an OCFS2-related OS crash and, potentially, some sort of cluster-wide locking problem that I am having trouble diagnosing.

The symptoms first appeared as alerts about very high load across the Rails / OCFS2 cluster (four nodes). Further debugging showed that one directory in particular was causing Rails processes to stop responding (even to a kill -9). Changing into this directory would result in my shell immediately hanging. Strangely, I could run a "find" down this directory, but a "du" would lock up. The directory itself is tiny, comprising only 147 files, 11 MB in size. About half of these files get updated every hour or so as their cache contents expire. Forty Rails processes spread out over the four OCFS2 nodes can opportunistically update one of these cache files if they are the first to run a query against it (meaning that there may have been near-simultaneous attempts at an update).

I rebooted three of the nodes, two of which successfully came back online. The fourth node in the cluster (the one I issued the reboot commands from) experienced a kernel panic at this juncture and froze.

After all nodes were cleanly rebooted, I unmounted the partition everywhere and ran an fsck on the filesystem. It finished with no errors reported. The data in the cache files, however, was somewhat corrupted (not totally random, but looking a lot like a failed write).

The closest thing to a smoking gun is the output below, tens of pages of which were recorded to the node's syslog facility (no other nodes reported these lockups). The time aligns with when things first started to go wrong.

I am running CentOS 5, kernel 2.6.18-53.1.21.el5 #1 SMP x86_64, with ocfs2-tools 1.2.7-1 and OCFS2 1.2.8-2.

Oct 26 15:37:39 web2 kernel: BUG: soft lockup detected on CPU#1!
Oct 26 15:37:39 web2 kernel:
Oct 26 15:37:39 web2 kernel: Call Trace:
Oct 26 15:37:39 web2 kernel: <IRQ> [<ffffffff800b6475>] softlockup_tick+0xd5/0xe7
Oct 26 15:37:39 web2 kernel: [<ffffffff80091040>] tasklet_action+0x62/0xac
Oct 26 15:37:39 web2 kernel: [<ffffffff8009413e>] update_process_times+0x42/0x68
Oct 26 15:37:39 web2 kernel: [<ffffffff80075724>] smp_local_timer_interrupt+0x2c/0x61
Oct 26 15:37:39 web2 kernel: [<ffffffff8005d2fc>] call_softirq+0x1c/0x28
Oct 26 15:37:39 web2 kernel: [<ffffffff80075dec>] smp_apic_timer_interrupt+0x41/0x47
Oct 26 15:37:39 web2 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c
Oct 26 15:37:39 web2 kernel: <EOI> [<ffffffff884644fb>] :ocfs2:ocfs2_get_block+0x0/0x967
Oct 26 15:37:39 web2 kernel: [<ffffffff800d82a5>] cont_prepare_write+0x12c/0x26e
Oct 26 15:37:39 web2 kernel: [<ffffffff800d822c>] cont_prepare_write+0xb3/0x26e
Oct 26 15:37:39 web2 kernel: [<ffffffff884644fb>] :ocfs2:ocfs2_get_block+0x0/0x967
Oct 26 15:37:39 web2 kernel: [<ffffffff88465172>] :ocfs2:ocfs2_prepare_write+0x176/0x1d5
Oct 26 15:37:39 web2 kernel: [<ffffffff8000f9cb>] generic_file_buffered_write+0x25a/0x6d8
Oct 26 15:37:39 web2 kernel: [<ffffffff8846281c>] :ocfs2:ocfs2_insert_extent+0xb8d/0xc77
Oct 26 15:37:39 web2 kernel: [<ffffffff8000ddd9>] current_fs_time+0x3b/0x40
Oct 26 15:37:39 web2 kernel: [<ffffffff80015df1>] __generic_file_aio_write_nolock+0x36c/0x3b8
Oct 26 15:37:39 web2 kernel: [<ffffffff88472d53>] :ocfs2:ocfs2_extend_file+0xc89/0xd17
Oct 26 15:37:39 web2 kernel: [<ffffffff800bf380>] generic_file_aio_write_nolock+0x20/0x6c
Oct 26 15:37:39 web2 kernel: [<ffffffff800bf74c>] generic_file_write_nolock+0x8f/0xa8
Oct 26 15:37:39 web2 kernel: [<ffffffff8009c4a2>] autoremove_wake_function+0x0/0x2e
Oct 26 15:37:39 web2 kernel: [<ffffffff800635bf>] __down_read+0x12/0x92
Oct 26 15:37:39 web2 kernel: [<ffffffff88473435>] :ocfs2:ocfs2_file_write+0x1cb/0x28a
Oct 26 15:37:39 web2 kernel: [<ffffffff80016203>] vfs_write+0xce/0x174
Oct 26 15:37:39 web2 kernel: [<ffffffff80016ad0>] sys_write+0x45/0x6e
Oct 26 15:37:39 web2 kernel: [<ffffffff8005c28d>] tracesys+0xd5/0xe0
Oct 26 15:37:39 web2 kernel:
Oct 26 15:38:10 web2 kernel: BUG: soft lockup detected on CPU#3!
Oct 26 15:38:10 web2 kernel:
Oct 26 15:38:10 web2 kernel: Call Trace:
Oct 26 15:38:10 web2 kernel: <IRQ> [<ffffffff800b6475>] softlockup_tick+0xd5/0xe7
Oct 26 15:38:10 web2 kernel: [<ffffffff80091040>] tasklet_action+0x62/0xac
Oct 26 15:38:10 web2 kernel: [<ffffffff8009413e>] update_process_times+0x42/0x68

The only other syslog messages that might be of interest were reported on a different system, some hours later, when I attempted to reboot the cluster nodes.

Oct 26 18:33:51 web3 kernel: o2net: connection to node web1 (num 1) at x.x.x.x:7777 has been idle for 30.0 seconds, shutting it down.
Oct 26 18:33:51 web3 kernel: (0,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225046001.897649 now 1225046031.898315 dr 1225046001.897645 adv 1225046001.897651:1225046001.897652 func (efd9a54a:502) 1225046001.881165:1225046001.881184)
Oct 26 18:33:51 web3 kernel: o2net: no longer connected to node web1 (num 1) at 10.5.7.10:7777
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_broadcast_vote:731 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_do_request_vote:804 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_query_inode_wipe:760 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_delete_inode:851 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_send_remote_unlock_request:360 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_wait_for_node_death:365 E7CF726AF2D745CC82E521F820198D83: waiting 5000ms for notification of death of node 1
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -107

[and pages more of these status codes]

Thanks in advance,
Frans
Sunil Mushran
2008-Oct-27 20:45 UTC
[Ocfs2-users] Soft lockup preceding directory becoming inaccessible.
find and du are not the same. A find may only need to read the directory entries, and thus only locks the directory inode. OTOH, du needs to read all the inodes in the directory tree.

The stack shows a soft lockup that could have been triggered if the process was extending the file, ftruncate(), to a very large size. As ocfs2 1.2 does not support sparse files, it would not only need to allocate the space on disk, it would also need to clear out the pages in memory. If this is happening frequently, look into upgrading to ocfs2 1.4, as it supports sparse files. For more, refer to the 1.4 user's guide.

The last issue looks similar to a bugzilla we fixed in 1.2.9-1. So if not 1.4, at least upgrade to 1.2.9-1.

Sunil
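[Editor's note: the find-versus-du distinction above comes down to which system calls each tool issues. The minimal userspace sketch below is not OCFS2 code; list_names() and sum_sizes() are made-up names, and real du sums block counts rather than st_size, but the access patterns are the point: listing names needs only readdir() on the directory, while summing sizes must stat() every entry, and on a cluster filesystem each stat() takes that file's inode lock, so a single inode stuck behind a cluster lock stalls the whole walk.]

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Roughly what a plain "find" needs: read the directory entries.
 * Only the directory inode's lock is involved. */
static void list_names(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;

    if (!d)
        return;
    while ((de = readdir(d)) != NULL)
        printf("%s\n", de->d_name);
    closedir(d);
}

/* Roughly what "du" does: stat() every entry to add up sizes.
 * Each stat() must take that file's inode lock, so one busy inode
 * stalls the entire walk. */
static void sum_sizes(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    struct stat st;
    char path[4096];
    long long total = 0;

    if (!d)
        return;
    while ((de = readdir(d)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0)
            total += st.st_size;   /* real du uses block counts; same locking */
    }
    closedir(d);
    printf("%lld bytes\n", total);
}

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";

    list_names(dir);
    sum_sizes(dir);
    return 0;
}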
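[Editor's note: the extend-to-a-large-size case Sunil describes can be sketched as below. The file path and size are hypothetical; the sketch only shows that on ocfs2 1.2, which lacks sparse files, growing a file this way forces allocation and zeroing of all the intervening space, whereas a sparse-capable filesystem such as ocfs2 1.4 simply records the new size.]

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical cache file on the shared ocfs2 mount */
    int fd = open("/ocfs2/cache/example.dat", O_WRONLY | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Grow the file to 1 GiB without writing any data ourselves.
     * Without sparse file support (ocfs2 1.2) the filesystem has to
     * allocate and zero all of that space before returning; with
     * sparse files (ocfs2 1.4) it only records the new size. */
    if (ftruncate(fd, (off_t)1 << 30) < 0)
        perror("ftruncate");

    close(fd);
    return 0;
}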