Frans Lawaetz
2008-Oct-27 18:52 UTC
[Ocfs2-users] Soft lockup preceding directory becoming inaccessible.
Hi,

I experienced an OCFS2-related OS crash and, potentially, some sort of cluster-wide locking problem that I am having trouble diagnosing.

The symptoms first appeared as alerts about very high load across the Rails / OCFS2 cluster (four nodes). Further debugging showed that one directory in particular was causing Rails processes to stop responding (even to a kill -9). Changing into this directory would result in my shell immediately hanging. Strangely, I could run a "find" down this directory, but a "du" would lock up. The directory itself is tiny, comprising only 147 files, 11 MB in size. About half of these files get updated every hour or so as their cache contents expire. Forty Rails processes spread out over the four OCFS2 nodes can opportunistically update one of these cache files if they are the first to run a query against it (meaning that there may have been near-simultaneous attempts at an update).

I rebooted three of the nodes, two of which successfully came back online. The fourth node in the cluster (the one I issued the reboot commands from) experienced a kernel panic at this juncture and froze.

After all nodes were cleanly rebooted, I unmounted the partition everywhere and ran an fsck on the filesystem. It finished with no errors reported. The data in the cache files, however, was somewhat corrupted (not totally random, but looking a lot like a failed write).

The closest thing to a smoking gun is the output below, tens of pages of which were recorded to the node's syslog facility (no other nodes reported these lockups). The time aligns with when things first started to go wrong.

I am running CentOS 5, kernel 2.6.18-53.1.21.el5 #1 SMP x86_64, with ocfs2-tools 1.2.7-1 and OCFS2 1.2.8-2.

Oct 26 15:37:39 web2 kernel: BUG: soft lockup detected on CPU#1!
Oct 26 15:37:39 web2 kernel:
Oct 26 15:37:39 web2 kernel: Call Trace:
Oct 26 15:37:39 web2 kernel: <IRQ> [<ffffffff800b6475>] softlockup_tick+0xd5/0xe7
Oct 26 15:37:39 web2 kernel: [<ffffffff80091040>] tasklet_action+0x62/0xac
Oct 26 15:37:39 web2 kernel: [<ffffffff8009413e>] update_process_times+0x42/0x68
Oct 26 15:37:39 web2 kernel: [<ffffffff80075724>] smp_local_timer_interrupt+0x2c/0x61
Oct 26 15:37:39 web2 kernel: [<ffffffff8005d2fc>] call_softirq+0x1c/0x28
Oct 26 15:37:39 web2 kernel: [<ffffffff80075dec>] smp_apic_timer_interrupt+0x41/0x47
Oct 26 15:37:39 web2 kernel: [<ffffffff8005cc8e>] apic_timer_interrupt+0x66/0x6c
Oct 26 15:37:39 web2 kernel: <EOI> [<ffffffff884644fb>] :ocfs2:ocfs2_get_block+0x0/0x967
Oct 26 15:37:39 web2 kernel: [<ffffffff800d82a5>] cont_prepare_write+0x12c/0x26e
Oct 26 15:37:39 web2 kernel: [<ffffffff800d822c>] cont_prepare_write+0xb3/0x26e
Oct 26 15:37:39 web2 kernel: [<ffffffff884644fb>] :ocfs2:ocfs2_get_block+0x0/0x967
Oct 26 15:37:39 web2 kernel: [<ffffffff88465172>] :ocfs2:ocfs2_prepare_write+0x176/0x1d5
Oct 26 15:37:39 web2 kernel: [<ffffffff8000f9cb>] generic_file_buffered_write+0x25a/0x6d8
Oct 26 15:37:39 web2 kernel: [<ffffffff8846281c>] :ocfs2:ocfs2_insert_extent+0xb8d/0xc77
Oct 26 15:37:39 web2 kernel: [<ffffffff8000ddd9>] current_fs_time+0x3b/0x40
Oct 26 15:37:39 web2 kernel: [<ffffffff80015df1>] __generic_file_aio_write_nolock+0x36c/0x3b8
Oct 26 15:37:39 web2 kernel: [<ffffffff88472d53>] :ocfs2:ocfs2_extend_file+0xc89/0xd17
Oct 26 15:37:39 web2 kernel: [<ffffffff800bf380>] generic_file_aio_write_nolock+0x20/0x6c
Oct 26 15:37:39 web2 kernel: [<ffffffff800bf74c>] generic_file_write_nolock+0x8f/0xa8
Oct 26 15:37:39 web2 kernel: [<ffffffff8009c4a2>] autoremove_wake_function+0x0/0x2e
Oct 26 15:37:39 web2 kernel: [<ffffffff800635bf>] __down_read+0x12/0x92
Oct 26 15:37:39 web2 kernel: [<ffffffff88473435>] :ocfs2:ocfs2_file_write+0x1cb/0x28a
Oct 26 15:37:39 web2 kernel: [<ffffffff80016203>] vfs_write+0xce/0x174
Oct 26 15:37:39 web2 kernel: [<ffffffff80016ad0>] sys_write+0x45/0x6e
Oct 26 15:37:39 web2 kernel: [<ffffffff8005c28d>] tracesys+0xd5/0xe0
Oct 26 15:37:39 web2 kernel:
Oct 26 15:38:10 web2 kernel: BUG: soft lockup detected on CPU#3!
Oct 26 15:38:10 web2 kernel:
Oct 26 15:38:10 web2 kernel: Call Trace:
Oct 26 15:38:10 web2 kernel: <IRQ> [<ffffffff800b6475>] softlockup_tick+0xd5/0xe7
Oct 26 15:38:10 web2 kernel: [<ffffffff80091040>] tasklet_action+0x62/0xac
Oct 26 15:38:10 web2 kernel: [<ffffffff8009413e>] update_process_times+0x42/0x68

The only other syslog messages that might be of interest were reported on a different system, some hours later, when I attempted to reboot the cluster nodes.

Oct 26 18:33:51 web3 kernel: o2net: connection to node web1 (num 1) at x.x.x.x:7777 has been idle for 30.0 seconds, shutting it down.
Oct 26 18:33:51 web3 kernel: (0,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225046001.897649 now 1225046031.898315 dr 1225046001.897645 adv 1225046001.897651:1225046001.897652 func (efd9a54a:502) 1225046001.881165:1225046001.881184)
Oct 26 18:33:51 web3 kernel: o2net: no longer connected to node web1 (num 1) at 10.5.7.10:7777
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_broadcast_vote:731 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_do_request_vote:804 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_query_inode_wipe:760 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):ocfs2_delete_inode:851 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_send_remote_unlock_request:360 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (30119,0):dlm_wait_for_node_death:365 E7CF726AF2D745CC82E521F820198D83: waiting 5000ms for notification of death of node 1
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -112
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_drop_lockres_ref:2284 ERROR: status = -107
Oct 26 18:33:51 web3 kernel: (4389,3):dlm_purge_lockres:189 ERROR: status = -107

[and pages more of these status codes]

Thanks in advance,
Frans
Sunil Mushran
2008-Oct-27 20:45 UTC
[Ocfs2-users] Soft lockup preceding directory becoming inaccessible.
find and du are not the same. A find may only need to read the directory entries, and thus only locks the directory inode. OTOH, du needs to read all the inodes in the directory tree.

The stack shows a soft lockup that could have been triggered if the process was extending the file, ftruncate(), to a very large size. As ocfs2 1.2 does not support sparse files, it would not only need to allocate the space on disk, it would also need to clear out the pages in memory. If this is happening frequently, look into upgrading to ocfs2 1.4, as it supports sparse files. For more, refer to the 1.4 user's guide.

The last issue looks similar to a bugzilla we fixed in 1.2.9-1. So if not 1.4, at least upgrade to 1.2.9-1.

Sunil
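[Editor's note: the find-versus-du distinction above comes down to which system calls each tool issues. The minimal userspace sketch below is not OCFS2 code; list_names() and sum_sizes() are made-up names, and real du sums block counts rather than st_size, but the access patterns are the point: listing names needs only readdir() on the directory, while summing sizes must stat() every entry, and on a cluster filesystem each stat() takes that file's inode lock, so a single inode stuck behind a cluster lock stalls the whole walk.]

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Roughly what a plain "find" needs: read the directory entries.
 * Only the directory inode's lock is involved. */
static void list_names(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;

    if (!d)
        return;
    while ((de = readdir(d)) != NULL)
        printf("%s\n", de->d_name);
    closedir(d);
}

/* Roughly what "du" does: stat() every entry to add up sizes.
 * Each stat() must take that file's inode lock, so one busy inode
 * stalls the entire walk. */
static void sum_sizes(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    struct stat st;
    char path[4096];
    long long total = 0;

    if (!d)
        return;
    while ((de = readdir(d)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0)
            total += st.st_size;   /* real du uses block counts; same locking */
    }
    closedir(d);
    printf("%lld bytes\n", total);
}

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";

    list_names(dir);
    sum_sizes(dir);
    return 0;
}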
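[Editor's note: the extend-to-a-large-size case Sunil describes can be sketched as below. The file path and size are hypothetical; the sketch only shows that on ocfs2 1.2, which lacks sparse files, growing a file this way forces allocation and zeroing of all the intervening space, whereas a sparse-capable filesystem such as ocfs2 1.4 simply records the new size.]

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical cache file on the shared ocfs2 mount */
    int fd = open("/ocfs2/cache/example.dat", O_WRONLY | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Grow the file to 1 GiB without writing any data ourselves.
     * Without sparse file support (ocfs2 1.2) the filesystem has to
     * allocate and zero all of that space before returning; with
     * sparse files (ocfs2 1.4) it only records the new size. */
    if (ftruncate(fd, (off_t)1 << 30) < 0)
        perror("ftruncate");

    close(fd);
    return 0;
}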