thr3ads.net - Ocfs2 users - [Ocfs2-users] recovery after node crash [May 2010]

Nikola Ciprich

2010-May-23 20:13 UTC

[Ocfs2-users] recovery after node crash

Hello,

I'm trying to make my cluster more fault tolerant, and while testing
various problem scenarios, I've hit the following problem:

If I have a 2-node Pacemaker cluster with an OCFS2 filesystem on top of
primary/primary DRBD, using ocfs2_controld.pcmk, and I kill one of the nodes,
the OCFS2 filesystem on the surviving node gets stuck and keeps waiting until
the killed node comes back online.

After 120 seconds, the kernel detects hung processes and shows the following backtrace:
May 21 18:23:56 rock1 [ 2520.497485] INFO: task mc:5827 blocked for more than 120 seconds.
May 21 18:23:56 rock1 [ 2520.497489] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 21 18:23:56 rock1 [ 2520.497493] mc            D ffff8801340a8580     0  5827   4694 0x00000000
May 21 18:23:56 rock1 [ 2520.497500]  ffff880137903d08 0000000000000082 0000000000000000 ffff88013b6ce340
May 21 18:23:56 rock1 [ 2520.497507]  0000000000013780 ffff8801335f80c0 ffff8801335f8000 ffffffff81516080
May 21 18:23:56 rock1 [ 2520.497514]  ffff8801335f83b8 ffff880137903fd8 0000000000000282 000000010007cf50
May 21 18:23:56 rock1 [ 2520.497520] Call Trace:
May 21 18:23:56 rock1 [ 2520.497551]  [<ffffffffa05d6055>] ocfs2_wait_for_recovery+0x55/0x90 [ocfs2]
May 21 18:23:56 rock1 [ 2520.497559]  [<ffffffff8106d7b0>] ? autoremove_wake_function+0x0/0x40
May 21 18:23:56 rock1 [ 2520.497580]  [<ffffffffa05ba96f>] ocfs2_inode_lock_full_nested+0x53f/0x7d0 [ocfs2]
May 21 18:23:56 rock1 [ 2520.497597]  [<ffffffffa05d238e>] ocfs2_inode_revalidate+0x6e/0x390 [ocfs2]
May 21 18:23:56 rock1 [ 2520.497607]  [<ffffffffa05c8fdb>] ocfs2_getattr+0x5b/0x260 [ocfs2]
May 21 18:23:56 rock1 [ 2520.497611]  [<ffffffff81113392>] vfs_getattr+0x52/0x80
May 21 18:23:56 rock1 [ 2520.497613]  [<ffffffff81113422>] vfs_fstatat+0x62/0x70
May 21 18:23:56 rock1 [ 2520.497615]  [<ffffffff81113499>] vfs_lstat+0x19/0x20
May 21 18:23:56 rock1 [ 2520.497617]  [<ffffffff811134bf>] sys_newlstat+0x1f/0x50
May 21 18:23:56 rock1 [ 2520.497620]  [<ffffffff8102e46c>] ? do_page_fault+0x19c/0x2e0
May 21 18:23:56 rock1 [ 2520.497623]  [<ffffffff813385cf>] ? page_fault+0x1f/0x30
May 21 18:23:56 rock1 [ 2520.497627]  [<ffffffff8100b32b>] system_call_fastpath+0x16/0x1b

The trace points to ocfs2_wait_for_recovery as the possible culprit. According
to the default values in the sources, unresponsive nodes should be considered
dead after ~60 seconds, so it's strange that the node keeps waiting for so long.
Another thing is that I don't know how to set (or check) the timeouts.
I'm using the pcmk cluster stack, and I can't find any sysfs or proc file
I could use for that.
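
For anyone comparing similar hung-task reports, here is a minimal Python sketch of how the frames of such a call trace could be pulled out of the syslog text to see where the task is sleeping. The names parse_call_trace and blocking_frame are hypothetical helpers made up for this illustration, and the regular expression assumes the frame layout shown in the trace above (address, optional "?", symbol+offset/size, optional module). It only reads the log text; it does not inspect or change any OCFS2 or cluster timeout.

```python
import re

# Matches frames such as
#   [<ffffffffa05d6055>] ocfs2_wait_for_recovery+0x55/0x90 [ocfs2]
# A "?" after the address marks a frame the kernel reports as unreliable.
# \s+ also matches a newline, so mail-wrapped frames are still recognized.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_call_trace(log_text):
    """Return (function, module, reliable) tuples in the order they appear."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        unreliable, function, module = match.groups()
        frames.append((function, module, unreliable is None))
    return frames


def blocking_frame(log_text):
    """Return (function, module) of the first reliable frame, or None."""
    for function, module, reliable in parse_call_trace(log_text):
        if reliable:
            return function, module
    return None


# Three frames copied from the trace above.
sample = """\
May 21 18:23:56 rock1 [ 2520.497551]  [<ffffffffa05d6055>] ocfs2_wait_for_recovery+0x55/0x90 [ocfs2]
May 21 18:23:56 rock1 [ 2520.497559]  [<ffffffff8106d7b0>] ? autoremove_wake_function+0x0/0x40
May 21 18:23:56 rock1 [ 2520.497580]  [<ffffffffa05ba96f>] ocfs2_inode_lock_full_nested+0x53f/0x7d0 [ocfs2]
"""

if __name__ == "__main__":
    print(blocking_frame(sample))
```

Frames without a module tag in brackets belong to the core kernel, so their module field is None.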

Could anybody please advise me what I am doing wrong, or what I should
set/check?
I'm running kernel 2.6.32.13 and ocfs2-tools-1.4.4.

Thanks a lot in advance!

with best regards

nik

Marco

2010-May-24 13:21 UTC

[Ocfs2-users] recovery after node crash

* Nikola Ciprich <extmaillist at linuxbox.cz> [2010 05 23, 22:13]:
> after 120 seconds, kernel detects hung processes, showing following backtraces:
> May 21 18:23:56 rock1 [ 2520.497485] INFO: task mc:5827 blocked for more than 120 seconds.
> May 21 18:23:56 rock1 [ 2520.497489] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I am seeing that message in my logs quite often as well. Details are
available in my previous mail in the list archives:

http://oss.oracle.com/pipermail/ocfs2-users/2010-April/004427.html

Sometimes it has no consequences; other times it seems to lead to a
kernel panic :(

I'm using a SAN volume, kernel 2.6.32, and ocfs2-tools 1.6.0 from

http://oss.oracle.com/git/?p=ocfs2-tools.git;a=commit;h=de50bf25701e8af08aea3664d5982b993ddc7433

Best regards
