Marek Królikowski
2011-Dec-20 17:46 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Sorry, I didn't copy everything:

TEST-MAIL1# echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859
TEST-MAIL1# echo "ls //orphan_dir:0001"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
6074335 30371669 285493670

TEST-MAIL2 ~ # echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859
TEST-MAIL2 ~ # echo "ls //orphan_dir:0001"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
6074335 30371669 285493670

Thanks for your help.

From: Marek Królikowski
Sent: Tuesday, December 20, 2011 6:39 PM
To: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

> I think you are running into a known issue. Are there lot of orphan
> files in orphan directory? I am not sure if the problem is still there,
> if not please run the same test and once you see the same symptoms,
> please run the following and provide me the output
>
> echo "ls //orphan_dir:0000"|debugfs.ocfs2 <device>|wc
> echo "ls //orphan_dir:0001"|debugfs.ocfs2 <device>|wc

Hello
Thank you for the answer - strange, I didn't get an email with your answer.
This is what you asked for:

TEST-MAIL1# echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859

TEST-MAIL2 ~ # echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859

This is my testing cluster, so if you need more tests, please tell me and I will run them for you.
Thanks
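For reference, the per-slot checks shown above can be wrapped in a short loop so both orphan directories are counted in one go - a sketch only, assuming a two-slot volume on /dev/dm-0 as used in this thread:

#!/bin/bash
# Count orphan-directory entries for each node slot (sketch; two slots assumed).
DEV=/dev/dm-0
for slot in 0000 0001; do
    echo "orphan_dir:$slot"
    echo "ls //orphan_dir:$slot" | debugfs.ocfs2 "$DEV" | wc
done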
Srinivas Eeda
2011-Dec-20 18:58 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Marek Królikowski wrote:
> Sorry, I didn't copy everything:
> TEST-MAIL1# echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
> debugfs.ocfs2 1.6.4
> 5239722 26198604 246266859
  ^^^^^

Those numbers (5239722, 6074335) are the problem. What they tell us is that the orphan directory is filled with a flood of files. This is because of the change in unlink behavior introduced by patch "ea455f8ab68338ba69f5d3362b342c115bea8e13".

If you are interested in the details: in the normal unlink case, an entry for the file being deleted is created in the orphan directory as an intermediate step, and the entry is cleared towards the end of the unlink process. Because of that patch, the entry doesn't get cleared and sticks around. OCFS2 has an orphan scan, run from a thread, which takes an EX lock on the orphan scan lock and then scans to clear all the entries, but it can't clear them because the open lock is still around. Since the scan takes longer and longer as the huge number of entries builds up, *new deletes get delayed*, because they need the same EX lock.

So what can be done? For now, if you are not using the quota feature, you should build a new kernel with the following patches backed out:

5fd131893793567c361ae64cbeb28a2a753bbe35
f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a
ea455f8ab68338ba69f5d3362b342c115bea8e13

or periodically umount the file system on all nodes and remount whenever the problem becomes severe.

Thanks,
--Srini
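Backing those commits out of a kernel git tree would look roughly like the following - a sketch only, assuming the tree is managed with git and the reverts apply cleanly (the source path is a placeholder; adjust the revert order if git reports conflicts, then rebuild and install the kernel):

# Revert the three patches listed above, then rebuild the kernel.
cd /usr/src/linux        # placeholder path to the kernel source tree
git revert --no-edit 5fd131893793567c361ae64cbeb28a2a753bbe35
git revert --no-edit f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a
git revert --no-edit ea455f8ab68338ba69f5d3362b342c115bea8e13
make && make modules_install install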
Marek Królikowski
2011-Dec-20 21:32 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello
I compiled the kernel with all the patches and created a new FS:

TEST-MAIL1 ~ # mkfs.ocfs2 -N 2 -L MAIL /dev/dm-0
mkfs.ocfs2 1.6.4
Cluster stack: classic o2cb
Overwriting existing ocfs2 partition.
Proceed (y/N): Y
Label: MAIL
Features: sparse backup-super unwritten inline-data strict-journal-super xattr
Block size: 4096 (12 bits)
Cluster size: 4096 (12 bits)
Volume size: 1729073381376 (422137056 clusters) (422137056 blocks)
Cluster groups: 13088 (tail covers 2784 clusters, rest cover 32256 clusters)
Extent allocator size: 868220928 (207 groups)
Journal size: 268435456
Node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 6 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful

TEST-MAIL1 ~ # mount /dev/dm-0 /mnt/EMC
TEST-MAIL2 ~ # mount /dev/dm-0 /mnt/EMC
TEST-MAIL1 ~ # cat /proc/mounts |grep EMC
/dev/dm-0 /mnt/EMC ocfs2 rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,coherency=full,user_xattr,acl 0 0

Now I am running my script on both servers - we will see what happens tomorrow.
Again, thank you for your time.
srinivas eeda
2011-Dec-22 20:12 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
We need to know what happened to node 2. Was the node rebooted because of a network timeout or a kernel panic? Can you please configure netconsole and a serial console and rerun the test?

On 12/22/2011 8:08 AM, Marek Królikowski wrote:
> Hello
> After 24 hours I see TEST-MAIL2 rebooted (possible kernel panic), but
> TEST-MAIL1 got this in dmesg:
> TEST-MAIL1 ~ # dmesg
> [cut]
> o2net: accepted connection from node TEST-MAIL2 (num 1) at 172.17.1.252:7777
> o2dlm: Node 1 joins domain B24C4493BBC74FEAA3371E2534BB3611
> o2dlm: Nodes in domain B24C4493BBC74FEAA3371E2534BB3611: 0 1
> o2net: connection to node TEST-MAIL2 (num 1) at 172.17.1.252:7777 has been idle for 60.0 seconds, shutting it down.
> (swapper,0,0):o2net_idle_timer:1562 Here are some times that might help debug the situation: (Timer: 33127732045, Now 33187808090, DataReady 33127732039, Advance 33127732051-33127732051, Key 0xebb9cd47, Func 506, FuncTime 33127732045-33127732048)
> o2net: no longer connected to node TEST-MAIL2 (num 1) at 172.17.1.252:7777
> (du,5099,12):dlm_do_master_request:1324 ERROR: link to 1 went down!
> (du,5099,12):dlm_get_lock_resource:907 ERROR: status = -112
> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: B24C4493BBC74FEAA3371E2534BB3611: res M000000000000000000000cf023ef70, error -112 send AST to node 1
> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -112
> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: B24C4493BBC74FEAA3371E2534BB3611: res P000000000000000000000000000000, error -107 send AST to node 1
> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -107
> (kworker/u:3,5071,0):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 60.0 seconds, giving up and returning errors.
> (o2hb-B24C4493BB,14310,0):o2dlm_eviction_cb:267 o2dlm has evicted node 1 from group B24C4493BBC74FEAA3371E2534BB3611
> (ocfs2rec,5504,6):dlm_get_lock_resource:834 B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least one node (1) to recover before lock mastery can begin
> (ocfs2rec,5504,6):dlm_get_lock_resource:888 B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least one node (1) to recover before lock mastery can begin
> (du,5099,12):dlm_restart_lock_mastery:1213 ERROR: node down! 1
> (du,5099,12):dlm_wait_for_lock_mastery:1030 ERROR: status = -11
> (du,5099,12):dlm_get_lock_resource:888 B24C4493BBC74FEAA3371E2534BB3611:N000000000020924f: at least one node (1) to recover before lock mastery can begin
> (dlm_reco_thread,14322,0):dlm_get_lock_resource:834 B24C4493BBC74FEAA3371E2534BB3611:$RECOVERY: at least one node (1) to recover before lock mastery can begin
> (dlm_reco_thread,14322,0):dlm_get_lock_resource:868 B24C4493BBC74FEAA3371E2534BB3611: recovery map is not empty, but must master $RECOVERY lock now
> (dlm_reco_thread,14322,0):dlm_do_recovery:523 (14322) Node 0 is the Recovery Master for the Dead Node 1 for Domain B24C4493BBC74FEAA3371E2534BB3611
> (ocfs2rec,5504,6):ocfs2_replay_journal:1549 Recovering node 1 from slot 1 on device (253,0)
> (ocfs2rec,5504,6):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 1
> (kworker/u:0,2909,0):ocfs2_finish_quota_recovery:599 Finishing quota recovery in slot 1
>
> And I tried to run these commands:
> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory
> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory
>
> But they are not working....
>
>
> -----Original Message----- From: Srinivas Eeda
> Sent: Wednesday, December 21, 2011 8:43 PM
> To: Marek Królikowski
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
>
> Those numbers look good. Basically, with the fixes backed out and another
> fix I gave, you are not seeing that many orphans hanging around, and
> hence not seeing the stuck-process kernel stacks. You can run the test
> longer or, if you are satisfied, please enable quotas and re-run the test
> with the modified kernel. You might see a deadlock which needs to be
> fixed (I was not able to reproduce this yet). If the system hangs, please
> capture the following and provide me the output:
>
> 1. echo t > /proc/sysrq-trigger
> 2. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
> 3. wait for 10 minutes
> 4. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
> 5. echo t > /proc/sysrq-trigger
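The five capture steps quoted above can be put into a small script so they run in one shot when a hang is noticed - a sketch only; note that on this reporter's build the debugfs.ocfs2 log-mask command was rejected (see the "Unable to write log mask" errors above), in which case only the sysrq dumps will be produced:

#!/bin/bash
# Capture hang diagnostics as requested earlier in the thread (sketch only).
# The sysrq 't' dumps land in dmesg / the kernel log.
echo t > /proc/sysrq-trigger
debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
sleep 600    # wait for 10 minutes
debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
echo t > /proc/sysrq-trigger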
Marek Królikowski
2011-Dec-22 20:49 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
OK, I reconfigured the servers and am running the test again - hopefully it dies again tomorrow, because I see in the log that it crashed after 10 hours of working with no problem.
Thanks

-----Original Message----- From: srinivas eeda
Sent: Thursday, December 22, 2011 9:12 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

We need to know what happened to node 2. Was the node rebooted because of a network timeout or a kernel panic? Can you please configure netconsole and a serial console and rerun the test?
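Setting up netconsole, as asked for above, can be done with a module parameter along these lines - a sketch only; the ports, addresses and MAC address here are hypothetical placeholders for a log-collecting host on the cluster network:

# Send kernel messages (including panic output) over UDP to a remote listener.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
modprobe netconsole netconsole=6665@172.17.1.251/eth0,6666@172.17.1.250/00:11:22:33:44:55
# On the receiving host (hypothetical 172.17.1.250), run a UDP listener
# such as netcat or a syslog daemon on port 6666 and log to a file.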
Marek Królikowski
2011-Dec-23 16:55 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello
After 24 hours of testing, both servers are still working with no problem.
I am running one more script on both servers:

TEST-MAIL1 ~ # cat terror3.sh
#!/bin/bash
while true
do
 du -sh /mnt/EMC/TEST-MAIL2
 find /mnt/EMC/TEST-MAIL2
 sleep 30
done;

TEST-MAIL2 ~ # cat terror3.sh
#!/bin/bash
while true
do
 du -sh /mnt/EMC/TEST-MAIL1
 find . /mnt/EMC/TEST-MAIL1
 sleep 30
done;

This script runs find and du -sh on the files that the other machine uploads to ocfs2.
Cheers

-----Original Message----- From: srinivas eeda
Sent: Thursday, December 22, 2011 9:12 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

We need to know what happened to node 2. Was the node rebooted because of a network timeout or a kernel panic? Can you please configure netconsole and a serial console and rerun the test?
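The write side of the test (the upload script mentioned above) is not shown anywhere in the thread; purely as an illustration of the kind of create/delete churn that floods the orphan directory, it might look something like this hypothetical loop (paths and sizes invented):

#!/bin/bash
# Hypothetical writer loop - NOT the actual script used in this thread.
# Constantly creating and unlinking files is what populates (and, with the
# problematic patch, never empties) the orphan directory.
DIR=/mnt/EMC/TEST-MAIL1     # adjust per node
mkdir -p "$DIR"
while true
do
 for i in $(seq 1 1000); do
  dd if=/dev/zero of="$DIR/file.$i" bs=4k count=16 2>/dev/null
 done
 rm -f "$DIR"/file.*
done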
Marek Królikowski
2011-Dec-23 21:19 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello
I got this oops on TEST-MAIL2:

INFO: task ocfs2dc:15430 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ocfs2dc D ffff88107f232c40 0 15430 2 0x00000000
 ffff881014889080 0000000000000046 ffff881000000000 ffff88102060c080
 0000000000012c40 ffff88101eefbfd8 0000000000012c40 ffff88101eefa010
 ffff88101eefbfd8 0000000000012c40 0000000000000001 00000001130a4380
Call Trace:
 [<ffffffff8148db41>] ? __mutex_lock_slowpath+0xd1/0x140
 [<ffffffff8148da53>] ? mutex_lock+0x23/0x40
 [<ffffffff81181eb6>] ? dqget+0x246/0x3a0
 [<ffffffff81182281>] ? __dquot_initialize+0x121/0x210
 [<ffffffff8114c90d>] ? d_kill+0x9d/0x100
 [<ffffffffa0a601c3>] ? ocfs2_find_local_alias+0x23/0x100 [ocfs2]
 [<ffffffffa0a7fca8>] ? ocfs2_delete_inode+0x98/0x3e0 [ocfs2]
 [<ffffffffa0a7106c>] ? ocfs2_unblock_lock+0x10c/0x770 [ocfs2]
 [<ffffffffa0a80969>] ? ocfs2_evict_inode+0x19/0x40 [ocfs2]
 [<ffffffff8114e9cc>] ? evict+0x8c/0x170
 [<ffffffffa0a5fccd>] ? ocfs2_dentry_lock_put+0x5d/0x90 [ocfs2]
 [<ffffffffa0a7177a>] ? ocfs2_process_blocked_lock+0xaa/0x280 [ocfs2]
 [<ffffffff8107beb2>] ? prepare_to_wait+0x82/0x90
 [<ffffffff8107bceb>] ? finish_wait+0x4b/0xa0
 [<ffffffffa0a71aa0>] ? ocfs2_downconvert_thread+0x150/0x270 [ocfs2]
 [<ffffffff8107bb60>] ? wake_up_bit+0x40/0x40
 [<ffffffffa0a71950>] ? ocfs2_process_blocked_lock+0x280/0x280 [ocfs2]
 [<ffffffffa0a71950>] ? ocfs2_process_blocked_lock+0x280/0x280 [ocfs2]
 [<ffffffff8107b686>] ? kthread+0x96/0xa0
 [<ffffffff81498a74>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff8107b5f0>] ? kthread_worker_fn+0x190/0x190
 [<ffffffff81498a70>] ? gs_change+0x13/0x13
INFO: task kworker/0:1:30806 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:1 D ffff88107f212c40 0 30806 2 0x00000000
 ffff8810152f4080 0000000000000046 0000000000000000 ffffffff81a0d020
 0000000000012c40 ffff880c28a57fd8 0000000000012c40 ffff880c28a56010
 ffff880c28a57fd8 0000000000012c40 ffff880c28a57a08 00000001152f4080
Call Trace:
 [<ffffffff8148d45d>] ? schedule_timeout+0x1ed/0x2d0
 [<ffffffffa05244c0>] ? __jbd2_journal_file_buffer+0xd0/0x230 [jbd2]
 [<ffffffff8148ce5c>] ? wait_for_common+0x12c/0x1a0
 [<ffffffff81052230>] ? try_to_wake_up+0x280/0x280
 [<ffffffff81085e21>] ? ktime_get+0x61/0xf0
 [<ffffffffa0a6e850>] ? __ocfs2_cluster_lock+0x1f0/0x780 [ocfs2]
 [<ffffffff81046fa7>] ? find_busiest_group+0x1f7/0xb00
 [<ffffffffa0a73a56>] ? ocfs2_inode_lock_full_nested+0x126/0x540 [ocfs2]
 [<ffffffffa0ad4da9>] ? ocfs2_lock_global_qf+0x29/0xd0 [ocfs2]
 [<ffffffffa0ad4da9>] ? ocfs2_lock_global_qf+0x29/0xd0 [ocfs2]
 [<ffffffffa0ad71df>] ? ocfs2_sync_dquot_helper+0xbf/0x330 [ocfs2]
 [<ffffffffa0ad7120>] ? ocfs2_acquire_dquot+0x390/0x390 [ocfs2]
 [<ffffffff81181c3a>] ? dquot_scan_active+0xda/0x110
 [<ffffffffa0ad4ca0>] ? ocfs2_global_is_id+0x60/0x60 [ocfs2]
 [<ffffffffa0ad4cc1>] ? qsync_work_fn+0x21/0x40 [ocfs2]
 [<ffffffff810753f3>] ? process_one_work+0x123/0x450
 [<ffffffff8107690b>] ? worker_thread+0x15b/0x370
 [<ffffffff810767b0>] ? manage_workers+0x110/0x110
 [<ffffffff810767b0>] ? manage_workers+0x110/0x110
 [<ffffffff8107b686>] ? kthread+0x96/0xa0
 [<ffffffff81498a74>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff8107b5f0>] ? kthread_worker_fn+0x190/0x190
 [<ffffffff81498a70>] ? gs_change+0x13/0x13

And I can't log in to TEST-MAIL1: after giving the login and password, the console shows the lastlog line but I never get a bash prompt - the console doesn't answer. There is no OOPS or anything like that on the screen.
I have not restarted either server - tell me what to do now.
Thanks

-----Original Message----- From: srinivas eeda
Sent: Thursday, December 22, 2011 9:12 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

We need to know what happened to node 2. Was the node rebooted because of a network timeout or a kernel panic? Can you please configure netconsole and a serial console and rerun the test?
srinivas eeda
2011-Dec-23 21:52 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Please press the sysrq key and 't' to dump the kernel stacks on both nodes, and please email me the messages files.

On 12/23/2011 1:19 PM, Marek Królikowski wrote:
> Hello
> I got this oops on TEST-MAIL2:
>
> INFO: task ocfs2dc:15430 blocked for more than 120 seconds.
> INFO: task kworker/0:1:30806 blocked for more than 120 seconds.
> [...]
>
> And I can't log in to TEST-MAIL1: after giving the login and password, the console shows the lastlog line but I never get a bash prompt - the console doesn't answer. There is no OOPS or anything like that on the screen.
> I have not restarted either server - tell me what to do now.
> Thanks
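One way to collect what is being asked for here, sketched under the assumption that the sysrq output ends up in the kernel ring buffer and /var/log/messages:

# On each node: dump all task stacks, then save a copy of the kernel log to mail.
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq-t-$(hostname).txt
cp /var/log/messages /tmp/messages-$(hostname).txt   # if syslog writes there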
Marek Królikowski
2012-Jan-03 06:29 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello, and happy new year!
I enabled quota and got an oops on both servers and can't log in - the console freezes after giving the correct login and password.
I did sysrq t, s, b and this is what I get:
https://wizja2.tktelekom.pl/ocfs2/2012.01.03-3.1.6/
Anything else you need?
Cheers!

-----Original Message----- From: srinivas eeda
Sent: Friday, December 23, 2011 10:52 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

Please press the sysrq key and 't' to dump the kernel stacks on both nodes, and please email me the messages files.