Marek Królikowski
2011-Dec-20 17:46 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Sorry, I didn't copy everything:

TEST-MAIL1# echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859
TEST-MAIL1# echo "ls //orphan_dir:0001"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
6074335 30371669 285493670

TEST-MAIL2 ~ # echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859
TEST-MAIL2 ~ # echo "ls //orphan_dir:0001"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
6074335 30371669 285493670

Thanks for your help.

From: Marek Królikowski
Sent: Tuesday, December 20, 2011 6:39 PM
To: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

> I think you are running into a known issue. Are there a lot of orphan
> files in the orphan directory? I am not sure if the problem is still there;
> if not, please run the same test and, once you see the same symptoms,
> please run the following and provide me the output:
>
> echo "ls //orphan_dir:0000"|debugfs.ocfs2 <device>|wc
> echo "ls //orphan_dir:0001"|debugfs.ocfs2 <device>|wc

Hello
Thank you for the answer - strangely, I didn't get the email with your reply.
This is what you asked for:

TEST-MAIL1# echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859

TEST-MAIL2 ~ # echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
debugfs.ocfs2 1.6.4
5239722 26198604 246266859

This is my testing cluster, so if you need more tests, please tell me and
I will run them for you.
Thanks
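For convenience, the two checks above can be scripted so both orphan slots are
sampled in one pass. A minimal sketch, assuming the same /dev/dm-0 device and
debugfs.ocfs2 in the PATH; "wc -l" counts output lines, so the figure is the
number of orphan entries plus a couple of header lines:

#!/bin/bash
# Count orphan-directory entries in each slot's orphan dir.
DEV=/dev/dm-0
for slot in 0000 0001; do
    printf 'orphan_dir:%s  ' "$slot"
    echo "ls //orphan_dir:$slot" | debugfs.ocfs2 "$DEV" 2>/dev/null | wc -l
done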
Srinivas Eeda
2011-Dec-20 18:58 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Marek Królikowski wrote:
> Sorry, I didn't copy everything:
> TEST-MAIL1# echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
> debugfs.ocfs2 1.6.4
> 5239722 26198604 246266859
^^^^^ those numbers (5239722, 6074335) are the problem. What they tell us is
that the orphan directory is filled with a flood of files. This is because of
the change in unlink behavior introduced by patch
"ea455f8ab68338ba69f5d3362b342c115bea8e13".

If you are interested in the details: in the normal unlink case, an entry for
the file being deleted is created in the orphan directory as an intermediate
step, and that entry is cleared towards the end of the unlink process. Because
of that patch, the entry doesn't get cleared and sticks around. OCFS2 has a
function called orphan scan, which runs as part of a thread that takes an EX
lock on the orphan scan lock and then scans to clear all the entries - but it
can't, because the open lock is still around. Since this can take longer
because of the huge number of entries being created, *new deletes will get
delayed*, as they need the EX lock.

So what can be done? For now, if you are not using the quota feature, you
should build a new kernel by backing out the following patches:

5fd131893793567c361ae64cbeb28a2a753bbe35
f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a
ea455f8ab68338ba69f5d3362b342c115bea8e13

or periodically umount the file system on all nodes and remount whenever the
problem becomes severe.

Thanks,
--Srini

> TEST-MAIL1# echo "ls //orphan_dir:0001"|debugfs.ocfs2 /dev/dm-0|wc
> debugfs.ocfs2 1.6.4
> 6074335 30371669 285493670
>
> TEST-MAIL2 ~ # echo "ls //orphan_dir:0000"|debugfs.ocfs2 /dev/dm-0|wc
> debugfs.ocfs2 1.6.4
> 5239722 26198604 246266859
> TEST-MAIL2 ~ # echo "ls //orphan_dir:0001"|debugfs.ocfs2 /dev/dm-0|wc
> debugfs.ocfs2 1.6.4
> 6074335 30371669 285493670
>
> Thanks for your help.
> [...]
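If the kernel is built from a git tree, backing out those three patches could
look roughly like this (a sketch, not a tested recipe; the revert order may
need adjusting if the patches depend on each other):

# Revert the orphan/quota related patches before rebuilding the kernel.
git revert --no-edit ea455f8ab68338ba69f5d3362b342c115bea8e13
git revert --no-edit f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a
git revert --no-edit 5fd131893793567c361ae64cbeb28a2a753bbe35
# then rebuild and install the kernel as usual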
Marek Królikowski
2011-Dec-20 21:32 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello
I compiled the kernel with all the patches and created a new FS:

TEST-MAIL1 ~ # mkfs.ocfs2 -N 2 -L MAIL /dev/dm-0
mkfs.ocfs2 1.6.4
Cluster stack: classic o2cb
Overwriting existing ocfs2 partition.
Proceed (y/N): Y
Label: MAIL
Features: sparse backup-super unwritten inline-data strict-journal-super xattr
Block size: 4096 (12 bits)
Cluster size: 4096 (12 bits)
Volume size: 1729073381376 (422137056 clusters) (422137056 blocks)
Cluster groups: 13088 (tail covers 2784 clusters, rest cover 32256 clusters)
Extent allocator size: 868220928 (207 groups)
Journal size: 268435456
Node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 6 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful

TEST-MAIL1 ~ # mount /dev/dm-0 /mnt/EMC
TEST-MAIL2 ~ # mount /dev/dm-0 /mnt/EMC
TEST-MAIL1 ~ # cat /proc/mounts |grep EMC
/dev/dm-0 /mnt/EMC ocfs2 rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,coherency=full,user_xattr,acl 0 0

And now I am running my script on both servers - we will see what happens
tomorrow.
Again, thank you for your time.
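Whether the quota features ended up enabled on the new volume can be
double-checked from the superblock. A sketch, assuming debugfs.ocfs2 1.6.x as
above and that usrquota/grpquota are the relevant OCFS2 feature-flag names:

# Dump the superblock and look at the feature flags.
debugfs.ocfs2 -R "stats" /dev/dm-0 | grep -i feature
# usrquota/grpquota should not appear while testing with quotas disabled;
# they can be enabled later, e.g. via tunefs.ocfs2 --fs-features.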
srinivas eeda
2011-Dec-22 20:12 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
We need to know what happened to node 2. Was the node rebooted because of a
network timeout or a kernel panic? Can you please configure netconsole and a
serial console and rerun the test?

On 12/22/2011 8:08 AM, Marek Królikowski wrote:
> Hello
> After 24 hours I saw TEST-MAIL2 reboot (possibly a kernel panic), but
> TEST-MAIL1 got this in dmesg:
> TEST-MAIL1 ~ # dmesg
> [cut]
> o2net: accepted connection from node TEST-MAIL2 (num 1) at 172.17.1.252:7777
> o2dlm: Node 1 joins domain B24C4493BBC74FEAA3371E2534BB3611
> o2dlm: Nodes in domain B24C4493BBC74FEAA3371E2534BB3611: 0 1
> o2net: connection to node TEST-MAIL2 (num 1) at 172.17.1.252:7777 has been
> idle for 60.0 seconds, shutting it down.
> (swapper,0,0):o2net_idle_timer:1562 Here are some times that might help
> debug the situation: (Timer: 33127732045, Now 33187808090, DataReady
> 33127732039, Advance 33127732051-33127732051, Key 0xebb9cd47, Func 506,
> FuncTime 33127732045-33127732048)
> o2net: no longer connected to node TEST-MAIL2 (num 1) at 172.17.1.252:7777
> (du,5099,12):dlm_do_master_request:1324 ERROR: link to 1 went down!
> (du,5099,12):dlm_get_lock_resource:907 ERROR: status = -112
> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR:
> B24C4493BBC74FEAA3371E2534BB3611: res M000000000000000000000cf023ef70,
> error -112 send AST to node 1
> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -112
> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR:
> B24C4493BBC74FEAA3371E2534BB3611: res P000000000000000000000000000000,
> error -107 send AST to node 1
> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -107
> (kworker/u:3,5071,0):o2net_connect_expired:1724 ERROR: no connection
> established with node 1 after 60.0 seconds, giving up and returning errors.
> (o2hb-B24C4493BB,14310,0):o2dlm_eviction_cb:267 o2dlm has evicted node 1
> from group B24C4493BBC74FEAA3371E2534BB3611
> (ocfs2rec,5504,6):dlm_get_lock_resource:834
> B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least
> one node (1) to recover before lock mastery can begin
> (ocfs2rec,5504,6):dlm_get_lock_resource:888
> B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least
> one node (1) to recover before lock mastery can begin
> (du,5099,12):dlm_restart_lock_mastery:1213 ERROR: node down! 1
> (du,5099,12):dlm_wait_for_lock_mastery:1030 ERROR: status = -11
> (du,5099,12):dlm_get_lock_resource:888
> B24C4493BBC74FEAA3371E2534BB3611:N000000000020924f: at least one node (1)
> to recover before lock mastery can begin
> (dlm_reco_thread,14322,0):dlm_get_lock_resource:834
> B24C4493BBC74FEAA3371E2534BB3611:$RECOVERY: at least one node (1) to
> recover before lock mastery can begin
> (dlm_reco_thread,14322,0):dlm_get_lock_resource:868
> B24C4493BBC74FEAA3371E2534BB3611: recovery map is not empty, but must
> master $RECOVERY lock now
> (dlm_reco_thread,14322,0):dlm_do_recovery:523 (14322) Node 0 is the
> Recovery Master for the Dead Node 1 for Domain
> B24C4493BBC74FEAA3371E2534BB3611
> (ocfs2rec,5504,6):ocfs2_replay_journal:1549 Recovering node 1 from slot 1
> on device (253,0)
> (ocfs2rec,5504,6):ocfs2_begin_quota_recovery:407 Beginning quota recovery
> in slot 1
> (kworker/u:0,2909,0):ocfs2_finish_quota_recovery:599 Finishing quota
> recovery in slot 1
>
> And I tried these commands:
> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory
> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory
>
> But they are not working...
>
> -----Original Message----- From: Srinivas Eeda
> Sent: Wednesday, December 21, 2011 8:43 PM
> To: Marek Królikowski
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
>
> Those numbers look good. Basically, with the fixes backed out and another
> fix I gave, you are not seeing that many orphans hanging around and hence
> not seeing the stuck-process kernel stacks. You can run the test longer or,
> if you are satisfied, please enable quotas and re-run the test with the
> modified kernel. You might see a deadlock which needs to be fixed (I was
> not able to reproduce this yet). If the system hangs, please capture the
> following and provide me the output:
>
> 1. echo t > /proc/sysrq-trigger
> 2. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
> 3. wait for 10 minutes
> 4. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
> 5. echo t > /proc/sysrq-trigger
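For the netconsole part of that request, the setup is usually a single
modprobe on the node expected to crash, plus a listener on another host. A
sketch with placeholder interface, collector IP and MAC values for this
environment:

# On the node being watched: send kernel messages over UDP to a collector.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
modprobe netconsole netconsole=@/eth0,6666@172.17.1.250/aa:bb:cc:dd:ee:ff
# On the collector host: capture the stream (netcat flag syntax varies).
nc -u -l -p 6666 | tee netconsole-TEST-MAIL2.log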
Marek Królikowski
2011-Dec-22 20:49 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
OK, I have reconfigured the servers and am running the test again. Hopefully
it will die again tomorrow, because I can see in the log that it crashed
after about 10 hours of working with no problems.
Thanks

-----Original Message----- From: srinivas eeda
Sent: Thursday, December 22, 2011 9:12 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

We need to know what happened to node 2. Was the node rebooted because of a
network timeout or a kernel panic? Can you please configure netconsole and a
serial console and rerun the test?

On 12/22/2011 8:08 AM, Marek Królikowski wrote:
> Hello
> After 24 hours I saw TEST-MAIL2 reboot (possibly a kernel panic), but
> TEST-MAIL1 got this in dmesg:
> [...]
Marek Królikowski
2011-Dec-23 16:55 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello
After a 24-hour test both servers are still working with no problems. I am
now running one more script on both servers:

TEST-MAIL1 ~ # cat terror3.sh
#!/bin/bash
while true
do
du -sh /mnt/EMC/TEST-MAIL2
find /mnt/EMC/TEST-MAIL2
sleep 30
done;

TEST-MAIL2 ~ # cat terror3.sh
#!/bin/bash
while true
do
du -sh /mnt/EMC/TEST-MAIL1
find . /mnt/EMC/TEST-MAIL1
sleep 30
done;

This script runs find and du -sh on the files that the other machine is
uploading to OCFS2.
Cheers

-----Original Message----- From: srinivas eeda
Sent: Thursday, December 22, 2011 9:12 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

We need to know what happened to node 2. Was the node rebooted because of a
network timeout or a kernel panic? Can you please configure netconsole and a
serial console and rerun the test?

On 12/22/2011 8:08 AM, Marek Królikowski wrote:
> Hello
> After 24 hours I saw TEST-MAIL2 reboot (possibly a kernel panic), but
> TEST-MAIL1 got this in dmesg:
> [...]
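A slightly extended variant of terror3.sh that timestamps each pass may make
it easier to line up the last successful iteration with a later crash. A
sketch, using the TEST-MAIL1 paths above (swap the directory on the other
node); the log path /root/terror3.log is chosen arbitrarily:

#!/bin/bash
# Same du/find loop as above, with a timestamp per iteration so the
# last completed pass can be matched against the time of a crash.
LOG=/root/terror3.log
while true; do
    echo "$(date '+%F %T') pass start" >> "$LOG"
    du -sh /mnt/EMC/TEST-MAIL2 >> "$LOG" 2>&1
    find /mnt/EMC/TEST-MAIL2 > /dev/null 2>> "$LOG"
    echo "$(date '+%F %T') pass done" >> "$LOG"
    sleep 30
done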
Marek Królikowski
2011-Dec-23 21:19 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello
I got an oops on TEST-MAIL2:

INFO: task ocfs2dc:15430 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ocfs2dc D ffff88107f232c40 0 15430 2 0x00000000
ffff881014889080 0000000000000046 ffff881000000000 ffff88102060c080
0000000000012c40 ffff88101eefbfd8 0000000000012c40 ffff88101eefa010
ffff88101eefbfd8 0000000000012c40 0000000000000001 00000001130a4380
Call Trace:
[<ffffffff8148db41>] ? __mutex_lock_slowpath+0xd1/0x140
[<ffffffff8148da53>] ? mutex_lock+0x23/0x40
[<ffffffff81181eb6>] ? dqget+0x246/0x3a0
[<ffffffff81182281>] ? __dquot_initialize+0x121/0x210
[<ffffffff8114c90d>] ? d_kill+0x9d/0x100
[<ffffffffa0a601c3>] ? ocfs2_find_local_alias+0x23/0x100 [ocfs2]
[<ffffffffa0a7fca8>] ? ocfs2_delete_inode+0x98/0x3e0 [ocfs2]
[<ffffffffa0a7106c>] ? ocfs2_unblock_lock+0x10c/0x770 [ocfs2]
[<ffffffffa0a80969>] ? ocfs2_evict_inode+0x19/0x40 [ocfs2]
[<ffffffff8114e9cc>] ? evict+0x8c/0x170
[<ffffffffa0a5fccd>] ? ocfs2_dentry_lock_put+0x5d/0x90 [ocfs2]
[<ffffffffa0a7177a>] ? ocfs2_process_blocked_lock+0xaa/0x280 [ocfs2]
[<ffffffff8107beb2>] ? prepare_to_wait+0x82/0x90
[<ffffffff8107bceb>] ? finish_wait+0x4b/0xa0
[<ffffffffa0a71aa0>] ? ocfs2_downconvert_thread+0x150/0x270 [ocfs2]
[<ffffffff8107bb60>] ? wake_up_bit+0x40/0x40
[<ffffffffa0a71950>] ? ocfs2_process_blocked_lock+0x280/0x280 [ocfs2]
[<ffffffffa0a71950>] ? ocfs2_process_blocked_lock+0x280/0x280 [ocfs2]
[<ffffffff8107b686>] ? kthread+0x96/0xa0
[<ffffffff81498a74>] ? kernel_thread_helper+0x4/0x10
[<ffffffff8107b5f0>] ? kthread_worker_fn+0x190/0x190
[<ffffffff81498a70>] ? gs_change+0x13/0x13
INFO: task kworker/0:1:30806 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:1 D ffff88107f212c40 0 30806 2 0x00000000
ffff8810152f4080 0000000000000046 0000000000000000 ffffffff81a0d020
0000000000012c40 ffff880c28a57fd8 0000000000012c40 ffff880c28a56010
ffff880c28a57fd8 0000000000012c40 ffff880c28a57a08 00000001152f4080
Call Trace:
[<ffffffff8148d45d>] ? schedule_timeout+0x1ed/0x2d0
[<ffffffffa05244c0>] ? __jbd2_journal_file_buffer+0xd0/0x230 [jbd2]
[<ffffffff8148ce5c>] ? wait_for_common+0x12c/0x1a0
[<ffffffff81052230>] ? try_to_wake_up+0x280/0x280
[<ffffffff81085e21>] ? ktime_get+0x61/0xf0
[<ffffffffa0a6e850>] ? __ocfs2_cluster_lock+0x1f0/0x780 [ocfs2]
[<ffffffff81046fa7>] ? find_busiest_group+0x1f7/0xb00
[<ffffffffa0a73a56>] ? ocfs2_inode_lock_full_nested+0x126/0x540 [ocfs2]
[<ffffffffa0ad4da9>] ? ocfs2_lock_global_qf+0x29/0xd0 [ocfs2]
[<ffffffffa0ad4da9>] ? ocfs2_lock_global_qf+0x29/0xd0 [ocfs2]
[<ffffffffa0ad71df>] ? ocfs2_sync_dquot_helper+0xbf/0x330 [ocfs2]
[<ffffffffa0ad7120>] ? ocfs2_acquire_dquot+0x390/0x390 [ocfs2]
[<ffffffff81181c3a>] ? dquot_scan_active+0xda/0x110
[<ffffffffa0ad4ca0>] ? ocfs2_global_is_id+0x60/0x60 [ocfs2]
[<ffffffffa0ad4cc1>] ? qsync_work_fn+0x21/0x40 [ocfs2]
[<ffffffff810753f3>] ? process_one_work+0x123/0x450
[<ffffffff8107690b>] ? worker_thread+0x15b/0x370
[<ffffffff810767b0>] ? manage_workers+0x110/0x110
[<ffffffff810767b0>] ? manage_workers+0x110/0x110
[<ffffffff8107b686>] ? kthread+0x96/0xa0
[<ffffffff81498a74>] ? kernel_thread_helper+0x4/0x10
[<ffffffff8107b5f0>] ? kthread_worker_fn+0x190/0x190
[<ffffffff81498a70>] ? gs_change+0x13/0x13

And I can't log in to TEST-MAIL1: after entering the login and password the
console prints the lastlog line but I never get a bash prompt - the console
does not respond... but there is no oops or anything like that on the screen.
I have not restarted either server; tell me what to do now.
Thanks

-----Original Message----- From: srinivas eeda
Sent: Thursday, December 22, 2011 9:12 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

We need to know what happened to node 2. Was the node rebooted because of a
network timeout or a kernel panic? Can you please configure netconsole and a
serial console and rerun the test?

On 12/22/2011 8:08 AM, Marek Królikowski wrote:
> Hello
> After 24 hours I saw TEST-MAIL2 reboot (possibly a kernel panic), but
> TEST-MAIL1 got this in dmesg:
> [...]
srinivas eeda
2011-Dec-23 21:52 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Please press the sysrq key and 't' to dump the kernel stacks on both nodes,
and please email me the messages files.

On 12/23/2011 1:19 PM, Marek Królikowski wrote:
> Hello
> I got an oops on TEST-MAIL2:
>
> INFO: task ocfs2dc:15430 blocked for more than 120 seconds.
> [...]
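If no serial or netconsole log is available, the sysrq-t stacks normally land
in the kernel ring buffer and syslog. A sketch of capturing them on each node
(the syslog file may be /var/log/messages or /var/log/kern.log depending on
the distribution, and the ring buffer must be large enough to hold the dump):

# Trigger a dump of all kernel task stacks, then save the result.
echo t > /proc/sysrq-trigger
dmesg > /root/sysrq-t-$(hostname)-$(date +%Y%m%d%H%M).txt
# or collect the syslog file that receives kernel messages:
# cp /var/log/messages /root/messages-$(hostname).txt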
Marek Królikowski
2012-Jan-03 06:29 UTC
[Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
Hello and happy new year!
I enabled quotas and got an oops on both servers, and I can't log in - the
console freezes after entering the correct login and password.
I did sysrq t, s, b and this is what I got:
https://wizja2.tktelekom.pl/ocfs2/2012.01.03-3.1.6/
Is there anything else you need?
Cheers!

-----Original Message----- From: srinivas eeda
Sent: Friday, December 23, 2011 10:52 PM
To: Marek Królikowski
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

Please press the sysrq key and 't' to dump the kernel stacks on both nodes,
and please email me the messages files.

On 12/23/2011 1:19 PM, Marek Królikowski wrote:
> Hello
> I got an oops on TEST-MAIL2:
> [...]