bob findlay (TOC)
2008-Jan-11 03:17 UTC
[Ocfs2-users] systems hang when accessing parts of the OCFS2 file system
Hi everyone

Firstly, apologies for the cross post; I am not sure which list is most appropriate for this question. I should also point out that I did not install OCFS2 and I am not the person who normally looks after these kinds of things, so please bear that in mind when you make any suggestions (I will need a lot of detail!)

The problem: accessing certain directories within the cluster file system, e.g. with "ls", causes the process to hang permanently. I cannot cancel the request; I have to terminate the session. This is happening across multiple nodes, so I am assuming that OCFS2 is the root cause of the problem.

Accessing the directory in debug mode seems to work fine. E.g. this command will hang my session:

[root@jic55124 databases]# ls -l /common/users/cbu/vigourom

Whereas this works fine:

[root@jic55124 databases]# echo "ls -l /users/cbu/vigourom" | debugfs.ocfs2 -n /dev/sdf1
25447960 drwxr-xr-x 33 2522 2004 4096 10-Jan-2008 16:30 .
25447672 drwxr-xr-x 5 3773 2004 4096 30-Nov-2007 14:27 ..
25447961 drwx------ 2 2522 2004 4096 1-Aug-2007 12:06 .ssh
25447963 -rw-r--r-- 1 2522 2004 3814 1-Aug-2007 17:04 addgi_new3.pl
25447964 -rw-r--r-- 1 0 0 0 1-Aug-2007 17:05 allmaize.out
25447965 -rw------- 1 2522 2004 1741 15-Aug-2007 11:13 .viminfo
25447966 drwxr-xr-x 3 2522 2004 4096 4-Sep-2007 12:07 .mcop
25447970 drwxr-xr-x 2 2522 2004 4096 4-Sep-2007 15:43 forUNIGENE
25447971 -rw-r--r-- 1 0 0 325655 1-Aug-2007 15:02 maize.out
25447972 -rw-r--r-- 1 0 0 264 1-Aug-2007 15:42 README
25447973 -rwxr--r-- 1 2522 2004 7209696 8-Aug-2007 14:53 bioperl-1.5.2_102.zip
25447974 drwxrwsr-x 9 2522 2004 4096 13-Aug-2007 14:59 bioperl-1.5.2_102
22610705 drwxr-xr-x 2 2522 2004 4096 14-Aug-2007 17:10 perl5lib
22610706 drwxr-xr-x 3 2522 2004 4096 14-Aug-2007 17:11 .cpan
22610709 drwx------ 4 2522 2004 4096 4-Sep-2007 11:39 .gnome
22610713 drwx------ 4 2522 2004 4096 4-Sep-2007 14:58 .gnome2
22610719 drwx------ 2 2522 2004 4096 4-Sep-2007 11:39 .gnome2_private
22610720 drwx------ 4 2522 2004 4096 4-Sep-2007 11:40 .kde
229702011 -rw------- 1 2522 2004 771 10-Jan-2008 09:40 .Xauthority
22610820 drwx------ 4 2522 2004 4096 9-Jan-2008 14:08 .gconf
22610835 drwx------ 2 2522 2004 4096 10-Jan-2008 13:41 .gconfd
22610837 drwxr-xr-x 3 2522 2004 4096 4-Sep-2007 11:39 .nautilus
22610842 drwxr-xr-x 4 2522 2004 4096 4-Sep-2007 15:27 Desktop
28545914 drwxr-xr-x 2 2522 2004 4096 4-Sep-2007 11:40 .qt
28545917 drwxr-xr-x 2 2522 2004 4096 4-Sep-2007 11:42 .fonts
28545922 drwx------ 3 2522 2004 4096 4-Sep-2007 12:13 .mozilla
4567882 -rw-r--r-- 1 2522 2004 53 9-Jan-2008 14:08 .fonts.cache-1
28545956 -rw------- 1 2522 2004 0 6-Sep-2007 15:30 .ICEauthority
28545957 -rw-r--r-- 1 2522 2004 110 4-Sep-2007 11:42 .fonts.conf
28545958 -rw------- 1 2522 2004 31 4-Sep-2007 12:07 .mcoprc
28545959 drwxr-xr-x 2 2522 2004 4096 4-Sep-2007 12:17 .wp
28545962 drwxr-xr-x 2 2522 2004 4096 4-Sep-2007 15:04 .seqlab-node7
28545967 -rw-r--r-- 1 2522 2004 707 4-Sep-2007 16:16 .seqlab-history
28545968 drwxr-xr-x 5 2522 2004 4096 4-Sep-2007 15:05 GCGSeqmergeTests
etc.

stat gives:

[root@jic55124 databases]# echo "stat /users/cbu/vigourom" | debugfs.ocfs2 -n /dev/sdf1
Inode: 25447960   Mode: 0755   Generation: 1766836575 (0x694fc95f)
FS Generation: 3856768928 (0xe5e19fa0)
Type: Directory   Attr: 0x0   Flags: Valid
User: 2522 (vigourom)   Group: 2004 (cbu)   Size: 4096
Links: 33   Clusters: 1
ctime: 0x4786481b -- Thu Jan 10 16:30:19 2008
atime: 0x46a9a7dc -- Fri Jul 27 09:07:56 2007
mtime: 0x4786481b -- Thu Jan 10 16:30:19 2008
dtime: 0x0 -- Thu Jan 1 01:00:00 1970
ctime_nsec: 0x33de5143 -- 870207811
atime_nsec: 0x0ba52bb0 -- 195374000
mtime_nsec: 0x33de5143 -- 870207811
Last Extblk: 0
Sub Alloc Slot: 4   Sub Alloc Bit: 544
Tree Depth: 0   Count: 243   Next Free Rec: 1
##   Offset   Clusters   Block#
0    0        1          20289216

fsck.ocfs2 gives internal logic failures (or faliures ;) amongst other things, which sounds pretty bad. Is it?

[root@jic55124 ~]# fsck.ocfs2 -fn /dev/sdf1
Checking OCFS2 filesystem in /dev/sdf1:
label:              oracle
uuid:               e4 18 cb 00 24 2f 4d f2 96 b4 6f 3b 0a e9 b2 e8
number of blocks:   243930952
bytes per block:    4096
number of clusters: 30491369
bytes per cluster:  32768
max slots:          24

** Skipping journal replay because -n was given. There may be spurious errors that journal replay would fix. **
/dev/sdf1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
[GROUP_FREE_BITS] Group descriptor at block 177020928 claims to have 2 free bits which is more than 0 bits indicated by the bitmap.n
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
o2fsck_mark_cluster_allocated: Internal logic faliure !! duplicate cluster 22151173
[DIR_ZERO] Inode 149371341 is a zero length directory, clear it? n
[CLUSTER_ALLOC_BIT] Cluster 11553628 is marked in the global cluster bitmap but it isn't in use. Clear its bit in the bitmap? n
[CLUSTER_ALLOC_BIT] Cluster 16917926 is marked in the global cluster bitmap but it isn't in use. Clear its bit in the bitmap? n
Pass 2: Checking directory entries.
[DIRENT_INODE_FREE] Directory entry '#74502784' refers to inode number 74502784 which isn't allocated, clear the entry? n
Pass 3: Checking directory connectivity.
[DIR_NOT_CONNECTED] Directory inode 149371341 isn't connected to the filesystem. Move it to lost+found? n
Pass 4a: checking for orphaned inodes
** Skipping orphan dir replay because -n was given.
Pass 4b: Checking inodes link counts.
[INODE_COUNT] Inode 74502784 has a link count of 0 on disk but directory entry references come to 1. Update the count on disk to man
[INODE_COUNT] Inode 142698567 has a link count of 1 on disk but directory entry references come to 2. Update the count on disk to mn
pass4: Internal logic faliure fsck's thinks inode 149371307 has a link count of 1 but on disk it is 0
[INODE_COUNT] Inode 149371307 has a link count of 1 on disk but directory entry references come to 2. Update the count on disk to mn
[INODE_NOT_CONNECTED] Inode 149371307 isn't referenced by any directory entries. Move it to lost+found? n
[INODE_COUNT] Inode 149371341 has a link count of 2 on disk but directory entry references come to 0. Update the count on disk to mn
All passes succeeded.

This has happened before and was "resolved" by shutting down the cluster and performing a fsck.ocfs2, but that doesn't help us prevent it in the future, so I would really like to resolve it properly.

Any suggestions as to how I can narrow down the cause of this problem, please? (Or how to fix it would be even better! ;-)

Thanks

Bob.

====================================================
Bob Findlay
The Operations Centre - Norwich BioScience Institutes
Tel: 01603 450474 (2474 internal)
Fax: 01603 450045
====================================================
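A minimal sketch of one way to see, on each affected node, which processes are stuck in uninterruptible sleep and which kernel function they are waiting in. This is generic Linux process inspection rather than anything from the report above; the ls path is just the example directory mentioned there:

# Processes in uninterruptible sleep (state D) and their kernel wait channel.
# A hung ls on the OCFS2 directory would be expected to show up here.
ps -eo pid,stat,comm,wchan | awk '$2 ~ /D/'

# Reproduce the hang in the background, then inspect just that process.
ls -l /common/users/cbu/vigourom &
ps -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN -p $!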
bob findlay (TOC)
2008-Jan-11 05:50 UTC
[Ocfs2-users] RE: [Ocfs2-devel] systems hang when accessing parts of the OCFS2 filesystem
Sorry, I should also have stated the obviously necessary information:

[root@jic55124 bin]# cat /proc/fs/ocfs2/version
OCFS2 1.2.7 Tue Oct 9 16:15:59 PDT 2007 (build d443ce77532cea8d1e167ab2de51b8c8)
[root@jic55124 bin]# rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
x86_64
[root@jic55124 bin]# uname -a
Linux jic55124 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Thanks

Bob.

====================================================
Bob Findlay
The Operations Centre - Norwich BioScience Institutes
Tel: 01603 450474 (2474 internal)
Fax: 01603 450045
====================================================
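Since the hang appears on several nodes at once, it is also worth confirming that every node is running the same OCFS2 and kernel build. A minimal sketch, assuming passwordless root ssh; the node names are the ones reported by mounted.ocfs2 later in the thread and may not be the real hostnames (the authoritative list would be /etc/ocfs2/cluster.conf):

#!/bin/sh
# Collect the OCFS2 module version and kernel release from each cluster node.
for node in jic55123 jic55124 node1 node3 node4 node5 node6 node7 node8; do
    echo "== $node =="
    ssh root@$node 'cat /proc/fs/ocfs2/version; uname -r'
done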
bob findlay (TOC)
2008-Jan-11 06:01 UTC
[Ocfs2-users] RE: [Ocfs2-devel] systems hang when accessing parts of the OCFS2 filesystem
Is having both sdf and sdf1 a cause for concern? Especially as mounted.ocfs2 -f complains about a bad magic number on sdf. It doesn't seem right that both sdf and sdf1 have "oracle" as the label? We're mounting by label, and it's sdf1 that gets mounted.

[root@jic55124 bin]# mounted.ocfs2 -d
Device     FS     UUID                                  Label
/dev/sdf   ocfs2  e9b6b495-a72d-4792-9b51-b294702b7ed4  oracle
/dev/sdf1  ocfs2  e418cb00-242f-4df2-96b4-6f3b0ae9b2e8  oracle
/dev/sdg   ocfs2  79a4a600-4f9c-4be0-b983-fbadf44a35d7  temp
[root@jic55124 bin]# mounted.ocfs2 -f
Device     FS     Nodes
/dev/sdf   ocfs2  Unknown: Bad magic number in inode
/dev/sdf1  ocfs2  jic55124, jic55123, node3, node8, node4, node1, node5, node6, node7
/dev/sdg   ocfs2  jic55123

Thanks

Bob.

====================================================
Bob Findlay
The Operations Centre - Norwich BioScience Institutes
Tel: 01603 450474 (2474 internal)
Fax: 01603 450045
====================================================
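One way to see how the two superblocks relate is to read each one directly and compare it with the partition table. A minimal sketch, not taken from the thread, using the same read-only debugfs.ocfs2 pattern as the earlier examples (the "stats" request prints the superblock):

# Where does sdf1 start inside sdf?
fdisk -l /dev/sdf

# Superblock as seen from the whole disk versus the partition.
echo "stats" | debugfs.ocfs2 -n /dev/sdf
echo "stats" | debugfs.ocfs2 -n /dev/sdf1

If the whole-disk superblock reports a different UUID and size than the partition's, it is most likely a leftover from a format done before the disk was partitioned rather than a second live filesystem.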
Sunil Mushran
2008-Jan-11 10:27 UTC
[Ocfs2-users] RE: [Ocfs2-devel] systems hang when accessing parts of the OCFS2 filesystem
If ls is hanging, it invariably means the dlm is waiting for a node to respond. Do:

$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

If the ls process is in ocfs2_wait_for_status_completion(), that is the case. You can find more info on debugging here:
http://oss.oracle.com/osswiki/OCFS2/Debugging

One thing to check is:

$ cat /proc/fs/ocfs2_nodemanager/sock_containers

If you notice that a connect between two nodes is missing, you are encountering an issue we are currently fixing. That is, a connect between two live nodes breaks and a reconnect attempt is not made. This will lead to a hang too.

As far as the other issues:

1. Yes, debugfs.ocfs2 will always be able to read the device as it does dirty reads. The output for mounted volumes may be stale.

2. No, fsck.ocfs2 is not showing real errors. If you run fsck while the device is mounted, fsck is not seeing the full picture, as the current image of the block(s) may be cached by other nodes in the cluster.

3. sdf was formatted with ocfs2 before it was partitioned. While mounted.ocfs2 can successfully read the superblock, it errors because it is unable to see more of the device (and that is correct). We will fix mounted to ignore such cases.

Sunil

bob findlay (TOC) wrote:
> Is having both sdf and sdf1 a cause for concern? Especially as mounted.ocfs2 -f complains
> about a bad magic number on sdf. It doesn't seem right that both sdf and sdf1 have "oracle"
> as the label? We're mounting by label, and it's sdf1 that gets mounted.
>
> [root@jic55124 bin]# mounted.ocfs2 -d
> Device     FS     UUID                                  Label
> /dev/sdf   ocfs2  e9b6b495-a72d-4792-9b51-b294702b7ed4  oracle
> /dev/sdf1  ocfs2  e418cb00-242f-4df2-96b4-6f3b0ae9b2e8  oracle
> /dev/sdg   ocfs2  79a4a600-4f9c-4be0-b983-fbadf44a35d7  temp
> [root@jic55124 bin]# mounted.ocfs2 -f
> Device     FS     Nodes
> /dev/sdf   ocfs2  Unknown: Bad magic number in inode
> /dev/sdf1  ocfs2  jic55124, jic55123, node3, node8, node4, node1, node5, node6, node7
> /dev/sdg   ocfs2  jic55123
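Sunil's two checks (the wchan listing and sock_containers) lend themselves to being run on every node in one go, since a connection that is missing between two live nodes is exactly the symptom he describes. A minimal sketch, again assuming passwordless root ssh and reusing the node names reported by mounted.ocfs2:

#!/bin/sh
# On each node, list tasks whose name or wait channel mentions ocfs2, then
# dump the o2net socket containers so a missing peer connection stands out.
for node in jic55123 jic55124 node1 node3 node4 node5 node6 node7 node8; do
    echo "== $node =="
    ssh root@$node '
        ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | grep -i ocfs2
        echo ---
        cat /proc/fs/ocfs2_nodemanager/sock_containers
    '
done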
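Following on from Sunil's point 2 (fsck on a mounted volume does not see the full picture), a repairing fsck is only meaningful once no node has the volume mounted, which matches how the problem was previously cleared. A minimal sketch of that procedure; the /common mount point is inferred from the paths in the original report and may differ, and -f/-y simply force the check and answer its prompts:

# On every node: unmount the OCFS2 volume.
umount /common

# From any node: confirm nothing still has it mounted.
mounted.ocfs2 -f

# From exactly one node: replay the journals and repair.
fsck.ocfs2 -fy /dev/sdf1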