bob findlay (TOC)
2008-Jan-11  03:17 UTC
[Ocfs2-users] systems hang when accessing parts of the OCFS2 file system
Hi everyone
 
Firstly, apologies for the cross-post; I am not sure which list is most
appropriate for this question.  I should also point out that I did not
install OCFS2 and I am not the person who normally looks after this
kind of thing, so please bear that in mind when you make any
suggestions (I will need a lot of detail!)
 
The problem: accessing certain directories within the cluster file
system, e.g. with "ls", causes the process to hang permanently.  I cannot
cancel the request; I have to terminate the session.  This is happening
across multiple nodes, so I am assuming that OCFS2 is the root cause of
the problem.
 
Reading the same directory in debug mode (debugfs.ocfs2) seems to work
fine.  For example, this command hangs my session:
 
[root@jic55124 databases]# ls -l /common/users/cbu/vigourom
Whereas this works fine
 
[root@jic55124 databases]# echo "ls -l /users/cbu/vigourom" | debugfs.ocfs2 -n /dev/sdf1
        25447960        drwxr-xr-x  33  2522  2004            4096 10-Jan-2008 16:30 .
        25447672        drwxr-xr-x   5  3773  2004            4096 30-Nov-2007 14:27 ..
        25447961        drwx------   2  2522  2004            4096  1-Aug-2007 12:06 .ssh
        25447963        -rw-r--r--   1  2522  2004            3814  1-Aug-2007 17:04 addgi_new3.pl
        25447964        -rw-r--r--   1     0     0               0  1-Aug-2007 17:05 allmaize.out
        25447965        -rw-------   1  2522  2004            1741 15-Aug-2007 11:13 .viminfo
        25447966        drwxr-xr-x   3  2522  2004            4096  4-Sep-2007 12:07 .mcop
        25447970        drwxr-xr-x   2  2522  2004            4096  4-Sep-2007 15:43 forUNIGENE
        25447971        -rw-r--r--   1     0     0          325655  1-Aug-2007 15:02 maize.out
        25447972        -rw-r--r--   1     0     0             264  1-Aug-2007 15:42 README
        25447973        -rwxr--r--   1  2522  2004         7209696  8-Aug-2007 14:53 bioperl-1.5.2_102.zip
        25447974        drwxrwsr-x   9  2522  2004            4096 13-Aug-2007 14:59 bioperl-1.5.2_102
        22610705        drwxr-xr-x   2  2522  2004            4096 14-Aug-2007 17:10 perl5lib
        22610706        drwxr-xr-x   3  2522  2004            4096 14-Aug-2007 17:11 .cpan
        22610709        drwx------   4  2522  2004            4096  4-Sep-2007 11:39 .gnome
        22610713        drwx------   4  2522  2004            4096  4-Sep-2007 14:58 .gnome2
        22610719        drwx------   2  2522  2004            4096  4-Sep-2007 11:39 .gnome2_private
        22610720        drwx------   4  2522  2004            4096  4-Sep-2007 11:40 .kde
        229702011       -rw-------   1  2522  2004             771 10-Jan-2008 09:40 .Xauthority
        22610820        drwx------   4  2522  2004            4096  9-Jan-2008 14:08 .gconf
        22610835        drwx------   2  2522  2004            4096 10-Jan-2008 13:41 .gconfd
        22610837        drwxr-xr-x   3  2522  2004            4096  4-Sep-2007 11:39 .nautilus
        22610842        drwxr-xr-x   4  2522  2004            4096  4-Sep-2007 15:27 Desktop
        28545914        drwxr-xr-x   2  2522  2004            4096  4-Sep-2007 11:40 .qt
        28545917        drwxr-xr-x   2  2522  2004            4096  4-Sep-2007 11:42 .fonts
        28545922        drwx------   3  2522  2004            4096  4-Sep-2007 12:13 .mozilla
        4567882         -rw-r--r--   1  2522  2004              53  9-Jan-2008 14:08 .fonts.cache-1
        28545956        -rw-------   1  2522  2004               0  6-Sep-2007 15:30 .ICEauthority
        28545957        -rw-r--r--   1  2522  2004             110  4-Sep-2007 11:42 .fonts.conf
        28545958        -rw-------   1  2522  2004              31  4-Sep-2007 12:07 .mcoprc
        28545959        drwxr-xr-x   2  2522  2004            4096  4-Sep-2007 12:17 .wp
        28545962        drwxr-xr-x   2  2522  2004            4096  4-Sep-2007 15:04 .seqlab-node7
        28545967        -rw-r--r--   1  2522  2004             707  4-Sep-2007 16:16 .seqlab-history
        28545968        drwxr-xr-x   5  2522  2004            4096  4-Sep-2007 15:05 GCGSeqmergeTests
etc
 
stat gives 
 
[root@jic55124 databases]# echo "stat /users/cbu/vigourom" | debugfs.ocfs2 -n /dev/sdf1
        Inode: 25447960   Mode: 0755   Generation: 1766836575 (0x694fc95f)
        FS Generation: 3856768928 (0xe5e19fa0)
        Type: Directory   Attr: 0x0   Flags: Valid 
        User: 2522 (vigourom)   Group: 2004 (cbu)   Size: 4096
        Links: 33   Clusters: 1
        ctime: 0x4786481b -- Thu Jan 10 16:30:19 2008
        atime: 0x46a9a7dc -- Fri Jul 27 09:07:56 2007
        mtime: 0x4786481b -- Thu Jan 10 16:30:19 2008
        dtime: 0x0 -- Thu Jan  1 01:00:00 1970
        ctime_nsec: 0x33de5143 -- 870207811
        atime_nsec: 0x0ba52bb0 -- 195374000
        mtime_nsec: 0x33de5143 -- 870207811
        Last Extblk: 0
        Sub Alloc Slot: 4   Sub Alloc Bit: 544
        Tree Depth: 0   Count: 243   Next Free Rec: 1
        ## Offset        Clusters       Block#
        0  0             1              20289216
 
fsck.ocfs2 gives internal logic failures (or faliures ;) amongst other
things, which sounds pretty bad.  Is it?
 
[root@jic55124 ~]# fsck.ocfs2 -fn /dev/sdf1
Checking OCFS2 filesystem in /dev/sdf1:
  label:              oracle
  uuid:               e4 18 cb 00 24 2f 4d f2 96 b4 6f 3b 0a e9 b2 e8 
  number of blocks:   243930952
  bytes per block:    4096
  number of clusters: 30491369
  bytes per cluster:  32768
  max slots:          24
 
** Skipping journal replay because -n was given.  There may be spurious errors that journal replay would fix. **
/dev/sdf1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
[GROUP_FREE_BITS] Group descriptor at block 177020928 claims to have 2 free bits which is more than 0 bits indicated by the bitmap.n
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
o2fsck_mark_cluster_allocated: Internal logic faliure !! duplicate cluster 22151173
[DIR_ZERO] Inode 149371341 is a zero length directory, clear it? n
[CLUSTER_ALLOC_BIT] Cluster 11553628 is marked in the global cluster bitmap but it isn't in use.  Clear its bit in the bitmap? n
[CLUSTER_ALLOC_BIT] Cluster 16917926 is marked in the global cluster bitmap but it isn't in use.  Clear its bit in the bitmap? n
Pass 2: Checking directory entries.
[DIRENT_INODE_FREE] Directory entry '#74502784' refers to inode number 74502784 which isn't allocated, clear the entry? n
Pass 3: Checking directory connectivity.
[DIR_NOT_CONNECTED] Directory inode 149371341 isn't connected to the filesystem.  Move it to lost+found? n
Pass 4a: checking for orphaned inodes
** Skipping orphan dir replay because -n was given.
Pass 4b: Checking inodes link counts.
[INODE_COUNT] Inode 74502784 has a link count of 0 on disk but directory entry references come to 1. Update the count on disk to man
[INODE_COUNT] Inode 142698567 has a link count of 1 on disk but directory entry references come to 2. Update the count on disk to mn
pass4: Internal logic faliure fsck's thinks inode 149371307 has a link count of 1 but on disk it is 0
[INODE_COUNT] Inode 149371307 has a link count of 1 on disk but directory entry references come to 2. Update the count on disk to mn
[INODE_NOT_CONNECTED] Inode 149371307 isn't referenced by any directory entries.  Move it to lost+found? n
[INODE_COUNT] Inode 149371341 has a link count of 2 on disk but directory entry references come to 0. Update the count on disk to mn
All passes succeeded.
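
A long fsck.ocfs2 report like the one above is easier to compare between runs when it is reduced to a count per problem code. The following is only a minimal sketch: the function name fsck_code_summary is made up for illustration, and it assumes each problem line starts with an upper-case code in square brackets, as in the listing above. The counts say nothing about whether the errors are real; as the fsck banner itself notes, a run with -n skips journal replay and may report spurious errors.

```shell
#!/bin/sh
# Hypothetical helper: count how often each bracketed problem code
# (for example [INODE_COUNT]) appears in saved fsck.ocfs2 output.
# Reads the report on stdin and prints "CODE COUNT", one code per line.
# Assumption: codes are upper-case words in square brackets at the
# very start of a line; all other lines are ignored.
fsck_code_summary() {
    sed -n 's/^\[\([A-Z_]*\)].*/\1/p' |   # keep only the code name
        sort |                            # group identical codes together
        uniq -c |                         # count each group
        awk '{ print $2, $1 }'            # reorder to "CODE COUNT"
}
```

With the report saved to a file (fsck.log is a placeholder name), it could be summarised with: fsck_code_summary < fsck.log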
 
 
This has happened before and was "resolved" by shutting down the cluster
and running fsck.ocfs2, but that doesn't help us prevent it in the
future, so I would really like to resolve it properly.
 
Any suggestions as to how I can narrow down the cause of this problem,
please?  (Or how to fix it would be even better! ;-)
 
Thanks
 
Bob.
 
====================================================
Bob Findlay
The Operations Centre - Norwich BioScience Institutes
Tel: 01603 450474  (2474 internal)
Fax: 01603 450045
====================================================
 
bob findlay (TOC)
2008-Jan-11  05:50 UTC
[Ocfs2-users] RE: [Ocfs2-devel] systems hang when accessing parts of the OCFS2 filesystem
Sorry, should have also stated the obviously necessary information -
[root@jic55124 bin]# cat /proc/fs/ocfs2/version
OCFS2 1.2.7 Tue Oct  9 16:15:59 PDT 2007 (build d443ce77532cea8d1e167ab2de51b8c8)
[root@jic55124 bin]# rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
x86_64
[root@jic55124 bin]# uname -a
Linux jic55124 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
Thanks
 
Bob.
 
====================================================
Bob Findlay
The Operations Centre - Norwich BioScience Institutes
Tel: 01603 450474  (2474 internal)
Fax: 01603 450045
====================================================
bob findlay (TOC)
2008-Jan-11  06:01 UTC
[Ocfs2-users] RE: [Ocfs2-devel] systems hang when accessing parts of the OCFS2 filesystem
Is having both sdf and sdf1 cause for concern, especially as
mounted.ocfs2 -f complains about a bad magic number on sdf?  It doesn't
seem right that both sdf and sdf1 have "oracle" as the label.  We're
mounting by label, and it's sdf1 that gets mounted.
[root@jic55124 bin]# mounted.ocfs2 -d
Device                FS     UUID                                  Label
/dev/sdf              ocfs2  e9b6b495-a72d-4792-9b51-b294702b7ed4  oracle
/dev/sdf1             ocfs2  e418cb00-242f-4df2-96b4-6f3b0ae9b2e8  oracle
/dev/sdg              ocfs2  79a4a600-4f9c-4be0-b983-fbadf44a35d7  temp
[root@jic55124 bin]# mounted.ocfs2 -f
Device                FS     Nodes
/dev/sdf              ocfs2  Unknown: Bad magic number in inode
/dev/sdf1             ocfs2  jic55124, jic55123, node3, node8, node4, node1, node5, node6, node7
/dev/sdg              ocfs2  jic55123
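
Because the volume is mounted by label, it may help to spot shared labels mechanically rather than by eye. The following is a rough sketch only: duplicate_labels is a made-up helper name, and it assumes the mounted.ocfs2 -d output arrives with one device per line (not wrapped by a mail client), a single header line, and labels that contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: list every label that appears on more than one
# device in `mounted.ocfs2 -d` output read from stdin.
# Assumption: whitespace-separated columns Device, FS, UUID, Label,
# one device per line, preceded by one header line.
duplicate_labels() {
    awk 'NR > 1 && NF >= 4 {
             count[$4]++                   # devices seen per label
             devs[$4] = devs[$4] " " $1    # remember which devices
         }
         END {
             for (label in count)
                 if (count[label] > 1)
                     print label ":" devs[label]
         }'
}
```

A possible invocation is: mounted.ocfs2 -d | duplicate_labels
For the listing above this would be expected to report the label oracle against /dev/sdf and /dev/sdf1.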
Thanks
Bob.
====================================================
Bob Findlay
The Operations Centre - Norwich BioScience Institutes
Tel: 01603 450474  (2474 internal)
Fax: 01603 450045
====================================================
Sunil Mushran
2008-Jan-11  10:27 UTC
[Ocfs2-users] RE: [Ocfs2-devel] systems hang when accessing parts of the OCFS2 filesystem
If ls is hanging, it invariably means the dlm is waiting for a node to
respond. Do:

$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

If the ls process is in ocfs2_wait_for_status_completion(), that is
the case. You can find more info on debugging here:
http://oss.oracle.com/osswiki/OCFS2/Debugging

One thing to check is:

$ cat /proc/fs/ocfs2_nodemanager/sock_containers

If you notice that a connect between two nodes is missing, you are
encountering an issue we are currently fixing. That is, a connect
between two live nodes breaks and a reconnect attempt is not made. This
will lead to a hang too.

As far as other issues:

1. Yes, debugfs.ocfs2 will always be able to read the device as it does
   dirty reads. The output for mounted vols may be stale.

2. No, fsck.ocfs2 is not showing real errors. If you run fsck when the
   device is mounted, fsck is not seeing the full picture as the current
   image of the block(s) may be cached by other nodes in the cluster.

3. sdf was formatted with ocfs2 before it was partitioned. While
   mounted.ocfs2 can successfully read the superblock, it errors because
   it is unable to see more of the device (and that is correct). We will
   fix mounted to ignore such cases.

Sunil
Norwich BioScience Institutes > Tel: 01603 450474 (2474 internal) > Fax: 01603 450045 > ====================================================> > > ------------------------------------------------------------------------ > > _______________________________________________ > Ocfs2-users mailing list > Ocfs2-users@oss.oracle.com > http://oss.oracle.com/mailman/listinfo/ocfs2-users