Hello,

We recently ran into an issue with another one of our OCFS2 clusters where OCFS2 detected on-disk corruption. The filesystem in question has a capacity of 641GB, and I had attempted to remove about 500GB of files. The number of files that I was removing would have been about 20,000.

I started the 'rm -rf' on host2. Shortly after that the filesystem was automatically mounted read-only and the following errors logged:

Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_commit_truncate:6490 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_delete_inode:974 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):__ocfs2_flush_truncate_log:5111 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_free_clusters:1842 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_free_suballoc_bits:1755 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_replay_truncate_records:5039 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_truncate_for_delete:562 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_wipe_inode:733 ERROR: status = -5
Mar 15 14:31:34 host2 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
Mar 15 14:31:34 host2 kernel: OCFS2: ERROR (device xvdb1): ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit count 32256 but claims that 32639 are free
Mar 15 14:31:36 host2 kernel: (1796,1):__ocfs2_flush_truncate_log:5111 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_free_clusters:1842 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_free_suballoc_bits:1755 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_replay_truncate_records:5039 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_truncate_log_worker:5150 ERROR: status = -5
Mar 15 14:31:36 host2 kernel: OCFS2: ERROR (device xvdb1): ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit count 32256 but claims that 32639 are free

I unmounted the filesystem from all three nodes and ran 'fsck.ocfs2 -f' on it. I had to run fsck.ocfs2 twice before it reported clean. A snippet of the fsck output:

[GROUP_FREE_BITS] Group descriptor at block 12257280 claims to have 32639 free bits which is more than 32238 bits indicated by the bitmap. Drop its free bit count down to the total? <y> y
[CHAIN_BITS] Chain 137 in allocator inode 11 has 249271 bits marked free out of 677376 total bits but the block groups in the chain have 248870 free out of 677376 total. Fix this by updating the chain record? <y> y
[CHAIN_GROUP_BITS] Allocator inode 11 has 109425928 bits marked used out of 167772803 total bits but the chains have 109426329 used out of 167772803 total. Fix this by updating the inode counts <y> y
[CLUSTER_ALLOC_BIT] Cluster 12268148 is marked in the global cluster bitmap but it isn't in use. Clear its bit in the bitmap? <y> y
[INODE_ORPHANED] Inode 15975682 was found in the orphan directory. Delete its contents and unlink it? <y> y

*See the attachment for all errors logged and some more output from fsck.ocfs2.

Following the fsck, I remounted the filesystem on all nodes and was able to delete the remainder of the files. I ran a quick test using our application on one of the remaining files and it appeared to be intact, with no data corruption.
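For reference, the check-and-repair sequence was roughly as follows. This is only a sketch: the /mnt/data mount point is a placeholder, the device name is taken from the "device xvdb1" kernel messages above, and fsck.ocfs2 was only run once the filesystem was unmounted on every node in the cluster.

  # on every node: unmount the OCFS2 filesystem (mount point is a placeholder)
  umount /mnt/data

  # from one node: force a full check; in our case a second pass was needed
  # before it reported clean
  fsck.ocfs2 -f /dev/xvdb1
  fsck.ocfs2 -f /dev/xvdb1

  # once clean, remount on all nodes
  mount /mnt/data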
During the maintenance our application was running, and it may have been writing some files to disk; however, the amount would have been very small (I see a 493K file created at 14:34) compared to our busy times. Our monitoring graphs show a much lighter load than during normal operations, so the OCFS2 filesystem was not under any unusual load apart from the 'rm -rf'.

We are running RHEL 5.2 as Xen guests on all nodes in the cluster, with kernel 2.6.18-92.el5xen and ocfs2-2.6.18-92.el5xen-1.4.1-1.el5 installed. I did a quick search on the mailing list for some of the errors we encountered, but couldn't find any results that seemed to document a similar issue.

I have a snapshot of the LUN with the original data, and I can run some tests on it if necessary and try to reproduce the problem. Note that we had another OCFS2 issue recently (http://oss.oracle.com/pipermail/ocfs2-users/2009-February/003369.html), and we're still investigating that. However, that problem was on a database cluster, which is on a different storage array, and it does not seem to be the same problem.

Has anyone seen this issue before, or does anyone have any advice on how we can troubleshoot it?

Regards,
Jari

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ocfs2-group-descriptor.txt
Url: http://oss.oracle.com/pipermail/ocfs2-users/attachments/20090317/be7354f5/attachment.txt
On Tue, Mar 17, 2009 at 11:37:29AM +0000, Jari Takkala wrote:
> We recently ran into an issue with another one of our OCFS2 clusters where OCFS2 detected on-disk corruption. The filesystem in question has a capacity of 641GB, and I had attempted to remove about 500GB of files. The number of files that I was removing would have been about 20,000.

First and foremost, can you file a bugzilla bug? This is great detail, and it should be captured there. More comments below.

> I started the 'rm -rf' on host2. Shortly after that the filesystem was automatically mounted read-only and the following errors logged:
>
> Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_commit_truncate:6490 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_delete_inode:974 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: (2008,1):__ocfs2_flush_truncate_log:5111 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_free_clusters:1842 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_free_suballoc_bits:1755 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_replay_truncate_records:5039 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_truncate_for_delete:562 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: (2008,1):ocfs2_wipe_inode:733 ERROR: status = -5
> Mar 15 14:31:34 host2 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
> Mar 15 14:31:34 host2 kernel: OCFS2: ERROR (device xvdb1): ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit count 32256 but claims that 32639 are free
> Mar 15 14:31:36 host2 kernel: (1796,1):__ocfs2_flush_truncate_log:5111 ERROR: status = -5
> Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_free_clusters:1842 ERROR: status = -5
> Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_free_suballoc_bits:1755 ERROR: status = -5
> Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_replay_truncate_records:5039 ERROR: status = -5
> Mar 15 14:31:36 host2 kernel: (1796,1):ocfs2_truncate_log_worker:5150 ERROR: status = -5
> Mar 15 14:31:36 host2 kernel: OCFS2: ERROR (device xvdb1): ocfs2_check_group_descriptor: Group descriptor # 12257280 has bit count 32256 but claims that 32639 are free

All your errors are -5, or EIO. They appear to all be coming from the group descriptor error, but your log is very weird - it's almost in the reverse order the functions are called.

> I unmounted the filesystem from all three nodes and ran 'fsck.ocfs2 -f' on it. I had to run fsck.ocfs2 twice before it reported clean. A snippet of the fsck output:
>
> [GROUP_FREE_BITS] Group descriptor at block 12257280 claims to have 32639 free bits which is more than 32238 bits indicated by the bitmap. Drop its free bit count down to the total? <y> y
> [CHAIN_BITS] Chain 137 in allocator inode 11 has 249271 bits marked free out of 677376 total bits but the block groups in the chain have 248870 free out of 677376 total. Fix this by updating the chain record? <y> y
> [CHAIN_GROUP_BITS] Allocator inode 11 has 109425928 bits marked used out of 167772803 total bits but the chains have 109426329 used out of 167772803 total. Fix this by updating the inode counts <y> y

These errors are self-consistent. That is, the higher levels of the chain agree with the lower levels. Of course, they all agree on the bit count at the lowest level that is wrong. How it came to be wrong is the $64k question.

Can you attach the message logs from all nodes to the bugzilla bug? Maybe one of the other nodes did something.

> I have a snapshot of the LUN with the original data, and I can run some tests on it if necessary and try to reproduce the problem. Note that we had another OCFS2 issue recently (http://oss.oracle.com/pipermail/ocfs2-users/2009-February/003369.html), and we're still investigating that. However, that problem was on a database cluster, which is on a different storage array, and it does not seem to be the same problem.

I'm guessing this was a global bitmap cluster group based on the function call chain, but I'd like to verify. Is it possible to get an o2image of the volume for us to look at? o2image should create an image without data so that it's safe to send to us.

Joel

--
"Reality is merely an illusion, albeit a very persistent one."
        - Albert Einstein

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
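To put numbers on the "self-consistent" observation above: taking only the figures quoted from the fsck output, each of the three mismatches differs by the same amount (this arithmetic is not something fsck reports directly):

  32639     - 32238     = 401   (group descriptor free bits vs. its bitmap)
  249271    - 248870    = 401   (chain 137 record free bits vs. its block groups)
  109426329 - 109425928 = 401   (chains' used bits vs. the allocator inode's used count)

A single 401-bit error in the free count of that one group descriptor, carried up through the chain record and the allocator inode totals, would account for all three messages; that reading is an inference from the quoted output, not a conclusion the tools state.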
Hi Joel,

Thanks for your response. My comments are inline below.

----- "Joel Becker" <Joel.Becker at oracle.com> wrote:

> First and foremost, can you file a bugzilla bug? This is great
> detail, and it should be captured there. More comments below.

Done, bug 1090 opened, http://oss.oracle.com/bugzilla/show_bug.cgi?id=1090.

> All your errors are -5, or EIO. They appear to all be coming
> from the group descriptor error, but your log is very weird - it's
> almost in the reverse order the functions are called.

> These errors are self-consistent. That is, the higher levels of
> the chain agree with the lower levels. Of course, they all agree on
> the bit count at the lowest level that is wrong. How it came to be wrong
> is the $64k question.
> Can you attach the message logs from all nodes to the bugzilla
> bug? Maybe one of the other nodes did something.

The logs I attached are from /var/log/messages; the order is the same in the dmesg buffer. I've attached the same logs to the bugzilla bug. Unfortunately there's nothing more that was logged during that time period than what I've already posted. The only thing I did not save was the hundreds of lines of output from the two fscks.

> I'm guessing this was a global bitmap cluster group based on the
> function call chain, but I'd like to verify. Is it possible to get
> an o2image of the volume for us to look at? o2image should create an
> image without data so that it's safe to send to us.

I've run o2image against the snapshot. I can email that directly to you, or if there is a private FTP server you want me to upload it to, please let me know. It's 5.2MB compressed, 4.2GB uncompressed. Even though it's metadata, I don't think I'll be able to attach it to the bug report for security reasons.

I did a quick 'fsck.ocfs2 -f' on the snapshot of the volume and it reports the group descriptor mismatch problem. I aborted the fsck and didn't make any changes. This snapshot was taken with the filesystem offline on all systems. Following the snapshot I brought the filesystem back online, started our application, and then began the 'rm -rf'. I can do some more tests on the snapshot if necessary. At this time the only modification I've made is to relabel the filesystem so that it does not clash with the actual volume.

Thanks!
Jari
Hi Joel,

I emailed you and Srinivas the image file a few days ago. Can you confirm that you received the attachment?

Jari

----- "Joel Becker" <Joel.Becker at oracle.com> wrote:

> > I've run o2image against the snapshot. I can email that directly to
> > you, or if there is a private FTP server you want me to upload it to
> > please let me know. It's 5.2MB compressed, 4.2GB uncompressed. Even
> > though it's metadata, I don't think I'll be able to attach it to the
> > bug report for security reasons.
>
> Email it to me and copy srinivas.eeda at oracle.com. Don't copy
> ocfs2-users at oss.oracle.com :-)
>