thr3ads.net - Ocfs2 users - [Ocfs2-users] Diagnosing some OCFS2 error messages [Jun 2010]


Patrick J. LoPresti

2010-Jun-14 02:14 UTC

[Ocfs2-users] Diagnosing some OCFS2 error messages

Hello.  I am experimenting with OCFS2 on SUSE Linux Enterprise Server
11 Service Pack 1.

I am performing various stress tests.  My current exercise involves
writing to files using a shared-writable mmap() from two nodes.  (Each
node mmaps and writes to different files; I am not trying to access
the same file from multiple nodes.)

Both nodes are logging messages like these:

[94355.116255] (ocfs2_wq,5995,6):ocfs2_block_check_validate:443 ERROR: CRC32 failed: stored: 2715161149, computed 575704001.  Applying ECC.
[94355.116344] (ocfs2_wq,5995,6):ocfs2_block_check_validate:457 ERROR: Fixed CRC32 failed: stored: 2715161149, computed 2102707465
[94355.116348] (ocfs2_wq,5995,6):ocfs2_validate_extent_block:903 ERROR: Checksum failed for extent block 2321665
[94355.116352] (ocfs2_wq,5995,6):__ocfs2_find_path:1861 ERROR: status = -5
[94355.116355] (ocfs2_wq,5995,6):ocfs2_find_leaf:1958 ERROR: status = -5
[94355.116358] (ocfs2_wq,5995,6):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
[94355.116361] (ocfs2_wq,5995,6):ocfs2_do_truncate:6900 ERROR: status = -5
[94355.116364] (ocfs2_wq,5995,6):ocfs2_commit_truncate:7559 ERROR: status = -5
[94355.116370] (ocfs2_wq,5995,6):ocfs2_truncate_for_delete:597 ERROR: status = -5
[94355.116373] (ocfs2_wq,5995,6):ocfs2_wipe_inode:770 ERROR: status = -5
[94355.116376] (ocfs2_wq,5995,6):ocfs2_delete_inode:1062 ERROR: status = -5


...although the particular extent block number varies somewhat.
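
For readers decoding the trace: on Linux, errno 5 is EIO, so "status = -5" is presumably -EIO being passed up the call chain after the checksum check failed, rather than a fresh error at each level. That would also be consistent with a failure reported as an I/O error without the block layer logging anything. A minimal sketch of the lookup follows; errno_name is a hypothetical helper, not part of any OCFS2 tool, and it lists only a few well-known Linux errno values.

```shell
#!/bin/sh
# Hypothetical helper: translate a kernel "status = -N" value into the
# symbolic errno name. Only a handful of Linux errno values are listed.
errno_name() {
    # Strip a leading minus sign so both "5" and "-5" are accepted.
    n=${1#-}
    case "$n" in
        1)  echo EPERM ;;
        2)  echo ENOENT ;;
        5)  echo EIO ;;
        12) echo ENOMEM ;;
        22) echo EINVAL ;;
        28) echo ENOSPC ;;
        30) echo EROFS ;;
        *)  echo "unknown($n)" ;;
    esac
}

errno_name -5    # prints EIO
```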

In addition, when I run "fsck.ocfs2 -y -f /dev/md0", I get an I/O
error:

dp-1:~ # fsck.ocfs2 -y -f /dev/md0
fsck.ocfs2 1.4.3
Checking OCFS2 filesystem in /dev/md0:
  Label:              <NONE>
  UUID:               29BB12B5AA4C449E9DDE906405F5BDE4
  Number of blocks:   3221225472
  Block size:         4096
  Number of clusters: 12582912
  Cluster size:       1048576
  Number of slots:    4

/dev/md0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
extent.c: I/O error on channel reading extent block at 2321665 in owner 9704867 for verification
pass1: I/O error on channel while iterating over the blocks for inode 9704867
fsck.ocfs2: I/O error on channel while performing pass 1
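
The geometry in the fsck.ocfs2 header can be sanity-checked with plain shell arithmetic. The sketch below uses only the numbers printed above; the variable names are illustrative. It checks that blocks times block size equals clusters times cluster size, and works out which cluster holds extent block 2321665.

```shell
#!/bin/sh
# Values copied from the fsck.ocfs2 header above.
blocks=3221225472
block_size=4096
clusters=12582912
cluster_size=1048576

# Both products should describe the same volume size in bytes.
bytes_by_blocks=$(( blocks * block_size ))
bytes_by_clusters=$(( clusters * cluster_size ))

# Locate the failing extent block inside its cluster.
blocks_per_cluster=$(( cluster_size / block_size ))
bad_block=2321665
cluster=$(( bad_block / blocks_per_cluster ))
index_in_cluster=$(( bad_block % blocks_per_cluster ))

echo "size via blocks:   $bytes_by_blocks"
echo "size via clusters: $bytes_by_clusters"
echo "block $bad_block is block $index_in_cluster of cluster $cluster"
```

Both products come to 13194139533312 bytes (12 TiB), so the header is self-consistent, and block 2321665 is block 1 of cluster 9069.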



This looks like a straightforward I/O error, right?  The only problem
is that there is nothing in any log (dmesg, /var/log/messages, event
log on the hardware RAID) to indicate any hardware problem.  That is,
when fsck.ocfs2 reports this I/O error, no other errors are logged
anywhere as far as I can tell.  Shouldn't the kernel log a message if
a block device gets an I/O error?

I am using a pair of hardware RAID chassis accessed via iSCSI, and
then using Linux md (RAID-0) to stripe between them.
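
With two members striped by md RAID-0, a given offset on /dev/md0 lives on exactly one of the iSCSI devices, so a read could in principle be repeated against that member directly. The following is only a sketch of the usual chunk-interleaving arithmetic. It assumes equal-sized members (a single striping zone) and measures offsets from the start of each member's data area, ignoring any md superblock or data offset. The chunk size is not given in this thread, so the 64 KiB used here is a placeholder, and raid0_locate is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: map a byte offset on an md RAID-0 array to
# (member index, byte offset within that member's data area).
# Assumptions: equal-sized members, no md data offset, and a chunk
# size supplied by the caller (the thread does not state it).
raid0_locate() {
    offset=$1        # byte offset on the md device
    chunk=$2         # chunk size in bytes (assumed)
    members=$3       # number of member devices
    chunk_no=$(( offset / chunk ))
    member=$(( chunk_no % members ))
    member_off=$(( (chunk_no / members) * chunk + offset % chunk ))
    echo "$member $member_off"
}

# Block 2321665 at 4096 bytes per block, with a hypothetical 64 KiB chunk.
raid0_locate $(( 2321665 * 4096 )) 65536 2
```

The real chunk size and member order would have to be read from the array itself (for example from /proc/mdstat) before trusting any result.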

Questions:

1) I would like to confirm this I/O error for myself using dd.  How do
I map the numbers above ("extent block at 2321665 in owner 9704867")
to an actual offset on the block device so I can try to read the
blocks by hand?

2) Is there any plausible explanation for these errors other than bad hardware?

Thanks!

 - Pat

Brian Kroth

2010-Jun-14 14:22 UTC


Patrick J. LoPresti <lopresti at gmail.com> 2010-06-13 19:14:
> I am using a pair of hardware RAID chassis accessed via iSCSI, and
> then using Linux md (RAID-0) to stripe between them.
I don't believe OCFS2 can currently support any logical volume manager
other than a simple concatenation (and even then only with extreme
caution).  The overhead involved in the lower software layer doing
striping needs to somehow be coordinated among all the nodes in the
cluster; otherwise all fs consistency guarantees provided by the SCSI
layer are lost.

Brian

Sunil Mushran

2010-Jun-14 16:27 UTC


----- bpkroth at gmail.com wrote:
> I don't believe OCFS2 can currently support any logical volume manager
> other than a simple concatenation (and even then only with extreme
> caution).  The overhead involved in the lower software layer doing
> striping needs to somehow be coordinated among all the nodes in the
> cluster; otherwise all fs consistency guarantees provided by the SCSI
> layer are lost.

Not true. For OCFS2 to work with a volume manager, the volume manager
needs to (a) be cluster-aware and (b) use the same cluster stack as the fs.

SLES11 has the Pacemaker (pcmk) cluster stack. Just configure both
OCFS2 and cLVM2 to use pcmk.

Sunil

Sunil Mushran

2010-Jun-14 16:44 UTC


----- lopresti at gmail.com wrote:
> Questions:
> 
> 1) I would like to confirm this I/O error for myself using dd.  How do
> I map the numbers above ("extent block at 2321665 in owner 9704867")
> to an actual offset on the block device so I can try to read the
> blocks by hand?
> 
> 2) Is there any plausible explanation for these errors other than bad
> hardware?
> 
> Thanks!

Well, it is bad hardware and/or bad software. We will need additional
information, and for that you will have to disable metaecc. You can
re-enable it later.

Disable metaecc
# tunefs.ocfs2 --fs-features=nometaecc /dev/sdX

File a bugzilla and attach the three outputs below. Also, just cut-paste
your email.
# debugfs.ocfs2 -R "stats" /dev/sdX >/tmp/sb.out
# debugfs.ocfs2 -R "stat <9704867>" /dev/sdX >/tmp/inode.out
# dd if=/dev/sdX of=/tmp/ext2321665 bs=4K skip=2321665 count=1

Rerun fsck.
# fsck.ocfs2 -fy /dev/sdX
See if it runs clean.
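
On the mapping question, the dd line above implies that the number in the fsck message is a filesystem block number counted in units of the 4096-byte block size, so the byte offset on the device is simply block number times block size. A small sketch of that arithmetic follows. The helper names block_offset and dd_for_block are illustrative, /dev/sdX is the same placeholder used above, and the second helper only prints a command rather than running it.

```shell
#!/bin/sh
# Byte offset of a filesystem block on the device.
block_offset() {
    echo $(( $1 * $2 ))      # block number * block size
}

# Print (do not run) a dd command that reads exactly one block.
# The device and output path are placeholders supplied by the caller.
dd_for_block() {
    dev=$1; blkno=$2; bsize=$3; out=$4
    echo "dd if=$dev of=$out bs=$bsize skip=$blkno count=1"
}

block_offset 2321665 4096
dd_for_block /dev/sdX 2321665 4096 /tmp/ext2321665
```

For extent block 2321665 the offset works out to 9509539840 bytes. With bs equal to the block size, dd's skip argument takes the block number directly, which is why no byte arithmetic appears in the command itself.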

Do you have a test that reproduces this problem and can be shared?
If so, attach that to the bz too.

Definitely log this with Novell. If there is a fix, they'll be the ones
providing it for SLES11. It would also be good if you could file it on
oss.oracle.com.

Sunil
