Patrick J. LoPresti
2010-Jun-14 02:14 UTC
[Ocfs2-users] Diagnosing some OCFS2 error messages
Hello. I am experimenting with OCFS2 on Suse Linux Enterprise Server 11 Service Pack 1.

I am performing various stress tests. My current exercise involves writing to files using a shared-writable mmap() from two nodes. (Each node mmaps and writes to different files; I am not trying to access the same file from multiple nodes.)

Both nodes are logging messages like these:

[94355.116255] (ocfs2_wq,5995,6):ocfs2_block_check_validate:443 ERROR: CRC32 failed: stored: 2715161149, computed 575704001. Applying ECC.
[94355.116344] (ocfs2_wq,5995,6):ocfs2_block_check_validate:457 ERROR: Fixed CRC32 failed: stored: 2715161149, computed 2102707465
[94355.116348] (ocfs2_wq,5995,6):ocfs2_validate_extent_block:903 ERROR: Checksum failed for extent block 2321665
[94355.116352] (ocfs2_wq,5995,6):__ocfs2_find_path:1861 ERROR: status = -5
[94355.116355] (ocfs2_wq,5995,6):ocfs2_find_leaf:1958 ERROR: status = -5
[94355.116358] (ocfs2_wq,5995,6):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
[94355.116361] (ocfs2_wq,5995,6):ocfs2_do_truncate:6900 ERROR: status = -5
[94355.116364] (ocfs2_wq,5995,6):ocfs2_commit_truncate:7559 ERROR: status = -5
[94355.116370] (ocfs2_wq,5995,6):ocfs2_truncate_for_delete:597 ERROR: status = -5
[94355.116373] (ocfs2_wq,5995,6):ocfs2_wipe_inode:770 ERROR: status = -5
[94355.116376] (ocfs2_wq,5995,6):ocfs2_delete_inode:1062 ERROR: status = -5

...although the particular extent block number varies somewhat.

In addition, when I run "fsck.ocfs2 -y -f /dev/md0", I get an I/O error:

dp-1:~ # fsck.ocfs2 -y -f /dev/md0
fsck.ocfs2 1.4.3
Checking OCFS2 filesystem in /dev/md0:
Label: <NONE>
UUID: 29BB12B5AA4C449E9DDE906405F5BDE4
Number of blocks: 3221225472
Block size: 4096
Number of clusters: 12582912
Cluster size: 1048576
Number of slots: 4

/dev/md0 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
extent.c: I/O error on channel reading extent block at 2321665 in owner 9704867 for verification
pass1: I/O error on channel while iterating over the blocks for inode 9704867
fsck.ocfs2: I/O error on channel while performing pass 1

This looks like a straightforward I/O error, right? The only problem is that there is nothing in any log (dmesg, /var/log/messages, event log on the hardware RAID) to indicate any hardware problem. That is, when fsck.ocfs2 reports this I/O error, no other errors are logged anywhere as far as I can tell. Shouldn't the kernel log a message if a block device gets an I/O error?

I am using a pair of hardware RAID chassis accessed via iSCSI, and then using Linux md (RAID-0) to stripe between them.

Questions:

1) I would like to confirm this I/O error for myself using dd. How do I map the numbers above ("extent block at 2321665 in owner 9704867") to an actual offset on the block device so I can try to read the blocks by hand?

2) Is there any plausible explanation for these errors other than bad hardware?

Thanks!

- Pat
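The arithmetic behind question 1 can be sketched as follows. This is a minimal illustration that assumes the extent block number is counted in filesystem blocks of the size fsck.ocfs2 reports (4096 bytes); the helper names are made up for this example and are not part of any OCFS2 tool. It also decodes the "status = -5" values in the log, which are negated Linux errno codes.

```python
import errno

# Values copied from the fsck.ocfs2 output and kernel log above.
BLOCK_SIZE = 4096        # "Block size"
CLUSTER_SIZE = 1048576   # "Cluster size"
EXTENT_BLOCK = 2321665   # block named in the checksum error


def block_to_byte_offset(block, block_size=BLOCK_SIZE):
    """Byte offset of a filesystem block from the start of the device."""
    return block * block_size


def block_to_cluster(block, block_size=BLOCK_SIZE, cluster_size=CLUSTER_SIZE):
    """Return (cluster number, block index within that cluster)."""
    blocks_per_cluster = cluster_size // block_size
    return divmod(block, blocks_per_cluster)


def errno_name(status):
    """Symbolic name for a negative kernel status such as -5."""
    return errno.errorcode.get(-status, "UNKNOWN")


if __name__ == "__main__":
    print(block_to_byte_offset(EXTENT_BLOCK))
    print(block_to_cluster(EXTENT_BLOCK))
    print(errno_name(-5))
```

On Linux, -5 corresponds to EIO, so the chain of "status = -5" lines is the checksum failure being propagated upward as an I/O error rather than a separate report from the block layer.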
Patrick J. LoPresti <lopresti at gmail.com> 2010-06-13 19:14:
> I am using a pair of hardware RAID chassis accessed via iSCSI, and
> then using Linux md (RAID-0) to stripe between them.

I don't believe OCFS2 can currently support any logical volume manager other than a simple concatenation (and even then it's with extreme caution). The overhead involved in the lower software layer doing striping needs to somehow be coordinated among all the nodes in the cluster, else all fs consistency guarantees provided by the SCSI layer are lost.

Brian
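To read the suspect block from the individual iSCSI members rather than through /dev/md0, the striped offset has to be translated into a member device and an offset on it. The sketch below shows only the basic RAID-0 idea under loudly stated assumptions: all members are the same size, the chunk size (64 KiB here) is a hypothetical value that must be replaced with the real one for the array, and any per-member data offset reserved for md metadata is ignored. It is not a description of the md driver's actual code.

```python
def raid0_locate(md_offset, chunk_size, n_members):
    """Map a byte offset on a striped array to (member index, member offset).

    Simplified model: equal-sized members, chunks dealt out round-robin,
    no metadata/data offset on the members (an assumption).
    """
    chunk_index, within_chunk = divmod(md_offset, chunk_size)
    stripe, member = divmod(chunk_index, n_members)
    return member, stripe * chunk_size + within_chunk


if __name__ == "__main__":
    CHUNK_SIZE = 64 * 1024          # hypothetical; check the real array
    N_MEMBERS = 2                   # "a pair of hardware RAID chassis"
    md_offset = 2321665 * 4096      # extent block from the error messages
    print(raid0_locate(md_offset, CHUNK_SIZE, N_MEMBERS))
```

The chunk size and the member data offset of a real array would have to be taken from the array's own metadata before trusting the result.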
----- bpkroth at gmail.com wrote:
> I don't believe OCFS2 can currently support any logical volume
> manager other than a simple concatenation (and even then it's with
> extreme caution). The overhead involved in the lower software layer
> doing striping needs to somehow be coordinated among all the nodes in
> the cluster else all fs consistency guarantees provided by the SCSI
> layer are lost.

Not true. For OCFS2 to work with an LVM, the volume manager needs to (a) be cluster-aware and (b) use the same cluster stack as the fs. SLES11 has the Pacemaker (pcmk) cluster stack. Just configure both OCFS2 and cLVM2 to use pcmk.

Sunil
----- lopresti at gmail.com wrote:
> 1) I would like to confirm this I/O error for myself using dd. How do
> I map the numbers above ("extent block at 2321665 in owner 9704867")
> to an actual offset on the block device so I can try to read the
> blocks by hand?
>
> 2) Is there any plausible explanation for these errors other than bad
> hardware?

Well, it is bad hardware and/or bad software. We will need additional information. And for that, you will have to disable metaecc. You can re-enable it later.

Disable metaecc:
# tunefs.ocfs2 --fs-features=nometaecc /dev/sdX

File a bugzilla and attach the three outputs below. Also, just cut-paste your email.
# debugfs.ocfs2 -R "stats" /dev/sdX >/tmp/sb.out
# debugfs.ocfs2 -R "stat <9704867>" /dev/sdX >inode.out
# dd if=/dev/sdX of=/tmp/ext2321665 bs=4K skip=2321665 count=1

Rerun fsck:
# fsck.ocfs2 -fy /dev/sdX
See if it runs clean.

Do you have a test that reproduces this problem and that can be shared? If so, attach that to the bz too.

Definitely log this with Novell. If there is a fix, they'll be the ones providing it for sles11. And it would be good if you can file on oss.oracle.com too.

Sunil
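The dd invocation above reads one 4 KiB block at block index 2321665. For anyone who prefers to script the same read, for example to pull the block on both nodes and compare the copies, a minimal Python equivalent might look like this. The function name read_block is made up for this example, and /dev/sdX remains a placeholder for the real device, as in the commands above.

```python
import os


def read_block(device_path, block, block_size=4096):
    """Read one filesystem block, like dd bs=<block_size> skip=<block> count=1."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        # pread reads at an absolute offset without moving the file position.
        data = os.pread(fd, block_size, block * block_size)
    finally:
        os.close(fd)
    if len(data) != block_size:
        raise OSError(f"short read: got {len(data)} of {block_size} bytes")
    return data


if __name__ == "__main__":
    # Placeholder device; reading a block device normally requires root.
    raw = read_block("/dev/sdX", 2321665)
    with open("/tmp/ext2321665", "wb") as out:
        out.write(raw)
```

A failure at the block layer would surface here as an OSError from the read, whereas a block that reads back without error but fails the filesystem's checksum points at corrupted contents instead.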