Thomas Walker
2007-Jul-17 18:07 UTC
large ext3 filesystem consistently locking itself read-only
We have several large ext3 file system partitions. One of them sets itself to read-only after hitting journal problems. I understand that's a good thing, but obviously I need to correct the underlying problem so that it will stop locking itself.

Here are some details. The OS is Red Hat EL4 x86_64 running on a SunFire v40z; the kernel is 2.6.9-42.0.2.ELsmp. The disk storage in question is external, attached via fiber cable. The fiber HBA is a QLogic ISP2312 connected to a QLogic SAN switch, which is connected to four Apple Xserve RAIDs. There are 8 individual LUNs coming from the four XRaids; they appear on the host as /dev/sd[cdefghij]. Those LUNs are put into two LVM volume groups and then mounted from logical volumes. The partition in question is 8TB, about 92% full at the moment. One oddity about this partition is that it has a subdirectory which contains over 2700 symbolic links to other partitions.

Here is the output from /var/adm/messages the last time the file system locked itself:

Jul 17 09:01:06 kernel: Info fld=0x0, Current sdd: sense key No Sense
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856796
Jul 17 09:01:06 kernel: Aborting journal on device dm-3.
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in start_transaction: Readonly filesystem
Jul 17 09:01:06 kernel: Aborting journal on device dm-3.
Jul 17 09:01:06 kernel: ext3_abort called.
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Jul 17 09:01:06 kernel: Remounting filesystem read-only
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in start_transaction: Journal has aborted
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856797
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856798
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856799
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856800
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in ext3_truncate: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_orphan_del: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_delete_inode: Journal has aborted
Jul 17 09:01:07 kernel: __journal_remove_journal_head: freeing b_committed_data

If I run fsck it does seem to repair bad blocks and clear inodes, but of course for 8TB it takes a long time to run, and the corruption only comes back later. I have considered upgrading the kernel; that could be done. I think part of the problem is the large number of symbolic links on that partition, but without evidence it will be difficult to get people to change it. I also don't like the first line in the messages, about device sdd getting a "No Sense" response to a SCSI sense key request.

Any good advice on how to proceed would be appreciated. I have looked at the dumpe2fs and debugfs tools, but I don't see how to put them to good use in this case.

Thomas Walker
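One possible way to get some use out of the block numbers in the log above is to see whether they cluster in a single block group, which could then be looked up in the dumpe2fs output. The following is only a rough Python sketch, not a diagnosis: the function name blocks_by_group is made up for illustration, and the defaults of 32768 blocks per group and a first data block of 0 are assumptions (typical for a 4 KiB block size) that should be checked against the "Blocks per group" and "First block" values reported by dumpe2fs -h.

```python
import re

# Matches the block number in "ext3_free_blocks_sb: bit already cleared for block N".
BLOCK_RE = re.compile(r"bit already cleared for block (\d+)")


def blocks_by_group(log_text, blocks_per_group=32768, first_data_block=0):
    """Group the reported block numbers by ext3 block group.

    blocks_per_group and first_data_block are ASSUMED defaults for a
    4 KiB block size; confirm both against `dumpe2fs -h` before relying
    on the result.
    """
    groups = {}
    for match in BLOCK_RE.finditer(log_text):
        block = int(match.group(1))
        group = (block - first_data_block) // blocks_per_group
        groups.setdefault(group, []).append(block)
    return groups


# Two of the lines from the log above.
sample = (
    "Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: "
    "bit already cleared for block 786856796\n"
    "Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: "
    "bit already cleared for block 786856800\n"
)

for group, blocks in sorted(blocks_by_group(sample).items()):
    print(group, blocks)
```

If all reported blocks fall into one or two groups, that narrows down which part of the volume (and possibly which underlying LUN) is worth a closer look.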
Christian Kujau
2007-Jul-17 23:10 UTC
large ext3 filesystem consistently locking itself read-only
On Tue, 17 Jul 2007, Thomas Walker wrote:
> Jul 17 09:01:06 kernel: Info fld=0x0, Current sdd: sense key No Sense
> Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb:
> bit already cleared for block 786856796

...the rest of the errors seem to stem from ext3, but what about the "sdd: sense key: .." message above? Are there more device related messages? If so, this could be the cause for ext3 to barf.

> If I run fsck it does seem to repair bad blocks

as in "bad blocks on disk"? No good then. Try to rule out device errors first (HBA driver, cabling, cooling, etc.) before doing more e2fsck work...

--
BOFH excuse #191: Just type 'mv * /dev/null'.
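To check whether there are more device related messages, one option is to split the kernel log into device lines and ext3 lines and compare their timing. Below is a minimal Python sketch of that idea. The classify helper and both regular expressions are hypothetical and based only on the lines quoted in this thread; real SCSI or HBA driver messages may look different, so the patterns would need to be extended for the actual log.

```python
import re

# ASSUMED patterns, derived only from the log lines quoted in this thread.
EXT3_RE = re.compile(r"EXT3-fs error|Aborting journal|ext3_abort")
DEVICE_RE = re.compile(r"\bsd[a-z]+:|sense key")


def classify(line):
    """Return 'ext3', 'device', or 'other' for one kernel log line."""
    if EXT3_RE.search(line):
        return "ext3"
    if DEVICE_RE.search(line):
        return "device"
    return "other"


sample_lines = [
    "Jul 17 09:01:06 kernel: Info fld=0x0, Current sdd: sense key No Sense",
    "Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: "
    "bit already cleared for block 786856796",
    "Jul 17 09:01:06 kernel: Aborting journal on device dm-3.",
    "Jul 17 09:01:06 kernel: Remounting filesystem read-only",
]

for line in sample_lines:
    # Print the category next to the timestamp so the ordering is visible.
    print(classify(line), line[:15])
```

Running something like this over the whole messages file, rather than just the excerpt, would show whether device lines regularly precede the ext3 errors.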