Neil Brown
2002-May-21 06:30 UTC
Bad directories appearing in ext3 after upgrade 2.4.16 -> 2.4.18+cvs
Hi,

I recently upgraded one of my fileservers from 2.4.16 to 2.4.18 plus the ext3-cvs.patch that Andrew Morton pointed me to for addressing an assertion failure.

Since then I have been getting lots of errors like:

May 21 14:07:03 glass kernel: EXT3-fs error (device md(9,0)): ext3_add_entry: bad entry in directory #2945366: rec_len %% 4 != 0 - offset=0, inode=1886221359, rec_len=24927, name_len=109
May 21 14:07:23 glass kernel: EXT3-fs error (device md(9,0)): ext3_readdir: bad entry in directory #2945366: rec_len %% 4 != 0 - offset=0, inode=1886221359, rec_len=24927, name_len=109
May 21 14:07:23 glass kernel: EXT3-fs warning (device md(9,0)): empty_dir: bad directory (dir #2945366) - no `.' or `..'

If I hunt down the directory (find . -inum ...) and do an "ls -la", it appears empty and responds reasonably well to "rmdir".

So far the directories have mostly been browser caches, so no real data has been lost (I think), but it is worrisome.

I will probably revert to 2.4.16 plus the relevant bits of the CVS patch hand-applied. But I wonder if anyone else has seen this or has any idea what might be happening? The directories haven't been moved from buffercache to pagecache between 2.4.16 and 2.4.18 or anything like that, have they?

Possibly related...

My ext3 filesystem is on a raid5 array, with the journal on a separate raid1 array (data=journal mode).

I get quite a few messages in the logs which say:

May 21 14:20:06 glass kernel: raid5: multiple 1 requests for sector 7540536

for a variety of sector numbers.

This means that raid5 has received two separate write requests, with two separate buffer heads, for the same sector. This seems like a filesystem error to me. raid5 tries to apply them in the same order that they were received, but I don't feel confident that this means the *right* thing is happening.

These were happening with 2.4.16, but the incidence seems to have increased with 2.4.18 (though that isn't a very strong observation).

NeilBrown
Neil Brown
2002-May-21 07:13 UTC
Re: Bad directories appearing in ext3 after upgrade 2.4.16 -> 2.4.18+cvs
On Tuesday May 21, neilb@cse.unsw.edu.au wrote:
> Hi,
> I recently upgraded one of my fileservers from 2.4.16 to 2.4.18 plus
> the ext3-cvs.patch that Andrew Morton pointed me to for addressing
> an assertion failure.

I just double checked, and the old kernel wasn't exactly 2.4.16... It was 2.4.17-pre2 plus Andrew Morton's 0.9.16-2417p2 patch for ext3.

NeilBrown
Stephen C. Tweedie
2002-May-21 09:48 UTC
Re: Bad directories appearing in ext3 after upgrade 2.4.16 -> 2.4.18+cvs
Hi,

On Tue, May 21, 2002 at 04:30:49PM +1000, Neil Brown wrote:
> The directories haven't been moved from buffercache to
> pagecache between 2.4.16 and 2.4.18 or anything like that have they?

No, no big changes.

> My ext3 filesystem is on a raid5 array, with the journal on
> a separate raid1 array. (data=journal mode).

Not a configuration I've tested, but I can set up a box here with that config and see if I can reproduce any problems.

> I get quite a few messages in the logs which say:
>
> May 21 14:20:06 glass kernel: raid5: multiple 1 requests for sector 7540536

Can you go into debugfs and find out what inode that sector belongs to? You'll need to convert it into a block number first, then use "icheck".

> For a variety of sector numbers.

> This means that raid5 has received two separate write requests, with
> two separate buffer heads, for the same sector. This seems like a
> filesystem error to me.

Probably, yes. Is this an SMP box? I recently fixed a long-standing bug which could cause that on SMP; it would be worth seeing if the ext3-0.9.18 I posted fixes the problems.

Are you running dump(8)? If so, it could actually be a read request from dump colliding with a write request from the fs; add_write_bh() doesn't seem to distinguish between reads and writes when warning about IO collisions.

Otherwise, I'd really like to track this down by turning the raid warning into a BUG() or BH_ASSERT(), and enabling buffer tracing. (In fact I can probably do that non-destructively, by calling the buffer trace printing manually from add_stripe_bh().)

Cheers,
Stephen
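The sector-to-block conversion Stephen mentions can be sketched as follows. This is only an illustration under stated assumptions: the sector number in the raid5 warning is taken to be a 512-byte sector offset from the start of the md device, the filesystem is assumed to start at sector 0 of that device, and the filesystem block size is not stated anywhere in the thread, so the common ext3 sizes are all tried. The function name is hypothetical.

```python
SECTOR_SIZE = 512  # bytes; assumption: the warning counts 512-byte sectors


def sector_to_fs_block(sector, block_size):
    """Map a 512-byte sector offset on the device to a filesystem block number.

    Assumes the filesystem starts at sector 0 of the device.
    """
    return (sector * SECTOR_SIZE) // block_size


sector = 7540536  # from the raid5 warning quoted in the thread

# The block size is not given in the thread, so show the common ext3 choices.
for block_size in (1024, 2048, 4096):
    print(block_size, sector_to_fs_block(sector, block_size))
```

The resulting block number is what debugfs expects as the argument to "icheck"; the inode number that icheck reports can then be passed to "ncheck" to get a path name. The real block size can be read from the superblock, for example with "dumpe2fs -h" on the device.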
Andreas Dilger
2002-May-21 17:40 UTC
Re: Bad directories appearing in ext3 after upgrade 2.4.16 -> 2.4.18+cvs
On May 21, 2002 16:30 +1000, Neil Brown wrote:
> I recently upgraded one of my fileservers from 2.4.16 to 2.4.18 plus
> the ext3-cvs.patch that Andrew Morton pointed me to for addressing
> an assertion failure.
>
> Since then I have been getting lots of errors like:
>
> May 21 14:07:03 glass kernel: EXT3-fs error (device md(9,0)): ext3_add_entry: bad entry in directory #2945366: rec_len %% 4 != 0 - offset=0, inode=1886221359, rec_len=24927, name_len=109
> May 21 14:07:23 glass kernel: EXT3-fs error (device md(9,0)): ext3_readdir: bad entry in directory #2945366: rec_len %% 4 != 0 - offset=0, inode=1886221359, rec_len=24927, name_len=109
> May 21 14:07:23 glass kernel: EXT3-fs warning (device md(9,0)): empty_dir: bad directory (dir #2945366) - no `.' or `..'

> But I wonder if anyone else has this or has any idea what it might be?

Well, there were several recent reports of ext3 problems, and two reports had the same message. Two of them definitely were also using RAID (RAID 1 and RAID 5). See thread starting at:

http://marc.theaimsgroup.com/?l=linux-kernel&m=102070252100762&w=4

and also an older (but still 2.4.18) thread at:

http://marc.theaimsgroup.com/?l=ext3-users&m=101041709332435&w=4

No mention of RAID in the second thread. I can't find the other messages to which I refer in the first thread about "several other reports of corruption with ext3 on MD RAID".

Neil, you posted in another message the text from the corrupt directory blocks, which matches the second thread, where the bad data looked like filenames (the first thread's directory data doesn't decode to anything useful in ASCII).

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
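The observation that the bad data looks like filenames can be checked against the numbers in Neil's log lines. The sketch below re-packs the three reported fields in the on-disk layout of an ext3 directory entry header (32-bit little-endian inode, 16-bit little-endian rec_len, 8-bit name_len) and prints the bytes they came from. The helper names are hypothetical, and the fourth header field (file_type) is left out because the log does not report it.

```python
import struct


def dirent_header_bytes(inode, rec_len, name_len):
    """Re-pack logged dirent fields into the first 7 on-disk bytes.

    Layout assumed: little-endian u32 inode, u16 rec_len, u8 name_len.
    """
    return struct.pack("<IHB", inode, rec_len, name_len)


def rec_len_is_aligned(rec_len):
    """The sanity check named in the error: rec_len must be a multiple of 4."""
    return rec_len % 4 == 0


# Values taken from the EXT3-fs error lines in the thread.
raw = dirent_header_bytes(1886221359, 24927, 109)
print(raw)                        # b'/tmp_am'
print(rec_len_is_aligned(24927))  # False
```

The first bytes of the "directory" block are therefore the text "/tmp_am", which is consistent with a block of file data (path-like text) sitting where a directory block was expected, rather than with random garbage.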
Andreas Dilger
2002-May-21 17:49 UTC
Re: Bad directories appearing in ext3 after upgrade 2.4.16 -> 2.4.18+cvs
On May 21, 2002 16:30 +1000, Neil Brown wrote:
> My ext3 filesystem is on a raid5 array, with the journal on
> a separate raid1 array. (data=journal mode).

Just to follow up on my previous email, the "other ext3/RAID problems" are probably from the thread below (start, and ominous end):

http://marc.theaimsgroup.com/?l=ext3-users&m=101489319820233&w=4
http://marc.theaimsgroup.com/?l=ext3-users&m=102011493526790&w=4

In his case, running e2fsck on the MD RAID filesystem actually causes corruption that wasn't there before (see last message)...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/