thr3ads.net - freebsd stable - 7.2 filesystem corruption [May 2010]

Charles Sprickman

2010-May-21 08:04 UTC

7.2 filesystem corruption

Hello all,

Not sure where to go with this post, I've tried -fs and -scsi previously
in trying to track down some panics in the softdep stuff. Perhaps the
more general audience here can shove me in the right direction.

I have a box (Dell PE 2970) running FreeBSD 7.2/amd-64. 6 GB of ECC RAM,
and a Dell-branded LSI RAID controller (mpt driver). It's a mail server
with the active mail server running in a jail and a test version of same
running in another jail (qmail/vpopmail/courier on old,
postfix/pfadmin/dovecot on new).

It passed a few weeks of heavy stress testing where I was putting much
more load on it using an imap/pop/smtp test suite before going into
production with only one panic (which happened during a fairly intense
mstone run) - I figured I was somewhat on the bleeding edge with 7.x
64-bit at that time, so I was not overly concerned since I've run into
softdep panics before. Since then however, there have been a few panics
in "ufsdirhash_lookup".

When this happens, the box reboots, does a background fsck and does not
complain about anything. I decided background fsck was probably not a
good idea, so I disabled it and manually fsck'd on all subsequent panics.
The pattern is similar to this example:

** /dev/mfid0s1g
** Last Mounted on /spool
** Phase 1 - Check Blocks and Sizes
UNKNOWN FILE TYPE I=147718184
UNEXPECTED SOFT UPDATE INCONSISTENCY
CLEAR? yes

PARTIALLY ALLOCATED INODE I=147718185
UNEXPECTED SOFT UPDATE INCONSISTENCY

And in phase 2, lots of this:

UNALLOCATED I=152688468 OWNER=root MODE=0 SIZE=0 MTIME=Dec 31 19:00 1969
NAME=/jails/mailbak.blah.net/home/vpopmail/domains/blah.net/A/spec/Maildir/new/1233549930.73014.blah.bway.net

UNEXPECTED SOFT UPDATE INCONSISTENCY
REMOVE? yes

And in Phase 4, lots of this:

** Phase 4 - Check Reference Counts
UNREF FILE I=147623979 OWNER=88 MODE=100600
SIZE=0 MTIME=Feb 7 00:19 2010
CLEAR? yes

In the manual runs, I tend to run through about 3 or 4 times, since even
though the filesystem gets marked "clean", another run finds more errors.
Once I get two clean runs in a row, I let the box boot.

Regardless of how "clean" the fs is, I have consistently seen messages
like this in my serial console log:

g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5
g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5
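
For a sense of scale, the offset in these messages is roughly 2 EiB, and error 5 is EIO. The following is a minimal sketch (not part of the original post) that pulls such lines out of a console log and flags reads that fall outside the device. The device size used here is a hypothetical placeholder, since the post does not state the partition size, and the function names are illustrative only.

```python
import re

# Matches g_vfs_done() lines like the ones quoted above; \s* tolerates the
# line wrap that sometimes appears before "error" or "= 5" in console logs.
LINE_RE = re.compile(
    r"g_vfs_done\(\):(?P<dev>\w+)\[(?P<op>READ|WRITE)"
    r"\(offset=(?P<offset>-?\d+), length=(?P<length>\d+)\)\]"
    r"\s*error\s*=\s*(?P<errno>\d+)"
)

def parse_g_vfs_done(text):
    """Return one dict per g_vfs_done() error message found in text."""
    return [
        {
            "dev": m.group("dev"),
            "op": m.group("op"),
            "offset": int(m.group("offset")),
            "length": int(m.group("length")),
            "errno": int(m.group("errno")),
        }
        for m in LINE_RE.finditer(text)
    ]

def out_of_range(entry, device_bytes):
    """True if the request starts before 0 or ends past the device."""
    return entry["offset"] < 0 or entry["offset"] + entry["length"] > device_bytes

# ASSUMPTION: hypothetical 2 TiB partition; the real size is not in the post.
ASSUMED_DEVICE_BYTES = 2 * 1024**4

sample = (
    "g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error\n"
    "= 5\n"
)

for entry in parse_g_vfs_done(sample):
    eib = entry["offset"] / 2**60
    print(entry["dev"], f"{eib:.2f} EiB", "out of range:",
          out_of_range(entry, ASSUMED_DEVICE_BYTES))
```

An offset this far beyond any plausible device size suggests the filesystem handed a garbage block number to the disk layer, rather than the disk failing a legitimate read.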

On the last run, I also turned off soft updates for good measure.

Now I occasionally get these errors:

g_vfs_done():mfid0s1g[READ(offset=5335388948596480000, length=16384)]error = 5
bad block 838May 18 00:29:14 8bigmail kernel: 3pid 24481 (rm), 0uid 0
inumber 1571657736 on /spoo6l: bad block
76548920427, ino 151657736

In addition, there are some files that now have bizarre flags set, such as
"schg", "sappnd", "opaque", etc. Some can be changed, others give a "bad
file descriptor" error.
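
To get an inventory of which files picked up unexpected flags, something along these lines could help. This is a rough sketch, not something from the original post: it assumes a Python interpreter is available on the box, relies on the `st_flags` field that BSD systems expose through `os.lstat`, and the function names are made up for illustration.

```python
import os
import stat

# Map flag bits to the names chflags(1) uses for them.
FLAG_NAMES = [
    (stat.UF_NODUMP, "nodump"),
    (stat.UF_IMMUTABLE, "uchg"),
    (stat.UF_APPEND, "uappnd"),
    (stat.UF_OPAQUE, "opaque"),
    (stat.SF_ARCHIVED, "arch"),
    (stat.SF_IMMUTABLE, "schg"),
    (stat.SF_APPEND, "sappnd"),
]

def decode_flags(flags):
    """Return the chflags-style names of the bits set in flags."""
    return [name for bit, name in FLAG_NAMES if flags & bit]

def flagged_files(root):
    """Yield (path, flag names) for every file under root with flags set."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # Damaged entries may fail to stat; skip them.
                continue
            # st_flags only exists on BSD-like systems; default to 0 elsewhere.
            flags = getattr(st, "st_flags", 0)
            if flags:
                yield path, decode_flags(flags)
```

Calling `flagged_files("/spool")` and printing each result would give a list to compare between fsck runs. Mail spool files normally carry no flags at all, so anything listed is suspect.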

I fear the fs is getting more scrambled.

I started to think that I'm probably dealing with two things - some bug in
64-bit UFS2, plus a perpetually dirty filesystem that causes the box to
panic, which causes more corruption, and so on.

I do have the option of trying to schedule a huge maintenance window and
dumping the fs, newfs'ing it, and then restoring it, but it's a tough sell
and for various reasons I can't put a ton of time into this (anyone that
knows me, hit me up offlist for a fun story). I'm also quite concerned
that fsck is finding and fixing things, but the fs is still obviously not
quite "right". In short, how can I ensure this won't happen a week after
a dump/restore?

So that's the story, here are my questions:

-Is there any interest in tracking down what the nature of the initial
panic/corruption is? I know I'm a release behind, but digging through the
PR database, nothing stuck out as far as softdep, mpt, or dirhash bugs
that looked similar to what I'm seeing that got fixed in 7.3.

-Where is the most likely place to look for a problem here? The mpt
driver? The megacli utility and the bios utility both claim the array is
in great shape. The only fs that ever shows the errors with "g_vfs_done"
and the nonsensical offsets is the partition where the jails reside. Or
is it a ufsdirhash thing? I saw some interesting bug reports, but nothing
that quite matched. UFS2/SU itself?

-If I do dump/restore (or pull from backups), should I stick to 7.2 or go
to 7.3 while I'm working on the box? Or gamble on 8.0 (where I've oddly
enough seen far fewer odd things of late)?

For reference, here are a few other queries regarding this issue:

http://marc.info/?l=freebsd-stable&m=125901173424554&w=2
http://old.nabble.com/7.2-p4:-panic:-ufsdirhash_lookup:-bad-offset-in-hash-array-td27715632.html

I still have some core dumps sitting here as well.

Any input would be appreciated - I do have more info available, but this
message is already about twice as long as I'd like it to be. Hit me up
with any questions.

Thanks,

Charles

Bob Bishop

2010-May-21 08:14 UTC

7.2 filesystem corruption

Hi,

On 21 May 2010, at 09:04, Charles Sprickman wrote:
> Hello all,
>
> [...]I have a box (Dell PE 2970) running FreeBSD 7.2/amd-64.  6 GB of ECC
> RAM, and a Dell-branded LSI RAID controller (mpt driver). [tale of woe elided]

For any case of spooky behaviour involving SCSI, make completely sure that the
SCSI cabling is above suspicion. If it isn't, your sanity will be the first
casualty.

--
Bob Bishop
rb@gid.co.uk
