Hi,
I have seen g_vfs_done() failures with absurd offsets under heavy I/O
load. The system does not recover, so a reboot is required. The problem
appears to occur without any underlying disk device failure.
An example from yesterday:
This message repeats on the order of tens of thousands of times, with no
earlier message:
g_vfs_done():da1s1d[READ(offset=5036583429229836288, length=16384)]error = 5
Bsdlabel correctly reports that /dev/da1s1d has 1748318312 512-byte blocks
(about 895 GB), so a read offset of roughly 5 * 10^18 bytes is clearly far
beyond the end of the partition; a quick check of the numbers is sketched
below. The filesystem was mounted with softupdates, and a few "rm -rf"s were
running on two CVS repositories. After this error the rm could not be killed
with SIGTERM or SIGKILL. (Unfortunately, I didn't check wchan for the rm
process. Sorry.)
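For reference, the arithmetic as a small stand-alone C program. The block
count and the offset are taken from the bsdlabel output and the first
g_vfs_done() line above; this is only an illustration of how far past the
end of the partition the read was aimed, not anything to do with the fix:

    /*
     * offcheck.c: compare the g_vfs_done() read offset against the
     * partition size reported by bsdlabel for /dev/da1s1d.
     * Build with: cc -o offcheck offcheck.c
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uint64_t blocks = 1748318312ULL;          /* 512-byte blocks (bsdlabel) */
            uint64_t devsize = blocks * 512ULL;       /* partition size in bytes */
            uint64_t offset = 5036583429229836288ULL; /* offset from first error */

            printf("partition size: %20" PRIu64 " bytes\n", devsize);
            printf("read offset:    %20" PRIu64 " bytes\n", offset);
            printf("offset is ~%" PRIu64 " times the partition size\n",
                offset / devsize);
            return (0);
    }

The ratio comes out to several million, so the offset looks like a garbage
value rather than anything resembling a valid block address on the device.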
The shutdown took three hours. I didn't have console access, so I don't
know what the console messages were at the time. The machine did respond
to pings during at least the first hour. After it came back up, the
filesystems were all reported as clean. Attempting to finish off the "rm"
produced this result:
bad block 8819084429375818952, ino 92865791
pid 49 (softdepflush), uid 0 inumber 92865791 on /work: bad block
bad block -8123569960048088809, ino 92865791
pid 49 (softdepflush), uid 0 inumber 92865791 on /work: bad block
handle_workitem_freeblocks: block count
g_vfs_done():da1s1d[READ(offset=1154660658434844672, length=16384)]error = 5
bad block -9114721846648257515, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=8698001308483434496, length=16384)]error = 5
bad block -8102232258315484873, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=4586979512427630592, length=16384)]error = 5
bad block -3438510379221006390, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=196654394503331840, length=16384)]error = 5
g_vfs_done():da1s1d[READ(offset=26142581273591808, length=16384)]error = 5
bad block 504981533259792482, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block 1538054898336656903, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=249387551018614784, length=16384)]error = 5
bad block 18582847101533720, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=259247319150690304, length=16384)]error = 5
bad block -3429473246997783577, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block -3335830404336954747, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block -1007814018434232494, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
handle_workitem_freeblocks: block count
A reboot to single user mode and an fsck cleaned things up.
In this case it is a machine running 6.2-RC1/amd64 with patches, on a
SuperMicro motherboard with 2 x Xeon 5140 CPUs, 4GB of ECC memory and an
Areca SATA RAID controller. The RAID array is RAID-6, with the controller
cache set to write-through and the drive write caches disabled. The
controller reported no I/O errors and no volumes are degraded. I have also
seen very similar problems on a dual-Opteron machine with ataraid (in that
case running 6.1-RELEASE), again with no degraded volumes and no device I/O
errors reported.
The patches:
- Daichi Goto's unionfs-p16 has been applied.
- The Areca driver is 1.20.00.12 from the Areca website.
- sym(4) patch (see PR/89550), but no sym controller present.
- Kernel config includes SMP + FAST_IPSEC + SUIDDIR + device crypto.
So: I've seen this problem on a few machines under heavy I/O load, with
ataraid and with arcmsr. I've seen others report similar problems, but no
resolution. Does anyone have any idea what the problem is? Has anyone else
seen similar problems? Where to from here?
Thanks,
Jan Mikkelsen
janm@transactionware.com