Hi,
I have seen g_vfs_done() failures with absurd offsets under heavy I/O
load. The system does not recover, so a reboot is required. The problem
appears to occur without any underlying disk device failure.
An example from yesterday:
This message repeats on the order of tens of thousands of times, with no
earlier message:
g_vfs_done():da1s1d[READ(offset=5036583429229836288, length=16384)]error = 5
Bsdlabel correctly reports that /dev/da1s1d has 1748318312 512-byte blocks
(about 895 GB), so a read offset of roughly 5 * 10^18 bytes is clearly far
beyond the end of the partition; a quick check of the numbers is sketched
below. The filesystem was mounted with softupdates, and a few "rm -rf"s were
running on two CVS repositories. After this error the rm could not be killed
with SIGTERM or SIGKILL. (Unfortunately, I didn't check wchan for the rm
process. Sorry.)
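For reference, the arithmetic as a small stand-alone C program. The block
count and the offset are taken from the bsdlabel output and the first
g_vfs_done() line above; this is only an illustration of how far past the
end of the partition the read was aimed, not anything to do with the fix:

    /*
     * offcheck.c: compare the g_vfs_done() read offset against the
     * partition size reported by bsdlabel for /dev/da1s1d.
     * Build with: cc -o offcheck offcheck.c
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uint64_t blocks = 1748318312ULL;          /* 512-byte blocks (bsdlabel) */
            uint64_t devsize = blocks * 512ULL;       /* partition size in bytes */
            uint64_t offset = 5036583429229836288ULL; /* offset from first error */

            printf("partition size: %20" PRIu64 " bytes\n", devsize);
            printf("read offset:    %20" PRIu64 " bytes\n", offset);
            printf("offset is ~%" PRIu64 " times the partition size\n",
                offset / devsize);
            return (0);
    }

The ratio comes out to several million, so the offset looks like a garbage
value rather than anything resembling a valid block address on the device.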
The shutdown took three hours. I didn't have console access, so I don't
know what the console messages were at the time. The machine did respond
to pings during at least the first hour. After it came back up, the
filesystems were all reported as clean. Attempting to finish off the "rm"
produced this result:
bad block 8819084429375818952, ino 92865791
pid 49 (softdepflush), uid 0 inumber 92865791 on /work: bad block
bad block -8123569960048088809, ino 92865791
pid 49 (softdepflush), uid 0 inumber 92865791 on /work: bad block
handle_workitem_freeblocks: block count
g_vfs_done():da1s1d[READ(offset=1154660658434844672, length=16384)]error = 5
bad block -9114721846648257515, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=8698001308483434496, length=16384)]error = 5
bad block -8102232258315484873, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=4586979512427630592, length=16384)]error = 5
bad block -3438510379221006390, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=196654394503331840, length=16384)]error = 5
g_vfs_done():da1s1d[READ(offset=26142581273591808, length=16384)]error = 5
bad block 504981533259792482, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block 1538054898336656903, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=249387551018614784, length=16384)]error = 5
bad block 18582847101533720, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=259247319150690304, length=16384)]error = 5
bad block -3429473246997783577, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block -3335830404336954747, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block -1007814018434232494, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
handle_workitem_freeblocks: block count
A reboot to single user mode and an fsck cleaned things up.
In this case it is a machine running 6.2-RC1/amd64 with patches, on a
SuperMicro motherboard with 2 x Xeon 5140 CPUs, 4GB of ECC memory and an
Areca SATA RAID controller. The RAID array is RAID-6, with the controller
cache set to write-through and the drive write caches disabled. The
controller reported no I/O errors and no volumes are degraded. I have also
seen very similar problems on a dual-Opteron machine with ataraid (in that
case running 6.1-RELEASE), again with no degraded volumes and no device I/O
errors reported.
The patches:
- Daichi Goto's unionfs-p16 has been applied.
- The Areca driver is 1.20.00.12 from the Areca website.
- sym(4) patch (see PR/89550), but no sym controller present.
- Kernel config includes SMP + FAST_IPSEC + SUIDDIR + device crypto.
So: I've seen this problem on a few machines under heavy I/O load, with
ataraid and with arcmsr. I've seen others report similar problems, but no
resolution. Does anyone have any idea what the problem is? Has anyone else
seen similar problems? Where to from here?
Thanks,
Jan Mikkelsen
janm@transactionware.com