Hello.
Last week I upgraded a 9.3/amd64 box to 10.3: since then, it has crashed and
rebooted at least once every night.
The only exception was on Friday, when it locked up without rebooting: it
still answered ping requests and logins through HTTP would half work; I'm
under the impression that the disk subsystem was hung, so ICMP would
work since it does no I/O, and HTTP worked too as long as no disk access
was required.
Today I was able to get a couple of (almost identical) dumps:
> cpuid = 1
> KDB: stack backtrace:
> #0 0xffffffff804ee170 at kdb_backtrace+0x60
> #1 0xffffffff804b4576 at vpanic+0x126
> #2 0xffffffff804b4443 at panic+0x43
> #3 0xffffffff8068fd2a at softdep_deallocate_dependencies+0x6a
> #4 0xffffffff805394b5 at brelse+0x145
> #5 0xffffffff8053793c at bufwrite+0x3c
> #6 0xffffffff806ae20f at ffs_write+0x3df
> #7 0xffffffff8076d519 at VOP_WRITE_APV+0x149
> #8 0xffffffff806ec7c9 at vnode_pager_generic_putpages+0x2a9
> #9 0xffffffff8076f3b7 at VOP_PUTPAGES_APV+0xa7
> #10 0xffffffff806ea6f5 at vnode_pager_putpages+0xc5
> #11 0xffffffff806e17f8 at vm_pageout_flush+0xc8
> #12 0xffffffff806db432 at vm_object_page_collect_flush+0x182
> #13 0xffffffff806db1cd at vm_object_page_clean+0x13d
> #14 0xffffffff806dadbe at vm_object_terminate+0x8e
> #15 0xffffffff806eac60 at vnode_destroy_vobject+0x90
> #16 0xffffffff806b4232 at ufs_reclaim+0x22
> #17 0xffffffff8076e5c7 at VOP_RECLAIM_APV+0xa7
Does anyone have any better insight into what might be going on?
The disks are all connected to a SAS RAID adapter running on mfi; I
don't think it is a hardware issue, since it worked perfectly
for years until I did the upgrade; also, mfiutil says everything is OK
and nothing mfi-related is in the logs.
Some ideas come to mind on which I could use a second opinion:
_ soft-updates is broken: that would really surprise me, since I've been
using it for years on this and several other boxes (some on 10.3 too);
_ snapshot creation/deletion is causing this: again, I'm using that
almost everywhere, so I don't think it can be the sole cause;
besides, I've been able to do some dumps without trouble and I don't
think anything was messing with snapshots at the time of the last two
panics;
_ the mfi driver is broken on 10.3: this seems more plausible to me, since
this is the only machine I have it on and it's the only one where I get
these panics.
I found https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183618, but I
get no "g_vfs_done()..." messages.
Any other hint?
I'd really like to find out what's going on; I'll appreciate any help
and I'm willing to provide any useful info.
On the other hand, this is a production server, so I have to solve this
really soon.
Some ideas come to mind, like disabling soft updates (knowing which file
system was having trouble would help here; is there any way to know?),
trying to enable journaling, upgrading to 10-STABLE, or building a kernel
with INVARIANTS/WITNESS/etc., but I'd appreciate a second opinion before I
start shooting in the dark.
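This doesn't tell me which file system actually panicked, but to at least
narrow down the candidates I could list the ones mounted with soft updates.
A minimal sketch, assuming the usual mount(8) output format
("device on mountpoint (options)") and mount points without spaces; the
function name is just something I made up:

```shell
# list_su_filesystems (hypothetical helper): reads mount(8) output on stdin
# and prints the mount point of every file system whose option list
# mentions soft-updates (this also matches "journaled soft-updates").
list_su_filesystems() {
    awk '/soft-updates/ { print $3 }'
}

# Intended use on the affected box:
#   mount -t ufs | list_su_filesystems
```

Anything it prints would be a candidate for turning soft updates off first;
file systems without that option can probably be ruled out for this panic.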
bye & Thanks
av.