On Tue, Mar 10, 2015 at 02:10:09PM -0400, John Baldwin wrote:
> On Tuesday, March 10, 2015 10:17:07 AM Nick Frampton wrote:
> > Hi,
> >
> > For the past several months, we have had an intermittent problem where a
> > process calling kvm_openfiles(3) or kvm_getprocs(3) (not sure which) gets
> > stuck in an infinite loop and goes to 100% cpu. We have just observed
> > "fstat -m" do the same thing and suspect it may be the same problem.
> >
> > Our environment is a 10.1-RELEASE-p6 amd64 guest running in VirtualBox,
> > with ufs root and zfs /home.
> >
> > Has anyone else experienced this? Is there anything we can do to
> > investigate the problem further?
>
> Often loops using libkvm are due to programs trying to read kernel data
> structures while they are changing. However, if you use sysctls to fetch
> this data instead, you should be able to get a stable snapshot of the
> system state without getting stuck in a possible loop. I believe for
> libkvm to use sysctl instead of /dev/kmem you have to pass a NULL for the
> kernel and "/dev/null" for the core image. fstat -m should be doing that
> by default however, so if it is not that, can you ktrace fstat when it is
> spinning to see if it is spinning in userland or in the kernel? If you see
> no activity via ktrace, then it is spinning in one of the two places
> without making any system calls, etc. You can attach to it with gdb to
> pause it, then see where gdb thinks it is. If gdb hangs attaching to it,
> then it is stuck in the kernel.
>
> If gdb attaches to it ok, then it is spinning in userland. Unfortunately,
> for gdb to be useful, you really need debug symbols. We don't currently
> provide those for release binaries or binaries provided via freebsd-update
> (though that is being worked on for 11.0). If you build from source, then
> the simplest way to get this is to add 'WITH_DEBUG_FILES=yes' to
> /etc/src.conf and rebuild your world without NO_CLEAN. If you are building
> from source and are able to reproduce with those binaries, then after
> attaching to the process with gdb, use 'bt' to see where it is hung and
> reply with that.
>
> If it is hanging in the kernel, then you will need to use the kernel
> debugger to see where it is hanging. The simplest way to do this is
> probably to force a crash via the debug.kdb.panic sysctl (set it to a
> non-zero value). You will then need to fire up kgdb on the crash dump
> after it reboots, switch to the fstat process via the 'proc <pid>' command
> and get a backtrace via 'bt'.

It sounds like this issue might be the one fixed in r272566: if the
KERN_PROC_ALL sysctl is read with an insufficiently large buffer, an
sbuf error return value could bubble up and be treated as ERESTART,
resulting in a loop.

This can be confirmed with something like

  dtrace -n 'syscall:::entry /pid == $target/{@[probefunc] = count();}
      tick-3s {exit(0);}' -p <pid of looping proc>

If the output consists solely of __sysctl, this bug is likely the culprit.
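
For illustration only, here is a minimal userland sketch of the
size-then-fetch pattern that sysctl(3) consumers commonly follow, with a
bounded retry when the buffer turns out to be too small (ENOMEM). It is
not the libkvm code and not the r272566 change: fake_sysctl_read() and
fetch_with_retry() are hypothetical names, and the stand-in merely
simulates a table that grows between the size query and the read.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a sysctl(3)-style read. NOT the real kernel
 * interface: it only mimics "NULL buffer means size query" and
 * "too-small buffer fails with ENOMEM".
 */
static char fake_table[64];
static size_t fake_need = 16;

static int
fake_sysctl_read(void *oldp, size_t *oldlenp)
{
	if (oldp == NULL) {
		*oldlenp = fake_need;
		fake_need = 40;	/* simulate the table growing afterwards */
		return (0);
	}
	if (*oldlenp < fake_need) {
		errno = ENOMEM;
		return (-1);
	}
	memcpy(oldp, fake_table, fake_need);
	*oldlenp = fake_need;
	return (0);
}

/*
 * Query the size, then read; on ENOMEM double the buffer and try again,
 * at most max_tries times. Any other error ends the loop immediately.
 * Returns a malloc'd buffer (caller frees) or NULL on failure.
 */
static void *
fetch_with_retry(size_t *lenp, int max_tries)
{
	void *buf = NULL, *nbuf;
	size_t len = 0;
	int i;

	assert(lenp != NULL && max_tries > 0);
	if (fake_sysctl_read(NULL, &len) != 0)
		return (NULL);
	for (i = 0; i < max_tries; i++) {
		nbuf = realloc(buf, len);
		if (nbuf == NULL)
			break;
		buf = nbuf;
		*lenp = len;
		if (fake_sysctl_read(buf, lenp) == 0)
			return (buf);
		if (errno != ENOMEM)
			break;		/* unexpected error: do not spin */
		len *= 2;
	}
	free(buf);
	return (NULL);
}
```

A caller would do something like "buf = fetch_with_retry(&len, 8);" and
free(buf) when done. The retry cap is a choice made for this sketch, so
that an unexpected error ends the loop instead of spinning.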
-Mark