thr3ads.net - freebsd stable - Help debugging stable/10 [Dec 2014]

If this information is useful, please help other people find it:
Share via:

Shane Ambler

2014-Dec-11 01:38 UTC

Help debugging stable/10

Since upgrading to 10.1 (RC2) I have had trouble getting uptimes greater
than 1 day. I have little experience debugging the OS so could use some
help.

# uname -a
FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275364: Tue Dec
2 08:13:06 ACDT 2014 
root at leader.local:/usr/obj/usr/src-stable/sys/GENERIC amd64

This is on an ASUS P8H61-M LE/USB3 corei5 8GB with 3x 2TB Seagate drives
in raidz.
Full backtraces and dmesg at http://shaneware.biz/freebsddebugdata/

The thing that breaks which forces me to reset the machine is that I am
unable to start new processes. Existing processes continue to work I
just can't start new ones. Some simple commands do work but top ps
procstat usbconfig all fail to start. I have been able use script to
get some backtraces from kgdb before restarting.

Back in RC2 and 3 I had it freeze up leaving me to reset, since RC4 and
now on stable I have been able to close up most processes before resetting.

I don't know how to re-create this issue on command, but I just had 4
days uptime which is the best I've had in the last month.

How do I find what is preventing new processes starting?

What can I do to get more data when things go south?

I noticed an odd thing in the backtrace today -

Thread 16 (Thread 101376):
#0  sched_switch (td=0xfffff8011c1e7000, newtd=<value optimized out>, 
flags=<value optimized out>)
     at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0x0000000000000020 in ?? ()
#2  0xfffffe021e5a8200 in ?? ()
#3  0xffffffff80913166 in malloc (size=<value optimized out>,
mtp=<value
optimized out>, flags=<value optimized out>)
     at /usr/src-stable/sys/kern/kern_malloc.c:352
#4  0xec81485354415541 in ?? ()
#5  0x4dce894d00000268 in ?? ()
#6  0xd58949cc8949c789 in ?? ()
#7  0x9358253c8148fb89 in ?? ()
#8  0x850f000000008164 in ?? ()

<snip for email>

#16890 0x067ee8f68944df89 in ?? ()
#16891 0x8618251c8948fdd7 in ?? ()
#16892 0x415b10c4834882c1 in ?? ()
#16893 0x0000000000c35d5e in ?? ()
#16894 0x0000000000000000 in ?? ()

Yes there are nearly 17,000 lines in that backtrace, no idea what it is
though.

-- 
FreeBSD - the place to B...Software Developing

Shane Ambler

John Baldwin

2014-Dec-16 16:29 UTC

head link

Help debugging stable/10

On Wednesday, December 10, 2014 8:38:21 pm Shane Ambler
wrote:> Since upgrading to 10.1 (RC2) I have had trouble getting uptimes greater
> than 1 day. I have little experience debugging the OS so could use some
> help.
> 
> # uname -a
> FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275364: Tue Dec
> 2 08:13:06 ACDT 2014 
> root at leader.local:/usr/obj/usr/src-stable/sys/GENERIC amd64
> 
> This is on an ASUS P8H61-M LE/USB3 corei5 8GB with 3x 2TB Seagate drives
> in raidz.
> Full backtraces and dmesg at http://shaneware.biz/freebsddebugdata/
> 
> The thing that breaks which forces me to reset the machine is that I am
> unable to start new processes. Existing processes continue to work I
> just can't start new ones. Some simple commands do work but top ps
> procstat usbconfig all fail to start. I have been able use script to
> get some backtraces from kgdb before restarting.
It looks like your processes are hanging on a global lock used by sysctl(3).
That would explain hangs in top/ps/procstat as they all use sysctls.  For
example:

Thread 709 (Thread 101172):
#0  sched_switch (td=0xfffff8006cccd490, newtd=<value optimized out>,
flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at
/usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at
/usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff8157adc0,
tid=18446735279441892496,
    opts=<value optimized out>, file=0x0, line=0) at
/usr/src-stable/sys/kern/kern_sx.c:681
#4  0xffffffff80936004 in userland_sysctl (td=<value optimized out>,
name=0xfffffe021e889930, namelen=3,
    old=<value optimized out>, oldlenp=<value optimized out>,
inkernel=<value optimized out>,
    new=<value optimized out>, newlen=<value optimized out>,
retval=<value optimized out>, flags=0) at sx.h:152
#5  0xffffffff80935e64 in sys___sysctl (td=0xfffff8006cccd490,
uap=0xfffffe021e889a40)
    at /usr/src-stable/sys/kern/kern_sysctl.c:1536
#6  0xffffffff80d29c51 in amd64_syscall (td=0xfffff8006cccd490, traced=0) at
subr_syscall.c:134
#7  0xffffffff80d0eceb in Xfast_syscall () at
/usr/src-stable/sys/amd64/amd64/exception.S:391
#8  0x0000000800b6e64a in ?? ()
Previous frame inner to this frame (corrupt stack?)
---Type <return> to continue, or q <return> to quit---

To hang like this means that some sysctl handler is running for a long time.  

Looking in your traces, this appears to be the problematic sysctl:

Thread 397 (Thread 100422):
#0  sched_switch (td=0xfffff8001049f920, newtd=<value optimized out>,
flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at
/usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097189a in sleepq_timedwait (wchan=0x0, pri=0) at
/usr/src-stable/sys/kern/subr_sleepqueue.c:652
#3  0xffffffff8093320e in _sleep (ident=<value optimized out>,
lock=<value optimized out>,
    priority=<value optimized out>, wmesg=<value optimized out>,
sbt=<value optimized out>, pr=<value optimized out>,
    flags=<value optimized out>) at
/usr/src-stable/sys/kern/kern_synch.c:251
#4  0xffffffff80891ae3 in g_waitfor_event (func=<value optimized out>,
arg=<value optimized out>,
    flag=<value optimized out>) at
/usr/src-stable/sys/geom/geom_event.c:427
#5  0xffffffff808933d9 in sysctl_kern_geom_conftxt (oidp=<value optimized
out>, arg1=<value optimized out>,
    arg2=<value optimized out>, req=0xfffffe021dc13868) at
/usr/src-stable/sys/geom/geom_kern.c:169
#6  0xffffffff80935a9e in sysctl_root (arg1=<value optimized out>,
arg2=<value optimized out>)
    at /usr/src-stable/sys/kern/kern_sysctl.c:1500
#7  0xffffffff80936078 in userland_sysctl (td=<value optimized out>,
name=0xfffffe021dc13930,
    namelen=<value optimized out>, old=<value optimized out>,
oldlenp=<value optimized out>,
    inkernel=<value optimized out>, new=<value optimized out>,
newlen=<value optimized out>,
    retval=<value optimized out>, flags=0) at
/usr/src-stable/sys/kern/kern_sysctl.c:1610
#8  0xffffffff80935e64 in sys___sysctl (td=0xfffff8001049f920,
uap=0xfffffe021dc13a40)
    at /usr/src-stable/sys/kern/kern_sysctl.c:1536
#9  0xffffffff80d29c51 in amd64_syscall (td=0xfffff8001049f920, traced=0) at
subr_syscall.c:134
#10 0xffffffff80d0eceb in Xfast_syscall () at
/usr/src-stable/sys/amd64/amd64/exception.S:391
#11 0x00000008019c864a in ?? ()
Previous frame inner to this frame (corrupt stack?)

(Note that HEAD has some changes by mjg@ (cc'd) to allow concurrent sysctls
that would at least make this less painful.)

The root issue is why is geom hanging.  Hmm, that thread is also stuck on an
sx lock:

Thread 259 (Thread 100015):
#0  sched_switch (td=0xfffff80005170000, newtd=<value optimized out>,
flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at
/usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at
/usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff816451e0,
tid=18446735277701922816,
    opts=<value optimized out>, file=0x0, line=0) at
/usr/src-stable/sys/kern/kern_sx.c:681
#4  0xffffffff808912b2 in g_run_events () at sx.h:152
#5  0xffffffff808fad3a in fork_exit (callout=0xffffffff80893240
<g_event_procbody>, arg=0x0, frame=0xfffffe01ef98aac0)
    at /usr/src-stable/sys/kern/kern_fork.c:996
#6  0xffffffff80d0ef3e in fork_trampoline () at
/usr/src-stable/sys/amd64/amd64/exception.S:606
#7  0x0000000000000000 in ?? ()

That is probably g_topology_lock.  If so, it is held by this thread
(g_dev_open() grabs it around g_access()):

Thread 673 (Thread 101936):
#0  sched_switch (td=0xfffff80162200490, newtd=<value optimized out>,
flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at
/usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at
/usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff80933227 in _sleep (ident=<value optimized out>,
lock=<value optimized out>,
    priority=<value optimized out>, wmesg=<value optimized out>,
sbt=<value optimized out>, pr=<value optimized out>,
    flags=<value optimized out>) at
/usr/src-stable/sys/kern/kern_synch.c:255
#4  0xffffffff8030b9b3 in daopen (dp=<value optimized out>) at
/usr/src-stable/sys/cam/scsi/scsi_da.c:1295
#5  0xffffffff8089013b in g_disk_access (pp=0xfffff800097f4c00, r=<value
optimized out>, w=<value optimized out>,
    e=<value optimized out>) at /usr/src-stable/sys/geom/geom_disk.c:136
#6  0xffffffff80894f3e in g_access (cp=0xfffff800097f2e00, dcr=1, dcw=0, dce=0)
    at /usr/src-stable/sys/geom/geom_subr.c:913
#7  0xffffffff8088df12 in g_dev_open (dev=<value optimized out>,
flags=<value optimized out>,
    fmt=<value optimized out>, td=<value optimized out>) at
/usr/src-stable/sys/geom/geom_dev.c:361
#8  0xffffffff808180f2 in devfs_open (ap=<value optimized out>) at
/usr/src-stable/sys/fs/devfs/devfs_vnops.c:1086
---Type <return> to continue, or q <return> to quit---
#9  0xffffffff80e44fd1 in VOP_OPEN_APV (vop=<value optimized out>,
a=<value optimized out>) at vnode_if.c:469
#10 0xffffffff809d9bb4 in vn_open_vnode (vp=0xfffff8017d14d000, fmode=1,
cred=0xfffff8000c2e4b00,
    td=0xfffff80162200490, fp=0xfffff80081060230) at vnode_if.h:196
#11 0xffffffff809d97ac in vn_open_cred (ndp=0xfffffe021e97b880,
flagp=0xfffffe021e97b95c,
    cmode=<value optimized out>, vn_open_flags=<value optimized
out>, cred=0x0, fp=0xfffff80081060230)
    at /usr/src-stable/sys/kern/vfs_vnops.c:256
#12 0xffffffff809d2def in kern_openat (td=0xfffff80162200490, fd=-100, 
    path=0x7fffffffe99e <Error reading address 0x7fffffffe99e: Bad
address>, pathseg=UIO_USERSPACE, flags=1,
    mode=<value optimized out>) at
/usr/src-stable/sys/kern/vfs_syscalls.c:1091
#13 0xffffffff80d29c51 in amd64_syscall (td=0xfffff80162200490, traced=0) at
subr_syscall.c:134
#14 0xffffffff80d0eceb in Xfast_syscall () at
/usr/src-stable/sys/amd64/amd64/exception.S:391
#15 0x0000000800f8dcea in ?? ()
Previous frame inner to this frame (corrupt stack?)

Line 1295 of scsi_da.c in a stable/10 checkout of today is this:

        /* Wait for the disk size update.  */
        error = cam_periph_sleep(periph, &softc->disk->d_mediasize,
PRIBIO,
            "dareprobe", 0);

To be stuck here means that the request dareprobe() queued is stuck behind
some other request on the device.  Looking in your traces, there are several
threads all waiting for an ioctl() to a passX device to complete:

Thread 432 (Thread 100542):
#0  sched_switch (td=0xfffff801acb8a920, newtd=<value optimized out>,
flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at
/usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at
/usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff80933227 in _sleep (ident=<value optimized out>,
lock=<value optimized out>,
    priority=<value optimized out>, wmesg=<value optimized out>,
sbt=<value optimized out>, pr=<value optimized out>,
    flags=<value optimized out>) at
/usr/src-stable/sys/kern/kern_synch.c:255
---Type <return> to continue, or q <return> to quit---
#4  0xffffffff802ddc0e in cam_periph_runccb (ccb=0xfffff80187888000,
error_routine=0xffffffff8030d2a0 <passerror>,
    camflags=CAM_RETRY_SELTO, sense_flags=18, ds=0xfffff8000984fa20) at
/usr/src-stable/sys/cam/cam_periph.c:969
#5  0xffffffff8030d233 in passdoioctl (dev=<value optimized out>,
cmd=<value optimized out>,
    addr=0xfffff800b50f4800 "", flag=<value optimized out>,
td=<value optimized out>)
    at /usr/src-stable/sys/cam/scsi/scsi_pass.c:680
#6  0xffffffff8030cf32 in passioctl (dev=0xfffff80009835e00, cmd=3302496258,
addr=0xfffff800b50f4800 "", flag=3,
    td=0xfffff801acb8a920) at /usr/src-stable/sys/cam/scsi/scsi_pass.c:533
#7  0xffffffff80819a99 in devfs_ioctl_f (fp=0xfffff800b5d6f320, com=3302496258,
data=0xfffff800b50f4800,
    cred=<value optimized out>, td=0xfffff801acb8a920) at
/usr/src-stable/sys/fs/devfs/devfs_vnops.c:759
#8  0xffffffff8097d865 in kern_ioctl (td=0xfffff801acb8a920, fd=<value
optimized out>, com=0) at file.h:320
#9  0xffffffff8097d560 in sys_ioctl (td=0xfffff801acb8a920,
uap=0xfffffe021de9aa40)
    at /usr/src-stable/sys/kern/sys_generic.c:718
#10 0xffffffff80d29c51 in amd64_syscall (td=0xfffff801acb8a920, traced=0) at
subr_syscall.c:134
#11 0xffffffff80d0eceb in Xfast_syscall () at
/usr/src-stable/sys/amd64/amd64/exception.S:391
#12 0x0000000800ff3f7a in ?? ()
Previous frame inner to this frame (corrupt stack?)

I suspect the daopen() is hanging on the disk associated with the given
passX device.

Are you running smartd or anything else that accesses passX devices?

Also, do you have a coredump from this or were you just running kgdb against
the live system?  If you have a coredump you can run ps against it (or use
some kgdb macros I have) to see the command name of the process doing the
passX ioctls.  We could also see which passX devices were blocking to see if
this is related to either AHCI or USB.  Also, do you possibly have any errors
in your dmesg from the disk devices?

-- 
John Baldwin

freebsd stable - Dec 2014 - Help debugging stable/10

Help debugging stable/10

Help debugging stable/10