truckman at freebsd.org (Don Lewis) writes:
>On 21 Jul, G. Paul Ziemba wrote:
>> GENERIC kernel r321349 results in the following about a minute after
>> multiuser boot completes.
>>
>> What additional information should I provide to assist in debugging?
>>
>> Many thanks!
>>
>> [Extracted from /var/crash/core.txt.NNN]
>>
>> KDB: stack backtrace:
>> #0 0xffffffff810f6ed7 at kdb_backtrace+0xa7
>> #1 0xffffffff810872a9 at vpanic+0x249
>> #2 0xffffffff81087060 at vpanic+0
>> #3 0xffffffff817d9aca at dblfault_handler+0x10a
>> #4 0xffffffff817ae93c at Xdblfault+0xac
>> #5 0xffffffff810cf76e at cpu_search_lowest+0x35e
>> #6 0xffffffff810cf76e at cpu_search_lowest+0x35e
>> #7 0xffffffff810d5b36 at sched_lowest+0x66
>> #8 0xffffffff810d1d92 at sched_pickcpu+0x522
>> #9 0xffffffff810d2b03 at sched_add+0xd3
>> #10 0xffffffff8101df5c at intr_event_schedule_thread+0x18c
>> #11 0xffffffff8101ddb0 at swi_sched+0xa0
>> #12 0xffffffff81261643 at netisr_queue_internal+0x1d3
>> #13 0xffffffff81261212 at netisr_queue_src+0x92
>> #14 0xffffffff81261677 at netisr_queue+0x27
>> #15 0xffffffff8123da5a at if_simloop+0x20a
>> #16 0xffffffff8123d83b at looutput+0x22b
>> #17 0xffffffff8131c4c6 at ip_output+0x1aa6
>>
>> doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:298
>> 298 dumptid = curthread->td_tid;
>> (kgdb) #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:298
>> #1  0xffffffff810867e8 in kern_reboot (howto=260)
>>     at /usr/src/sys/kern/kern_shutdown.c:366
>> #2  0xffffffff810872ff in vpanic (fmt=0xffffffff81e5f7e0 "double fault",
>>     ap=0xfffffe0839778ec0) at /usr/src/sys/kern/kern_shutdown.c:759
>> #3  0xffffffff81087060 in panic (fmt=0xffffffff81e5f7e0 "double fault")
>>     at /usr/src/sys/kern/kern_shutdown.c:690
>> #4  0xffffffff817d9aca in dblfault_handler (frame=0xfffffe0839778f40)
>>     at /usr/src/sys/amd64/amd64/trap.c:828
>> #5  <signal handler called>
>> #6  0xffffffff810cf422 in cpu_search_lowest (cg=0xffffffff826ccd98 <group+280>,
>>     low=<error reading variable: Cannot access memory at address 0xfffffe085cfa4ff8>)
>>     at /usr/src/sys/kern/sched_ule.c:782
>> #7  0xffffffff810cf76e in cpu_search (cg=0xffffffff826cccb8 <group+56>,
>>     low=0xfffffe085cfa53b8, high=0x0, match=1)
>>     at /usr/src/sys/kern/sched_ule.c:710
>> #8  cpu_search_lowest (cg=0xffffffff826cccb8 <group+56>,
>>     low=0xfffffe085cfa53b8) at /usr/src/sys/kern/sched_ule.c:783
>> #9  0xffffffff810cf76e in cpu_search (cg=0xffffffff826ccc80 <group>,
>>     low=0xfffffe085cfa5430, high=0x0, match=1)
>>     at /usr/src/sys/kern/sched_ule.c:710
>> #10 cpu_search_lowest (cg=0xffffffff826ccc80 <group>, low=0xfffffe085cfa5430)
>>     at /usr/src/sys/kern/sched_ule.c:783
>> #11 0xffffffff810d5b36 in sched_lowest (cg=0xffffffff826ccc80 <group>,
>>     mask=..., pri=28, maxload=2147483647, prefer=4)
>>     at /usr/src/sys/kern/sched_ule.c:815
>> #12 0xffffffff810d1d92 in sched_pickcpu (td=0xfffff8000a3a9000, flags=4)
>>     at /usr/src/sys/kern/sched_ule.c:1292
>> #13 0xffffffff810d2b03 in sched_add (td=0xfffff8000a3a9000, flags=4)
>>     at /usr/src/sys/kern/sched_ule.c:2447
>> #14 0xffffffff8101df5c in intr_event_schedule_thread (ie=0xfffff80007e7ae00)
>>     at /usr/src/sys/kern/kern_intr.c:917
>> #15 0xffffffff8101ddb0 in swi_sched (cookie=0xfffff8000a386880, flags=0)
>>     at /usr/src/sys/kern/kern_intr.c:1163
>> #16 0xffffffff81261643 in netisr_queue_internal (proto=1,
>>     m=0xfffff80026d00500, cpuid=0) at /usr/src/sys/net/netisr.c:1022
>> #17 0xffffffff81261212 in netisr_queue_src (proto=1, source=0,
>>     m=0xfffff80026d00500) at /usr/src/sys/net/netisr.c:1056
>> #18 0xffffffff81261677 in netisr_queue (proto=1, m=0xfffff80026d00500)
>>     at /usr/src/sys/net/netisr.c:1069
>> #19 0xffffffff8123da5a in if_simloop (ifp=0xfffff800116eb000,
>>     m=0xfffff80026d00500, af=2, hlen=0) at /usr/src/sys/net/if_loop.c:358
>> #20 0xffffffff8123d83b in looutput (ifp=0xfffff800116eb000,
>>     m=0xfffff80026d00500, dst=0xfffff80026ed6550, ro=0xfffff80026ed6530)
>>     at /usr/src/sys/net/if_loop.c:265
>> #21 0xffffffff8131c4c6 in ip_output (m=0xfffff80026d00500, opt=0x0,
>>     ro=0xfffff80026ed6530, flags=0, imo=0x0, inp=0xfffff80026ed63a0)
>>     at /usr/src/sys/netinet/ip_output.c:655
>> #22 0xffffffff8142e1c7 in tcp_output (tp=0xfffff80026eb2820)
>>     at /usr/src/sys/netinet/tcp_output.c:1447
>> #23 0xffffffff81447700 in tcp_usr_send (so=0xfffff80011ec2360, flags=0,
>>     m=0xfffff80026d14d00, nam=0x0, control=0x0, td=0xfffff80063ba1000)
>>     at /usr/src/sys/netinet/tcp_usrreq.c:967
>> #24 0xffffffff811776f1 in sosend_generic (so=0xfffff80011ec2360, addr=0x0,
>>     uio=0x0, top=0xfffff80026d14d00, control=0x0, flags=0,
>>     td=0xfffff80063ba1000) at /usr/src/sys/kern/uipc_socket.c:1360
>> #25 0xffffffff811779bd in sosend (so=0xfffff80011ec2360, addr=0x0, uio=0x0,
>>     top=0xfffff80026d14d00, control=0x0, flags=0, td=0xfffff80063ba1000)
>>     at /usr/src/sys/kern/uipc_socket.c:1405
>> #26 0xffffffff815276a2 in clnt_vc_call (cl=0xfffff80063ca0980,
>>     ext=0xfffffe085cfa6e38, proc=4, args=0xfffff80026c3bc00,
>>     resultsp=0xfffffe085cfa7110, utimeout=...)
>>     at /usr/src/sys/rpc/clnt_vc.c:413
>> #27 0xffffffff8152391c in clnt_reconnect_call (cl=0xfffff80063ca0c00,
>>     ext=0xfffffe085cfa6e38, proc=4, args=0xfffff80026c3bc00,
>>     resultsp=0xfffffe085cfa7110, utimeout=...)
>>     at /usr/src/sys/rpc/clnt_rc.c:271
>> #28 0xffffffff80e75628 in newnfs_request (nd=0xfffffe085cfa7110,
>>     nmp=0xfffff80007e79c00, clp=0x0, nrp=0xfffff80007e79d28,
>>     vp=0xfffff80011d9b588, td=0xfffff80063ba1000, cred=0xfffff800118c0100,
>>     prog=100003, vers=3, retsum=0x0, toplevel=1, xidp=0x0, dssep=0x0)
>>     at /usr/src/sys/fs/nfs/nfs_commonkrpc.c:760
>> #29 0xffffffff80ee87f1 in nfscl_request (nd=0xfffffe085cfa7110,
>>     vp=0xfffff80011d9b588, p=0xfffff80063ba1000, cred=0xfffff800118c0100,
>>     stuff=0x0) at /usr/src/sys/fs/nfsclient/nfs_clport.c:952
>> #30 0xffffffff80ea865c in nfsrpc_accessrpc (vp=0xfffff80011d9b588, mode=63,
>>     cred=0xfffff800118c0100, p=0xfffff80063ba1000, nap=0xfffffe085cfa72e0,
>>     attrflagp=0xfffffe085cfa73c0, rmodep=0xfffffe085cfa73b4, stuff=0x0)
>>     at /usr/src/sys/fs/nfsclient/nfs_clrpcops.c:243
>> #31 0xffffffff80ed9ec9 in nfs34_access_otw (vp=0xfffff80011d9b588, wmode=63,
>>     td=0xfffff80063ba1000, cred=0xfffff800118c0100,
>>     retmode=0xfffffe085cfa7540) at /usr/src/sys/fs/nfsclient/nfs_clvnops.c:283
>> #32 0xffffffff80ecfb64 in nfs_access (ap=0xfffffe085cfa75f8)
>>     at /usr/src/sys/fs/nfsclient/nfs_clvnops.c:426
>> #33 0xffffffff81a539d4 in VOP_ACCESS_APV (vop=0xffffffff822ff8b8 <newnfs_vnodeops>,
>>     a=0xfffffe085cfa75f8) at vnode_if.c:601
>> #34 0xffffffff80eda726 in VOP_ACCESS (vp=0xfffff80011d9b588, accmode=64,
>>     cred=0xfffff800118c0100, td=0xfffff80063ba1000) at ./vnode_if.h:254
>> #35 0xffffffff80ecb925 in nfs_lookup (ap=0xfffffe085cfa7cf8)
>>     at /usr/src/sys/fs/nfsclient/nfs_clvnops.c:1064
>> #36 0xffffffff81a52a44 in VOP_LOOKUP_APV (vop=0xffffffff822ff8b8 <newnfs_vnodeops>,
>>     a=0xfffffe085cfa7cf8) at vnode_if.c:127
>> #37 0xffffffff811c6aad in VOP_LOOKUP (dvp=0xfffff80011d9b588,
>>     vpp=0xfffffe085cfa8708, cnp=0xfffffe085cfa8730) at ./vnode_if.h:54
>> #38 0xffffffff811c5b64 in lookup (ndp=0xfffffe085cfa86a8)
>>     at /usr/src/sys/kern/vfs_lookup.c:886
>> #39 0xffffffff811c4aa2 in namei (ndp=0xfffffe085cfa86a8)
>>     at /usr/src/sys/kern/vfs_lookup.c:448
>> #40 0xffffffff810050f0 in do_execve (td=0xfffff80063ba1000,
>>     args=0xfffffe085cfa8838, mac_p=0x0) at /usr/src/sys/kern/kern_exec.c:446
>> #41 0xffffffff810047fa in kern_execve (td=0xfffff80063ba1000,
>>     args=0xfffffe085cfa8838, mac_p=0x0) at /usr/src/sys/kern/kern_exec.c:347
>> #42 0xffffffff810041e2 in sys_execve (td=0xfffff80063ba1000,
>>     uap=0xfffff80063ba1538) at /usr/src/sys/kern/kern_exec.c:221
>> #43 0xffffffff817da5ed in syscallenter (td=0xfffff80063ba1000)
>>     at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:131
>> #44 0xffffffff817d9d0b in amd64_syscall (td=0xfffff80063ba1000, traced=0)
>>     at /usr/src/sys/amd64/amd64/trap.c:903
>> #45 <signal handler called>
>> #46 0x0000000800d5285a in ?? ()
>> Backtrace stopped: Cannot access memory at address 0x7fffffffe7d8
>> (kgdb)
>The double fault is a pretty good indication that you overflowed the
>kernel stack. Having ~40 frames on the stack when the fault happened is
>consistent with that.
First, thank you for this answer. I've been tearing my hair out for
several hours (crash, fsck, tweak, repeat). I haven't done intensive
work with this kernel, so I hope you'll entertain my simple questions
below.
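
As a sanity check on the diagnosis: is this the right way to compare
the fault-time stack pointer against the thread's kernel stack in kgdb?
(The thread pointer is the td value from frame #23 above; tf_rsp and
td_kstack are my reading of the amd64 trapframe and struct thread.)

    (kgdb) frame 4
    (kgdb) p/x frame->tf_rsp
    (kgdb) p/x ((struct thread *)0xfffff80063ba1000)->td_kstack
    (kgdb) p ((struct thread *)0xfffff80063ba1000)->td_kstack_pages

If tf_rsp landed below td_kstack (the stack grows down toward it), I
take it that would confirm the overflow.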
>It looks like you are trying to execute a program from an NFS file
>system that is exported by the same host. This isn't exactly optimal
>...
Perhaps not optimal for the implementation, but I think it's a
common NFS scenario: define a set of NFS-provided paths for files
and use those paths on every host, regardless of whether a given
host serves the files in question or is merely a client.
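
For example, every host here, the server included, might carry the
same fstab entry (hypothetical names):

    # identical /etc/fstab line on server and clients
    fileserver:/export/home  /n/home  nfs  rw  0  0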
>Your best bet for a quick workaround for the stack overflow would be to
>rebuild the kernel with a larger value of KSTACK_PAGES. You can find
>the default in /usr/src/sys/<arch>/conf/NOTES.
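
If I follow, the rebuild amounts to something like this in a custom
config (KSTACK_PAGES=8 is just my guess at a sufficient value):

    # hypothetical config, e.g. /usr/src/sys/amd64/conf/BIGSTACK
    include GENERIC
    ident   BIGSTACK
    options KSTACK_PAGES=8      # NOTES gives the amd64 default as 4

I gather recent FreeBSD also accepts kern.kstack_pages as a loader
tunable in /boot/loader.conf, which would avoid the rebuild, though I
haven't verified that on r321349.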
By the way, my host is amd64, not i386. I don't know if that matters
to the larger discussion, but I'm not running in a resource-constrained
situation (the host has a Xeon E3-1231 with 4 cores/8 threads and
32 GB of RAM).
It seems (from /usr/include/machine/param.h) that PAGE_SIZE is 4096,
so the default KSTACK_PAGES=4 yields a stack size of 16384 bytes. Is
that right?
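
If there's an easier way to confirm those values at runtime, I'd guess
these sysctls (assuming both exist on this revision):

    sysctl hw.pagesize
    sysctl kern.kstack_pages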
It's not clear to me how to determine the correct value of KSTACK_PAGES
from the stack trace above. If I simply subtract the frame 0 address
from the frame 44 address, I get 0x817d9d0b - 0x810f6ed7 = 0x6e2e34 =
7,220,788 decimal. That can't be right; I must be misunderstanding
something.
>It would probably be a good idea to compute the differences in the stack
>pointer values between adjacent stack frames to see if any of them are
>consuming an excessive amount of stack space.
Hmm. I am assuming the first number for each frame is the stack pointer.
Yet that value sometimes increases with call depth (see, e.g., where
if_simloop calls netisr_queue above, but also other places). What is
going on?
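
Or is "info frame" the intended tool here? My guess at the procedure:

    (kgdb) frame 20
    (kgdb) info frame
    (kgdb) frame 21
    (kgdb) info frame

and then subtract the two "frame at 0x..." addresses to get the stack
consumed between adjacent frames.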
By the way, I have net.isr.dispatch=deferred (set earlier while I was
chasing this crash).
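(That is, the equivalent of this /etc/sysctl.conf line, applied at
runtime with sysctl(8):

    net.isr.dispatch=deferred

)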
thanks!
--
G. Paul Ziemba
FreeBSD unix:
8:51AM up 1:05, 4 users, load averages: 0.53, 0.42, 0.35