thr3ads.net - freebsd stable - stable/11 r321349 crashing immediately [Jul 2017]

If this information is useful, please help other people find it:
Share via:

Don Lewis

2017-Jul-22 20:12 UTC

stable/11 r321349 crashing immediately

On 21 Jul, G. Paul Ziemba wrote:> truckman at freebsd.org (Don Lewis) writes:
> 
>>On 21 Jul, G. Paul Ziemba wrote:
>>> GENERIC kernel r321349 results in the following about a minute
after
>>> multiuser boot completes.
>>> 
>>> What additional information should I provide to assist in
debugging?
>>> 
>>> Many thanks!
>>> 
>>> [Extracted from /var/crash/core.txt.NNN]
>>> 
>>> KDB: stack backtrace:
>>> #0 0xffffffff810f6ed7 at kdb_backtrace+0xa7
>>> #1 0xffffffff810872a9 at vpanic+0x249
>>> #2 0xffffffff81087060 at vpanic+0
>>> #3 0xffffffff817d9aca at dblfault_handler+0x10a
>>> #4 0xffffffff817ae93c at Xdblfault+0xac
>>> #5 0xffffffff810cf76e at cpu_search_lowest+0x35e
>>> #6 0xffffffff810cf76e at cpu_search_lowest+0x35e
>>> #7 0xffffffff810d5b36 at sched_lowest+0x66
>>> #8 0xffffffff810d1d92 at sched_pickcpu+0x522
>>> #9 0xffffffff810d2b03 at sched_add+0xd3
>>> #10 0xffffffff8101df5c at intr_event_schedule_thread+0x18c
>>> #11 0xffffffff8101ddb0 at swi_sched+0xa0
>>> #12 0xffffffff81261643 at netisr_queue_internal+0x1d3
>>> #13 0xffffffff81261212 at netisr_queue_src+0x92
>>> #14 0xffffffff81261677 at netisr_queue+0x27
>>> #15 0xffffffff8123da5a at if_simloop+0x20a
>>> #16 0xffffffff8123d83b at looutput+0x22b
>>> #17 0xffffffff8131c4c6 at ip_output+0x1aa6
>>> 
>>> doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:298
>>> 298             dumptid = curthread->td_tid;
>>> (kgdb) #0  doadump (textdump=1) at
/usr/src/sys/kern/kern_shutdown.c:298
>>> #1  0xffffffff810867e8 in kern_reboot (howto=260)
>>>     at /usr/src/sys/kern/kern_shutdown.c:366
>>> #2  0xffffffff810872ff in vpanic (fmt=0xffffffff81e5f7e0
"double fault",
>>>     ap=0xfffffe0839778ec0) at /usr/src/sys/kern/kern_shutdown.c:759
>>> #3  0xffffffff81087060 in panic (fmt=0xffffffff81e5f7e0
"double fault")
>>>     at /usr/src/sys/kern/kern_shutdown.c:690
>>> #4  0xffffffff817d9aca in dblfault_handler
(frame=0xfffffe0839778f40)
>>>     at /usr/src/sys/amd64/amd64/trap.c:828
>>> #5  <signal handler called>
>>> #6  0xffffffff810cf422 in cpu_search_lowest (
>>>     cg=0xffffffff826ccd98 <group+280>, 
>>>     low=<error reading variable: Cannot access memory at address
0xfffffe085cfa4
>>> ff8>) at /usr/src/sys/kern/sched_ule.c:782
>>> #7  0xffffffff810cf76e in cpu_search (cg=0xffffffff826cccb8
<group+56>,
>>>     low=0xfffffe085cfa53b8, high=0x0, match=1)
>>>     at /usr/src/sys/kern/sched_ule.c:710
>>> #8  cpu_search_lowest (cg=0xffffffff826cccb8 <group+56>, 
>>>     low=0xfffffe085cfa53b8) at /usr/src/sys/kern/sched_ule.c:783
>>> #9  0xffffffff810cf76e in cpu_search (cg=0xffffffff826ccc80
<group>,
>>>     low=0xfffffe085cfa5430, high=0x0, match=1)
>>>     at /usr/src/sys/kern/sched_ule.c:710
>>> #10 cpu_search_lowest (cg=0xffffffff826ccc80 <group>,
low=0xfffffe085cfa5430)
>>>     at /usr/src/sys/kern/sched_ule.c:783
>>> #11 0xffffffff810d5b36 in sched_lowest (cg=0xffffffff826ccc80
<group>,
>>>     mask=..., pri=28, maxload=2147483647, prefer=4)
>>>     at /usr/src/sys/kern/sched_ule.c:815
>>> #12 0xffffffff810d1d92 in sched_pickcpu (td=0xfffff8000a3a9000,
flags=4)
>>>     at /usr/src/sys/kern/sched_ule.c:1292
>>> #13 0xffffffff810d2b03 in sched_add (td=0xfffff8000a3a9000,
flags=4)
>>>     at /usr/src/sys/kern/sched_ule.c:2447
>>> #14 0xffffffff8101df5c in intr_event_schedule_thread
(ie=0xfffff80007e7ae00)
>>>     at /usr/src/sys/kern/kern_intr.c:917
>>> #15 0xffffffff8101ddb0 in swi_sched (cookie=0xfffff8000a386880,
flags=0)
>>>     at /usr/src/sys/kern/kern_intr.c:1163
>>> #16 0xffffffff81261643 in netisr_queue_internal (proto=1, 
>>>     m=0xfffff80026d00500, cpuid=0) at
/usr/src/sys/net/netisr.c:1022
>>> #17 0xffffffff81261212 in netisr_queue_src (proto=1, source=0, 
>>>     m=0xfffff80026d00500) at /usr/src/sys/net/netisr.c:1056
>>> #18 0xffffffff81261677 in netisr_queue (proto=1,
m=0xfffff80026d00500)
>>>     at /usr/src/sys/net/netisr.c:1069
>>> #19 0xffffffff8123da5a in if_simloop (ifp=0xfffff800116eb000, 
>>>     m=0xfffff80026d00500, af=2, hlen=0) at
/usr/src/sys/net/if_loop.c:358
>>> #20 0xffffffff8123d83b in looutput (ifp=0xfffff800116eb000, 
>>>     m=0xfffff80026d00500, dst=0xfffff80026ed6550,
ro=0xfffff80026ed6530)
>>>     at /usr/src/sys/net/if_loop.c:265
>>> #21 0xffffffff8131c4c6 in ip_output (m=0xfffff80026d00500, opt=0x0,
>>>     ro=0xfffff80026ed6530, flags=0, imo=0x0,
inp=0xfffff80026ed63a0)
>>>     at /usr/src/sys/netinet/ip_output.c:655
>>> #22 0xffffffff8142e1c7 in tcp_output (tp=0xfffff80026eb2820)
>>>     at /usr/src/sys/netinet/tcp_output.c:1447
>>> #23 0xffffffff81447700 in tcp_usr_send (so=0xfffff80011ec2360,
flags=0,
>>>     m=0xfffff80026d14d00, nam=0x0, control=0x0,
td=0xfffff80063ba1000)
>>>     at /usr/src/sys/netinet/tcp_usrreq.c:967
>>> #24 0xffffffff811776f1 in sosend_generic (so=0xfffff80011ec2360,
addr=0x0,
>>>     uio=0x0, top=0xfffff80026d14d00, control=0x0, flags=0, 
>>>     td=0xfffff80063ba1000) at /usr/src/sys/kern/uipc_socket.c:1360
>>> #25 0xffffffff811779bd in sosend (so=0xfffff80011ec2360, addr=0x0,
uio=0x0,
>>>     top=0xfffff80026d14d00, control=0x0, flags=0,
td=0xfffff80063ba1000)
>>>     at /usr/src/sys/kern/uipc_socket.c:1405
>>> #26 0xffffffff815276a2 in clnt_vc_call (cl=0xfffff80063ca0980, 
>>>     ext=0xfffffe085cfa6e38, proc=4, args=0xfffff80026c3bc00, 
>>>     resultsp=0xfffffe085cfa7110, utimeout=...)
>>>     at /usr/src/sys/rpc/clnt_vc.c:413
>>> #27 0xffffffff8152391c in clnt_reconnect_call
(cl=0xfffff80063ca0c00,
>>>     ext=0xfffffe085cfa6e38, proc=4, args=0xfffff80026c3bc00, 
>>>     resultsp=0xfffffe085cfa7110, utimeout=...)
>>>     at /usr/src/sys/rpc/clnt_rc.c:271
>>> #28 0xffffffff80e75628 in newnfs_request (nd=0xfffffe085cfa7110, 
>>>     nmp=0xfffff80007e79c00, clp=0x0, nrp=0xfffff80007e79d28, 
>>>     vp=0xfffff80011d9b588, td=0xfffff80063ba1000,
cred=0xfffff800118c0100,
>>>     prog=100003, vers=3, retsum=0x0, toplevel=1, xidp=0x0,
dssep=0x0)
>>>     at /usr/src/sys/fs/nfs/nfs_commonkrpc.c:760
>>> #29 0xffffffff80ee87f1 in nfscl_request (nd=0xfffffe085cfa7110, 
>>>     vp=0xfffff80011d9b588, p=0xfffff80063ba1000,
cred=0xfffff800118c0100,
>>>     stuff=0x0) at /usr/src/sys/fs/nfsclient/nfs_clport.c:952
>>> #30 0xffffffff80ea865c in nfsrpc_accessrpc (vp=0xfffff80011d9b588,
mode=63,
>>>     cred=0xfffff800118c0100, p=0xfffff80063ba1000,
nap=0xfffffe085cfa72e0,
>>>     attrflagp=0xfffffe085cfa73c0, rmodep=0xfffffe085cfa73b4,
stuff=0x0)
>>>     at /usr/src/sys/fs/nfsclient/nfs_clrpcops.c:243
>>> #31 0xffffffff80ed9ec9 in nfs34_access_otw (vp=0xfffff80011d9b588,
wmode=63,
>>>     td=0xfffff80063ba1000, cred=0xfffff800118c0100, 
>>>     retmode=0xfffffe085cfa7540) at
/usr/src/sys/fs/nfsclient/nfs_clvnops.c:283
>>> #32 0xffffffff80ecfb64 in nfs_access (ap=0xfffffe085cfa75f8)
>>>     at /usr/src/sys/fs/nfsclient/nfs_clvnops.c:426
>>> #33 0xffffffff81a539d4 in VOP_ACCESS_APV (
>>>     vop=0xffffffff822ff8b8 <newnfs_vnodeops>,
a=0xfffffe085cfa75f8)
>>>     at vnode_if.c:601
>>> #34 0xffffffff80eda726 in VOP_ACCESS (vp=0xfffff80011d9b588,
accmode=64,
>>>     cred=0xfffff800118c0100, td=0xfffff80063ba1000) at
./vnode_if.h:254
>>> #35 0xffffffff80ecb925 in nfs_lookup (ap=0xfffffe085cfa7cf8)
>>>     at /usr/src/sys/fs/nfsclient/nfs_clvnops.c:1064
>>> #36 0xffffffff81a52a44 in VOP_LOOKUP_APV (
>>>     vop=0xffffffff822ff8b8 <newnfs_vnodeops>,
a=0xfffffe085cfa7cf8)
>>>     at vnode_if.c:127
>>> #37 0xffffffff811c6aad in VOP_LOOKUP (dvp=0xfffff80011d9b588, 
>>>     vpp=0xfffffe085cfa8708, cnp=0xfffffe085cfa8730) at
./vnode_if.h:54
>>> #38 0xffffffff811c5b64 in lookup (ndp=0xfffffe085cfa86a8)
>>>     at /usr/src/sys/kern/vfs_lookup.c:886
>>> #39 0xffffffff811c4aa2 in namei (ndp=0xfffffe085cfa86a8)
>>>     at /usr/src/sys/kern/vfs_lookup.c:448
>>> #40 0xffffffff810050f0 in do_execve (td=0xfffff80063ba1000, 
>>>     args=0xfffffe085cfa8838, mac_p=0x0) at
/usr/src/sys/kern/kern_exec.c:446
>>> #41 0xffffffff810047fa in kern_execve (td=0xfffff80063ba1000, 
>>>     args=0xfffffe085cfa8838, mac_p=0x0) at
/usr/src/sys/kern/kern_exec.c:347
>>> #42 0xffffffff810041e2 in sys_execve (td=0xfffff80063ba1000, 
>>>     uap=0xfffff80063ba1538) at /usr/src/sys/kern/kern_exec.c:221
>>> #43 0xffffffff817da5ed in syscallenter (td=0xfffff80063ba1000)
>>>     at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:131
>>> #44 0xffffffff817d9d0b in amd64_syscall (td=0xfffff80063ba1000,
traced=0)
>>>     at /usr/src/sys/amd64/amd64/trap.c:903
>>> #45 <signal handler called>
>>> #46 0x0000000800d5285a in ?? ()
>>> Backtrace stopped: Cannot access memory at address 0x7fffffffe7d8
>>> (kgdb) 
> 
>>The double fault is a pretty good indication that you overflowed the
>>kernel stack.  Having ~40 frames on the stack when the fault happened is
>>consistent with that.
> 
> First, thank you for this answer. I've been tearing my hair out for
> several hours (crash, fsck, tweak, repeat). I haven't done intensive
> work with this kernel so hope you'll entertain my simple questions
> below.
> 
>>It looks like you are trying to execute a program from an NFS file
>>system that is exported by the same host.  This isn't exactly
optimal
>>...
> 
> Perhaps not optimal for the implementation, but I think it's a
> common NFS scenario: define a set of NFS-provided paths for files
> and use those path names on all hosts, regardless of whether they
> happen to be serving the files in question or merely clients.
> 
>>Your best bet for a quick workaround for the stack overflow would be to
>>rebuild the kernel with a larger value of KSTACK_PAGES.  You can find
>>teh default in /usr/src/sys/<arch>/conf/NOTES.
> 
> By the way, my host is amd64, not i386. I don't know if that matters
> to the larger discussion, but I'm not running in a resource-constrained
> situation (i.e., the host has Xeon E3-1231 (4 cores, 8 threads) and
> 32GB ram)
> 
> It seems (from /usr/include/machine/param.h) that PAGE_SIZE is 2048,
> so the default KSTACK_PAGES=4 yields a stack size of 8192. Is that
> right?
Page size is 4096.

There is a read-only sysctl that shows the current value of
kstack_pages:
  # sysctl kern.kstack_pages
  kern.kstack_pages: 4

It's interesting that you are running into this on amd64.  Usually i386
is the problem child.

Since the problem crops up in the scheduler, which is preparing to
switch to another process (with a stack switch), I suspect that you
aren't missing the limit by much.  Try using the tunable to bump
kstack_pages by one.  I didn't see the tunable earlier since it is
handled in a separate .c file from where KSTACK_PAGES and the
kstack_pages sysctl appear in the source.
> It's not clear to me how to determine the correct value of KSTACK_PAGES
> from the stack trace above. If I merely subtract frame 44 from frame 0,
> I get 817d9d0b - 810f6ed7 = 6e2e34 = 7,220,788 decimal. That can't be
> right. I must be misunderstanding something.
> 
>>It would probably be a good idea to compute the differences in the stack
>>pointer values between adjacent stack frames to see of any of them are
>>consuming an excessive amount of stack space.
> 
> Hmm. I am assuming the first number for each frame is the stack pointer.
> Yet, sometimes that value increases with call depth (see, e.g., when
> if_simloop calls netisr_queue above, but also other places). What is
> going on?
Nope, it's the pc value at each call location.

Point kgdb at the vmcore file and execute "i r rsp" (short for info
registers) in each stack frame.  When the stack overflows, I think there
is a trap that causes a switch to an alternate stack for the double
fault handler.
> By the way, I have net.isr.dispatch=deferred (set earlier while I was
> chasing this crash).
> 
> thanks!

G. Paul Ziemba

2017-Jul-23 00:31 UTC

head link

stable/11 r321349 crashing immediately

On Sat, Jul 22, 2017 at 01:12:29PM -0700, Don Lewis
wrote:> On 21 Jul, G. Paul Ziemba wrote:
> >>Your best bet for a quick workaround for the stack overflow would
be to
> >>rebuild the kernel with a larger value of KSTACK_PAGES.  You can
find
> >>teh default in /usr/src/sys/<arch>/conf/NOTES.
I bumped it from the default 4 to 5 in /boot/loader.conf:

	kern.kstack_pages=5

and that prevented this crash. Uptime 5.5 hours at this point (instead of
1.5 minutes).

So what's the down-side of increasing kstack_pages? What if I made it 10?
I see comments elsewhere about reducing space for user-mode threads but I'm
not sure what that means in practical terms, or if there is some other
overarching tuning parameter that should also be increased.
> Page size is 4096.
Ah, I forgot to count the 2^0 bit.
> It's interesting that you are running into this on amd64.  Usually i386
> is the problem child.
Maybe stack frames are bigger due to 64-bit variables? (And of course
we get paid mostly for adding code, not so much for removing it)
> >>It would probably be a good idea to compute the differences in the
stack
> >>pointer values between adjacent stack frames to see of any of them
are
> >>consuming an excessive amount of stack space.
For our collective amusement, I noted the stack pointer for each frame and
calculated frame size and cumulative stack consumption. If there is some
other stack overhead not shown in the trace, I can see it going over 0x4000:

Frame   Stack Pointer           sz  cumu function
-----   -------------           --- ---- ----------------
 44     0xfffffe085cfa8a10               amd64_syscall
 43     0xfffffe085cfa88b0      160  160 syscallenter
 42     0xfffffe085cfa87f0      220  180 sys_execve
 41     0xfffffe085cfa87c0       30  1B0 kern_execve
 40     0xfffffe085cfa8090      730  8E0 do_execve
 39     0xfffffe085cfa7ec0      1D0  AB0 namei
 38     0xfffffe085cfa7d40      180  C30 lookup
 37     0xfffffe085cfa7cf0       50  C80 VOP_LOOKUP
 36     0xfffffe085cfa7c80       70  CF0        VOP_LOOKUP_APV
 35     0xfffffe085cfa7650      630 1320 nfs_lookup
 34     0xfffffe085cfa75f0       60 1380 VOP_ACCESS
 33     0xfffffe085cfa7580       70 13F0 VOP_ACCESS_APV
 32     0xfffffe085cfa7410      170 1560 nfs_access
 31     0xfffffe085cfa7240      1D0 1730 nfs34_access_otw
 30     0xfffffe085cfa7060      1E0 1910 nfsrpc_accessrpc
 29     0xfffffe085cfa6fb0       B0 19C0 nfscl_request
 28     0xfffffe085cfa6b20      490 1E50 newnfs_request
 27     0xfffffe085cfa6980      1A0 1FF0 clnt_reconnect_call
 26     0xfffffe085cfa6520      460 2450 clnt_vc_call
 25     0xfffffe085cfa64c0       60 24B0 sosend
 24     0xfffffe085cfa6280      240 26F0 sosend_generic
 23     0xfffffe085cfa6110      170 2860 tcp_usr_send
 22     0xfffffe085cfa5ca0      470 2CD0 tcp_output
 21     0xfffffe085cfa5900      3A0 3070 ip_output
 20     0xfffffe085cfa5880       80 30F0 looutput
 19     0xfffffe085cfa5800       80 3170 if_simloop
 18     0xfffffe085cfa57d0       30 31A0 netisr_queue
 17     0xfffffe085cfa5780       50 31F0 netisr_queue_src
 16     0xfffffe085cfa56f0       90 3280 netisr_queue_internal
 15     0xfffffe085cfa56a0       50 32D0 swi_sched
 14     0xfffffe085cfa5620       80 3350 intr_event_schedule_thread
 13     0xfffffe085cfa55b0       70 33C0 sched_add
 12     0xfffffe085cfa5490      120 34E0 sched_pickcpu
 11     0xfffffe085cfa5420       70 3550 sched_lowest
 10     0xfffffe085cfa52a0      180 36D0 cpu_search_lowest
  9     0xfffffe085cfa52a0        0 36D0 cpu_search
  8     0xfffffe085cfa5120      180 3850 cpu_search_lowest
  7     0xfffffe085cfa5120        0 3850 cpu_search
  6     0xfffffe085cfa4fa0      180 39D0 cpu_search_lowest
  5     0xfffffe0839778f40      signal handler

-- 
G. Paul Ziemba
FreeBSD unix:
 4:36PM  up  5:28, 8 users, load averages: 6.53, 7.79, 7.94

freebsd stable - Jul 2017 - stable/11 r321349 crashing immediately

stable/11 r321349 crashing immediately

stable/11 r321349 crashing immediately