On 02/12/2013 12:22, Henri Hennebert wrote:
> On 01/19/2013 06:58, Brandon Gooch wrote:
>> On Fri, Jan 18, 2013 at 2:56 PM, Xin Li <delphij at delphij.net> wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA512
>>>
>>> On 01/18/13 12:50, Brandon Gooch wrote:
>>>> On Thu, Jan 10, 2013 at 4:25 PM, Xin Li <delphij at delphij.net> wrote:
>>>>
>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256
>>>>
>>>> To all: this has become harder and harder to replicate lately. I've
>>>> tried these options, and the most important progress is that it's
>>>> possible to get a crashdump when debug.debugger_on_panic=0. I
>>>> managed to get a backtrace which indicates the panic occurs when
>>>> trying to do mtx_lock(&Giant) -> __mtx_lock_sleep -> turnstile_wait
>>>> -> propagate_priority, but after I added some instrumentation to
>>>> the surrounding code and enabled INVARIANTS and/or WITNESS, it
>>>> mysteriously went away.
>>>>
>>>> Reverting my instrumentation code and updating to the latest svn made
>>>> the issue disappear for one day. I hit it again today but
>>>> unfortunately didn't get a successful dump, and after a reboot I can't
>>>> reproduce it again :(
>>>>
>>>> Still trying...
>>>>
>>>>
>>>> Any updates Xin?
>>>
>>> No, it mysteriously disappeared for now. As far as I understand the
>>> recent svn commits, I didn't see anybody committing something that
>>> fixes it, but I can no longer panic my system, with or without
>>> debugging code :(
>>>
>>>> I was actually hitting what I believe to be exactly the same issue
>>>> as you on one of my systems, and, as you've seen, adding any extra
>>>> debugging or diagnostics seemed to eliminate the issue.
>>>>
>>>> I was able to generate quite a few vmcores and still have these
>>>> sitting around in my filesystem (along with the kernels that helped
>>>> produce them).
>>>>
>>>> I can recreate this crash on my system by compiling the NVIDIA
>>>> driver with clang at -O1 and above. Although it's been noted that
>>>> this issue has been seen in scenarios without an NVIDIA driver in
>>>> the mix, whatever is happening in the kernel to cause the panic is
>>>> somehow triggered by this, at least on my system.
>>>
>>> I'm not sure if this is the same problem. Could you please try using
>>> gcc to compile the NVIDIA driver and see if that "fixes" the problem?
>>>
>>> Cheers,
>>> - --
>>> Xin LI <delphij at delphij.net> https://www.delphij.net/
>>> FreeBSD - The Power to Serve! Live free or die
>>>
>>
>> Indeed, a gcc compiled NVIDIA module eliminates the issue, sorry if I
>> hadn't mentioned this earlier.
>>
>> What was happening to me at first was that my system would just hang while
>> booting. I was able to figure out that it was during /etc/rc.d/initrandom.
>> I actually got to a point where I removed the call to sysctl -a from
>> 'better_than_nothing()' in /etc/rc.d/initrandom to have a booting system. I
>> finally had a situation where I could get a panic by adding SW_WATCHDOG to
>> my kernel and running watchdogd(8).
>>
>> For me, this panic would come and go seemingly at random as well, and I
>> couldn't fumble my way around in the debugger to learn much of anything
>> when I first started seeing it. I just started a process of modularizing
>> everything I could in my kernel config, then loading modules 1-by-1 and
>> booting over-and-over until I finally found what appeared to be the
>> problem, which was the NVIDIA module compiled with clang.
>>
>> Oh, another thing: at times it seemed as though it was the number of
>> modules loaded, as I could get the hang with 41 modules loaded, but not 40
>> or 42?! I admit, when I was seeing that behavior, I hadn't eliminated the
>> NVIDIA driver from my loaded modules. I need to revisit the panic situation
>> to confirm this particular strangeness.
>>
>> Here's the last panic I had:
>>
>> Unread portion of the kernel message buffer:
>> = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags = interrupt enabled, resume, IOPL = 0
>> current process = 1175 (sysctl)
>>
>> (kgdb) bt
>> #0 doadump (textdump=1694704112) at pcpu.h:229
>> #1 0xffffffff802fab82 in db_fncall (dummy1=<value optimized out>,
>> dummy2=<value optimized out>, dummy3=<value optimized out>,
>> dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:578
>> #2 0xffffffff802fa85a in db_command (last_cmdp=<value optimized out>,
>> cmd_table=<value optimized out>, dopager=1) at
>> /usr/src/sys/ddb/db_command.c:449
>> #3 0xffffffff802fa612 in db_command_loop () at
>> /usr/src/sys/ddb/db_command.c:502
>> #4 0xffffffff802fcf60 in db_trap (type=<value optimized out>, code=0) at
>> /usr/src/sys/ddb/db_main.c:231
>> #5 0xffffffff804a7b93 in kdb_trap (type=12, code=0,
>> tf=<value optimized out>) at /usr/src/sys/kern/subr_kdb.c:654
>> #6 0xffffffff807157c5 in trap_fatal (frame=0xffffff8865032670,
>> eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:867
>> #7 0xffffffff80715adb in trap_pfault (frame=0x0, usermode=0) at
>> /usr/src/sys/amd64/amd64/trap.c:698
>> #8 0xffffffff8071529b in trap (frame=0xffffff8865032670) at
>> /usr/src/sys/amd64/amd64/trap.c:463
>> #9 0xffffffff806ff382 in calltrap () at exception.S:228
>> #10 0xffffffff8047bd50 in sysctl_sysctl_next_ls (lsp=<value optimized out>,
>> name=0xffffff8865032a80, namelen=<value optimized out>,
>> next=0xffffff8865032898, len=0xffffff8865032904, level=3) at
>> /usr/src/sys/kern/kern_sysctl.c:759
>> #11 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
>> name=0xffffff8865032a7c, namelen=<value optimized out>,
>> next=0xffffff8865032894, len=0xffffff8865032904, level=2) at
>> /usr/src/sys/kern/kern_sysctl.c:786
>> #12 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
>> name=0xffffff8865032a78, namelen=<value optimized out>,
>> next=0xffffff8865032890, len=0xffffff8865032904, level=1) at
>> /usr/src/sys/kern/kern_sysctl.c:786
>> #13 0xffffffff8047bca3 in sysctl_sysctl_next (oidp=<value optimized out>,
>> arg1=0xffffff8865032a78, arg2=4, req=0xffffff88650329a8) at
>> /usr/src/sys/kern/kern_sysctl.c:808
>> #14 0xffffffff8047b03f in sysctl_root (arg1=<value optimized out>,
>> arg2=<value optimized out>) at /usr/src/sys/kern/kern_sysctl.c:1513
>> #15 0xffffffff8047b5d8 in userland_sysctl (td=<value optimized out>,
>> name=0xffffff8865032a70, namelen=<value optimized out>,
>> old=<value optimized out>, oldlenp=<value optimized out>,
>> inkernel=<value optimized out>, new=<value optimized out>,
>> newlen=<value optimized out>, retval=<value optimized out>,
>> flags=1694706064) at /usr/src/sys/kern/kern_sysctl.c:1623
>> #16 0xffffffff8047b3c4 in sys___sysctl (td=0xfffffe001e2d4900,
>> uap=0xffffff8865032b80) at /usr/src/sys/kern/kern_sysctl.c:1549
>> #17 0xffffffff807160f7 in amd64_syscall (td=0xfffffe001e2d4900, traced=0)
>> at subr_syscall.c:135
>> #18 0xffffffff806ff66b in Xfast_syscall () at exception.S:387
>> #19 0x000000080093697a in ?? ()
>> Previous frame inner to this frame (corrupt stack?)
>> Current language: auto; currently minimal
>>
>> Any ideas on where to look through this vmcore?
>>
>> -Brandon
>
> FWIW
>
> Just going from 9.1-STABLE r245423M to 9.1-STABLE #0 r246457M triggers
> this problem.
>
> I dropped sysctl -a from /etc/rc.d/initrandom and all is back to normal.
>
> I have nvidia-driver-304.64 compiled with gcc, as are all my ports.
>
> Henri
Just a follow-up:
sysctl hw.nvidia generates a page fault:
morzine.restart.bel dumped core - see /var/crash/vmcore.86
Wed Feb 13 17:29:14 CET 2013
FreeBSD morzine.restart.bel 9.1-STABLE FreeBSD 9.1-STABLE #0 r246457M:
Thu Feb 7 15:09:16 CET 2013
root at morzine.restart.bel:/usr/obj/usr/src/sys/MORZINE i386
panic: page fault
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i386-marcel-freebsd"...
Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x14
fault code = supervisor read, page not present
instruction pointer = 0x20:0xa07647d4
stack pointer = 0x28:0xfd1f0ac8
frame pointer = 0x28:0xfd1f0aec
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = resume, IOPL = 0
current process = 2369 (sysctl)
trap number = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper(a0a44a64,0,70617200,46,fc,...) at
db_trace_self_wrapper+0x2d/frame 0xfd1f07c0
kdb_backtrace(a0a760c5,1,a0a0bc3c,fd1f087c,1,...) at
kdb_backtrace+0x30/frame 0xfd1f0828
panic(a0a0bc3c,a0a76eaf,cfcb0d74,1,1,...) at panic+0x1bb/frame 0xfd1f0870
trap_fatal(fd1f0900,a09d2ee1,a0b17130,b5592000,0,...) at
trap_fatal+0x33a/frame 0xfd1f08c0
trap_pfault(14,c,1,ffffffff,fd1f0994,...) at trap_pfault+0x31d/frame
0xfd1f0940
trap(fd1f0a88) at trap+0x4ef/frame 0xfd1f0a7c
calltrap() at calltrap+0x6/frame 0xfd1f0a7c
--- trap 0xc, eip = 0xa07647d4, esp = 0xfd1f0ac8, ebp = 0xfd1f0aec ---
turnstile_broadcast(0,0,a7b80b40,0,fd1f0b38,...) at
turnstile_broadcast+0xa4/frame 0xfd1f0aec
_mtx_unlock_sleep(a0b6a00c,0,0,0,fd1f0b58,...) at
_mtx_unlock_sleep+0x57/frame 0xfd1f0b04
sysctl_root(fd1f0b58,fd1f0b64,4,a09c87fe,bfc0d450,...) at
sysctl_root+0x248/frame 0xfd1f0b38
userland_sysctl(cfcb0bc0,fd1f0bd4,5,0,9fbfca2c,...) at
userland_sysctl+0x1da/frame 0xfd1f0b9c
sys___sysctl(cfcb0bc0,fd1f0cc8,1,fd1f0cb0,0,...) at
sys___sysctl+0x95/frame 0xfd1f0c40
syscall(fd1f0d08) at syscall+0x452/frame 0xfd1f0cfc
Xint0x80_syscall() at Xint0x80_syscall+0x21/frame 0xfd1f0cfc
--- syscall (202, FreeBSD ELF32, sys___sysctl), eip = 0x33d65f6b, esp =
0x9fbfc9e4, ebp = 0x9fbfd2ac ---
Uptime: 5h45m16s
Physical memory: 3046 MB
<CLIP>
(kgdb) #0 doadump (textdump=1) at pcpu.h:249
#1 0xa071b78a in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:448
#2 0xa071bc17 in panic (fmt=<value optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:636
#3 0xa09cc21a in trap_fatal (frame=<value optimized out>, eva=20)
at /usr/src/sys/i386/i386/trap.c:1043
#4 0xa09cc54d in trap_pfault (frame=0x0, usermode=<value optimized out>,
eva=0) at /usr/src/sys/i386/i386/trap.c:858
#5 0xa09cbb3f in trap (frame=0xfd1f0a88) at
/usr/src/sys/i386/i386/trap.c:555
#6 0xa09b5c0c in calltrap () at exception.s:169
#7 0xa07647d4 in turnstile_broadcast (ts=0x0, queue=0)
at /usr/src/sys/kern/subr_turnstile.c:837
#8 0xa0707217 in _mtx_unlock_sleep (m=0xa0b6a00c, opts=-48297228,
file=0xfd1f0af4 "", line=-48297228) at
/usr/src/sys/kern/kern_mutex.c:715
#9 0xa0728418 in sysctl_root (arg1=<value optimized out>,
arg2=<value optimized out>) at /usr/src/sys/kern/kern_sysctl.c:1515
#10 0xa072899a in userland_sysctl (td=0x4, old=<value optimized out>,
oldlenp=<value optimized out>, inkernel=<value optimized out>,
new=<value optimized out>, newlen=<value optimized out>,
retval=<value optimized out>, flags=-1603107360)
at /usr/src/sys/kern/kern_sysctl.c:1623
#11 0xa0728785 in sys___sysctl (uap=0xfd1f0cc8)
at /usr/src/sys/kern/kern_sysctl.c:1549
#12 0xa09ccc22 in syscall (frame=<value optimized out>) at
subr_syscall.c:135
#13 0xa09b5ca1 in Xint0x80_syscall () at exception.s:267
#14 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language: auto; currently minimal
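
For anyone comparing the two traces in this thread: in a kgdb backtrace, the
frame directly below calltrap is the code that took the fault
(turnstile_broadcast with ts=0x0 here, sysctl_sysctl_next_ls in the amd64
trace quoted above). Below is a minimal sketch for pulling that frame out of
a pasted "bt" listing, assuming the usual kgdb frame-line format.
parse_frames and faulting_function are hypothetical helper names, not part
of kgdb or FreeBSD.

```python
import re

# Matches kgdb frame lines such as
#   "#7 0xa07647d4 in turnstile_broadcast (ts=0x0, queue=0)"
# and frames printed without an address, such as
#   "#0 doadump (textdump=1) at pcpu.h:249".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)\s*\(")


def parse_frames(backtrace):
    """Return a list of (frame_number, function_name) from kgdb 'bt' output."""
    frames = []
    for line in backtrace.splitlines():
        # Strip mail quote markers ("> ", ">> ") so quoted traces parse too.
        line = re.sub(r"^(?:>\s?)+", "", line.strip())
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def faulting_function(backtrace):
    """Return the function in the frame just below calltrap, or None."""
    frames = parse_frames(backtrace)
    for index, (_, name) in enumerate(frames):
        if name == "calltrap" and index + 1 < len(frames):
            return frames[index + 1][1]
    return None


# A few frames copied from the i386 trace in this message.
SAMPLE = """\
#5 0xa09cbb3f in trap (frame=0xfd1f0a88) at
#6 0xa09b5c0c in calltrap () at exception.s:169
#7 0xa07647d4 in turnstile_broadcast (ts=0x0, queue=0)
#8 0xa0707217 in _mtx_unlock_sleep (m=0xa0b6a00c, opts=-48297228,
"""

if __name__ == "__main__":
    print(faulting_function(SAMPLE))
```

Frames that kgdb cannot resolve (the "?? ()" lines) are skipped, and
continuation lines without a leading "#N" are ignored, so wrapped mail
output should not confuse it.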
Henri