We're currently experiencing a large number of kernel panics on
FreeBSD 8.2-RELEASE across a large number of machines here.

The base stack reported is a double fault with no additional details,
and CTRL+ALT+ESC fails to break to the debugger, as does an NMI, even
though it at least tries, printing the following many times, some
quite jumbled:-

NMI ... going to debugger

We've configured the dump device, but that also seems to fail to
capture any details, just sitting there after the panic with:

Dumping 4465MB:

The machines are single-disk ZFS root installs and the dump device is
configured using the gptid. Could this be what's preventing the dump
from happening?

The kernel is compiled with:-

options KDB                     # Kernel debugger related code
options KDB_TRACE               # Print a stack trace for a panic

We have remote KVM but not remote serial on most of the machines.

Any advice on how to debug this issue?

    Regards
    Steve
On Wed, Aug 10, 2011 at 03:22:52PM +0100, Steven Hartland wrote:
> The base stack reported is a double fault with no additional
> details, and CTRL+ALT+ESC fails to break to the debugger, as does
> an NMI, even though it at least tries, printing the following many
> times, some quite jumbled:-
> NMI ... going to debugger

You may be interested in these system tunables (not sysctls). These
come from sys/amd64/amd64/trap.c (i386 has the same):

machdep.kdb_on_nmi (defaults to 1)
machdep.panic_on_nmi (defaults to 1)

If what you're seeing is a hardware NMI that fires, followed by the
machine panicking, the above tunables are probably what's doing that.

A hardware NMI could indicate an actual hardware issue of sorts,
depending on how the motherboard vendor implemented what they did.
For example, on a series of mainboards we have at my workplace, the
BIOS can be configured to generate either an NMI or SMI# when
different kinds of ECC RAM errors happen (either single-bit or
multi-bit parity errors). I don't know if that's what you're seeing.

If you're generating the NMI yourself (possibly via the KVM, etc.)
then okay, that's different. I'm trying to discern whether *you're*
generating the NMI, or whether the NMI just happens and causes a
panic for you, and that's what you're worried about.

Now to discuss the "jumbled console output": the interspersing of
kernel text output has plagued FreeBSD for a very long time (since
approximately 6.x). There have been statements from kernel coders
that you can decrease the likelihood of it happening by increasing
the PRINTF_BUFR_SIZE (not a typo) option in your kernel
configuration. The issue is exacerbated by use of SMP (either
multi-core or multi-CPU). The default (assuming your kernel configs
are based off of GENERIC within the past 4-5 years) is 128. However,
the same developers stated that they have great reservations over
increasing this number dramatically (meaning, something like 256
will probably work, but larger "may have repercussions which are
unknown at this time").

I have stated publicly then, and will do so again now, that this
option does not solve the problem. I acknowledge it may make it
"less likely to happen" or may decrease the amount of interspersed
output, but in my experience neither of those prove true; and more
importantly, said option does not solve the problem. I've talked
on-list with John Baldwin about this problem in the past, who had
some pretty good ideas of how to solve it.

I should point out that Solaris 10 and OpenSolaris (not sure about
present-day releases) both have this problem as well, especially
during kernel panics or MCEs. Linux addressed this issue by
implementing a ring-based cyclic buffer for its kernel messages
(syslog/klogd), and the model is extremely well-documented (quite
clever too):

http://www.mjmwired.net/kernel/Documentation/trace/ring-buffer-design.txt

I'm still surprised not a single GSoC project has attempted to solve
this for FreeBSD. It really is a serious matter, as it makes getting
kernel backtraces and crash data a serious pain in the butt. It can
also impact real-time debugging. These are the *worst* times to have
to tolerate something like this. I can point you to old threads about
this, and my old FreeBSD wiki page ("Commonly reported issues")
touches on this as well.
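[Editor's note: for concreteness, here is roughly what setting these
knobs looks like. The machdep values shown are just the defaults made
explicit, and 256 is only the modest PRINTF_BUFR_SIZE increase
mentioned above, not a recommendation.]

# /boot/loader.conf -- these are loader tunables, not runtime sysctls
machdep.kdb_on_nmi="1"      # try to enter the debugger when an NMI arrives
machdep.panic_on_nmi="1"    # otherwise panic on NMI

# kernel configuration file (e.g. sys/amd64/conf/MULTIPLAY),
# followed by a kernel rebuild:
options         PRINTF_BUFR_SIZE=256    # buffer printf output (default 128)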
The point I want to get across is that PRINTF_BUFR_SIZE does not
solve the problem.

> We've configured the dump device, but that also seems to fail to
> capture any details, just sitting there after the panic with:
> Dumping 4465MB:
>
> The machines are single-disk ZFS root installs and the dump device
> is configured using the gptid. Could this be what's preventing the
> dump from happening?

I can tell you that others have reported this problem, where the
kernel panic/dump begins but locks up either after showing the first
progress metre/amount or during the dumping itself.

I give everyone the same advice: please make sure that you have a
swap partition that's large enough to fit your entire memory contents
(preferably a swap that's 2x or 1.5x the amount of physical RAM), and
please make sure it's on a dedicated slice (e.g. ada0s1b). I do not
advise any sort of "abstraction" layer between swap and the rest of
the system. It might seem like a great/fun/awesome idea followed by
"whatever jdc, it works!", but when a crash happens -- which is when
you need it most -- and it doesn't work, I won't sympathise. :-)

As for the GPT aspects of things: I'm familiar with GPT as a
technology, but not with its actual usability in practice.

> The kernel is compiled with:-
> options KDB                     # Kernel debugger related code
> options KDB_TRACE               # Print a stack trace for a panic
>
> We have remote KVM but not remote serial on most of the machines.

As long as the remote KVM provides actual VGA-level redirection, that
is sufficient (though it makes copy-pasting output basically
impossible). We use serial console and tend to use these options; the
DDB and GDB options may be helpful for you, but not if the system is
behaving the way you describe.

# Debugging options
options         BREAK_TO_DEBUGGER       # Sending a serial BREAK drops to DDB
options         ALT_BREAK_TO_DEBUGGER   # Permit <CR>~<Ctrl-b> to drop to DDB
options         KDB                     # Enable kernel debugger support
options         KDB_TRACE               # Print stack trace automatically on panic
options         DDB                     # Support DDB
options         GDB                     # Support remote GDB

In combination with this, we use the following in /etc/rc.conf (the
dumpdev line is important, else savecore won't pick up anything):

dumpdev="auto"
ddb_enable="yes"

But we do not use any ddb scripts; I keep it in there Just In Case.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
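[Editor's note: acting on the swap/dump advice above might look
roughly like this. This is a sketch; ada0s1b is illustrative, and on
a gptid-labelled ZFS-root box the partition names will differ.]

# inspect the partitioning and any gptid labels in play
gpart show ada0
glabel status

# confirm swap is at least as large as physical RAM
swapinfo

# point the dump device directly at the raw swap partition
dumpon /dev/ada0s1b

# persist it across reboots in /etc/rc.conf:
dumpdev="/dev/ada0s1b"   # or dumpdev="auto" to use the first swap device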
on 10/08/2011 17:22 Steven Hartland said the following:
> The kernel is compiled with:-
> options KDB                     # Kernel debugger related code
> options KDB_TRACE               # Print a stack trace for a panic

You also have to provide an actual debugger backend, such as the
built-in DDB or a stub for remote GDB, to get online debugging. No
guarantees that this will help you get the debugging information, but
without it the chances are even slimmer.

You may also try this patch and see if it provides any improvements
for the post-panic environment (dumping etc.):

http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff

It might also be a good idea to at least capture a screenshot of
whatever information you get on the console when the panic happens.

-- 
Andriy Gapon
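[Editor's note: fetching, applying and installing such a patch on an
8.x source tree would go roughly as follows. This is a sketch; the
patch -p level depends on how the diff was generated, and MULTIPLAY
is the kernel config name visible in the dmesg later in the thread.]

cd /usr/src
fetch http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff
patch -p0 < stop_scheduler_on_panic.8.x.diff   # adjust -p if paths don't match
make buildkernel KERNCONF=MULTIPLAY
make installkernel KERNCONF=MULTIPLAY
shutdown -r now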
----- Original Message ----- 
From: "Steven Hartland" <killing@multiplay.co.uk>
To: <freebsd-stable@freebsd.org>
Sent: Wednesday, August 10, 2011 3:22 PM
Subject: debugging frequent kernel panics on 8.2-RELEASE

> We're currently experiencing a large number of kernel panics
> on FreeBSD 8.2-RELEASE across a large number of machines here.
>
> The base stack reported is a double fault with no additional
> details, and CTRL+ALT+ESC fails to break to the debugger, as does
> an NMI, even though it at least tries, printing the following many
> times, some quite jumbled:-
> NMI ... going to debugger
>
> We've configured the dump device, but that also seems to fail to
> capture any details, just sitting there after the panic with:
> Dumping 4465MB:
>
> The machines are single-disk ZFS root installs and the dump device
> is configured using the gptid. Could this be what's preventing the
> dump from happening?
>
> The kernel is compiled with:-
> options KDB                     # Kernel debugger related code
> options KDB_TRACE               # Print a stack trace for a panic
>
> We have remote KVM but not remote serial on most of the machines.
>
> Any advice on how to debug this issue?

ldn32.multiplay.co.uk dumped core - see /var/crash/vmcore.0

Wed Aug 10 14:02:07 UTC 2011

FreeBSD crash 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Jul 21 11:05:52 BST 2011
    root@crash:/usr/obj/usr/src/sys/MULTIPLAY  amd64

panic: double fault

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:

Fatal double fault
rip = 0xffffffff8052f6f1
rsp = 0xffffff86ce600fb0
rbp = 0xffffff86ce601210
cpuid = 0; apic id = 00
panic: double fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff803af91e at kdb_backtrace+0x5e
#1 0xffffffff8037d817 at panic+0x187
#2 0xffffffff80574316 at dblfault_handler+0x96
#3 0xffffffff8055d06d at Xdblfault+0xad
Uptime: 13d20h53m31s
Physical memory: 24555 MB
Dumping 3283 MB: 3268 3252 3236 3220 3204 3188 3172 3156 3140 3124 3108 3092
3076 3060 3044 3028 3012 2996 2980 2964 2948 2932 2916 2900 2884 2868 2852
2836 2820 2804 2788 2772 2756 2740 2724 2708 2692 2676 2660 2644 2628 2612
2596 2580 2564 2548 2532 2516 2500 2484 2468 2452 2436 2420 2404 2388 2372
2356 2340 2324 2308 2292 2276 2260 2244 2228 2212 2196 2180 2164 2148 2132
2116 2100 2084 2068 2052 2036 2020 2004 1988 1972 1956 1940 1924 1908 1892
1876 1860 1844 1828 1812 1796 1780 1764 1748 1732 1716 1700 1684 1668 1652
1636 1620 1604 1588 1572 1556 1540 1524 1508 1492 1476 1460 1444 1428 1412
1396 1380 1364 1348 1332 1316 1300 1284 1268 1252 1236 1220 1204 1188 1172
1156 1140 1124 1108 1092 1076 1060 1044 1028 1012 996 980 964 948 932 916
900 884 868 852 836 820 804 788 772 756 740 724 708 692 676 660 644 628 612
596 580 564 548 532 516 500 484 468 452 436 420 404 388 372 356 340 324 308
292 276 260 244 228 212 196 180 164 148 132 116 100 84 68 52 36 20 4

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from
/boot/kernel/zfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from
/boot/kernel/opensolaris.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/linprocfs.ko...Reading symbols from
/boot/kernel/linprocfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/linprocfs.ko
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from
/boot/kernel/nullfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/nullfs.ko

One of the machines has managed to dump where all the others have
failed to do so. Here's the stack from core.txt.0:

#0  sched_switch (td=0xffffffff80830bc0, newtd=0xffffff000a73f8c0,
    flags=Variable "flags" is not available.)
    at /usr/src/sys/kern/sched_ule.c:1858
1858            cpuid = PCPU_GET(cpuid);
(kgdb) #0  sched_switch (td=0xffffffff80830bc0, newtd=0xffffff000a73f8c0,
    flags=Variable "flags" is not available.)
    at /usr/src/sys/kern/sched_ule.c:1858
#1  0xffffffff80385c86 in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:449
#2  0xffffffff803b92d2 in sleepq_timedwait (wchan=0xffffffff80830760, pri=68)
    at /usr/src/sys/kern/subr_sleepqueue.c:644
#3  0xffffffff803861e1 in _sleep (ident=0xffffffff80830760, lock=0x0,
    priority=Variable "priority" is not available.)
    at /usr/src/sys/kern/kern_synch.c:230
#4  0xffffffff80532c29 in scheduler (dummy=Variable "dummy" is not available.)
    at /usr/src/sys/vm/vm_glue.c:807
#5  0xffffffff80335d67 in mi_startup () at /usr/src/sys/kern/init_main.c:254
#6  0xffffffff8016efac in btext () at /usr/src/sys/amd64/amd64/locore.S:81
#7  0xffffffff808556e0 in sleepq_chains ()
#8  0xffffffff8083b1e0 in cpu_top ()
#9  0x0000000000000000 in ?? ()
#10 0xffffffff80830bc0 in proc0 ()
#11 0xffffffff80ba4b90 in ?? ()
#12 0xffffffff80ba4b38 in ?? ()
#13 0xffffff000a73f8c0 in ?? ()
#14 0xffffffff803a2cc9 in sched_switch (td=0x0, newtd=0x0,
    flags=Variable "flags" is not available.)
    at /usr/src/sys/kern/sched_ule.c:1852
Previous frame inner to this frame (corrupt stack?)
(kgdb)

Not sure this really points to the cause, but we have the crash dump,
so we can do more digging if someone would point me in the correct
direction.

    Regards
    Steve
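[Editor's note: digging further into a dump like this is typically
done with a kgdb session along these lines. This is a sketch; it
assumes the kernel was built with debug symbols, and the thread and
frame numbers are purely illustrative.]

kgdb /usr/obj/usr/src/sys/MULTIPLAY/kernel.debug /var/crash/vmcore.0

(kgdb) bt                 # backtrace of the thread that panicked
(kgdb) info threads       # list all kernel threads in the dump
(kgdb) thread 12          # switch to another thread (illustrative number)
(kgdb) frame 3            # select a frame from the backtrace
(kgdb) list               # show source around the selected frame
(kgdb) info locals        # inspect local variables in that frame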
on 18/08/2011 02:15 Steven Hartland said the following:
> ----- Original Message ----- From: "Andriy Gapon" <avg@FreeBSD.org>
>
>> Thanks to the debug that Steven provided and to the help that I
>> received from Kostik, I think that now I understand the basic
>> mechanics of this panic, but, unfortunately, not the details of
>> its root cause.
>>
>> It seems like everything starts with some kind of a race between
>> terminating processes in a jail and termination of the jail
>> itself. This is where the details are very thin so far. What we
>> see is that a process (http) is in the exit(2) syscall, in the
>> exit1() function actually, and past the place where the P_WEXIT
>> flag is set, and even past the place where p_limit is freed and
>> reset to NULL. At that place the thread calls prison_proc_free(),
>> which calls prison_deref(). Then, we see that in prison_deref()
>> the thread gets a page fault because of what seems like a NULL
>> pointer dereference. That's just the start of the problem and its
>> root cause.
>
> That's interesting: are you using http as an example, or is that
> something that's been gleaned from the debugging of our output? I
> ask as there's only one process running in each of our jails, and
> that's a single java process.

It's from the debug data: p_comm = "httpd"

I also would like to ask you to revert the last patch that I sent you
(with the tf_rip comparisons) and try the patch from Kostik instead.
Given what we suspect about the problem, could you please also try to
provoke the problem, e.g. by doing frequent jail restarts or
something else that supposedly should hit the bug.

-- 
Andriy Gapon
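[Editor's note: a crude way to provoke the suspected race, as
suggested above, might be a tight restart loop. A sketch in sh, where
"jail_restart svc1" is a hypothetical stand-in for whatever command
the site's custom jail manager actually uses.]

#!/bin/sh
# Hammer one jail with restarts to try to trigger the shutdown/attach race.
# jail_restart and svc1 are hypothetical; substitute your jail manager's
# restart command and service name.
i=0
while [ "$i" -lt 1000 ]; do
    jail_restart svc1
    i=$((i + 1))
done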
on 18/08/2011 02:15 Steven Hartland said the following:
> In a nutshell, the jail manager we're using will attempt to
> resurrect the jail from a dying state in a few specific scenarios.
>
> Here's an example:
> 1. A jail restart is requested.
> 2. The jail is stopped, so the java process is killed off, but
>    active tcp sessions may prevent the timely full shutdown of the
>    jail.
> 3. If an existing jail is detected, i.e. a dying jail from #2,
>    instead of starting a new jail we attach to the old one and exec
>    the new java process.
> 4. If an existing jail isn't detected, i.e. where there were no
>    hanging tcp sessions and #2 cleanly shut down the jail, a new
>    jail is created and attached to, and the java process is
>    exec'ed.
>
> The system uses static jailids, so it's possible to determine
> whether an existing jail for this "service" exists or not. This
> prevents duplicate services as well as making services easy to
> identify by their jailid.
>
> So what we could be seeing is a race between the jail shutdown and
> the attach of the new process?

Not a jail expert at all, but a few suggestions...

First, wouldn't the 'persist' jail option simplify your life a little
bit?

Second, you may want to try to monitor the value of the
prison0.pr_uref variable (e.g. via kgdb) while executing the various
scenarios of what you do now. If after finishing a certain scenario
you end up with a value lower than at the start of the scenario, then
that scenario is the troublesome one.

Please note that prison0.pr_uref is composed of the number of
non-jailed processes plus the number of top-level jails, so take this
into account when comparing prison0.pr_uref values - it's better to
record the initial value when no jails are started, and it's
important to keep the number of non-jailed processes the same (or to
account for its changes).

-- 
Andriy Gapon
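[Editor's note: putting both suggestions into concrete form, a
minimal sketch. The jail name, jid and path are illustrative, and
kgdb against /dev/mem requires debug symbols matching the running
kernel.]

# 1. Create the jail with the 'persist' parameter so it survives its
#    last process exiting instead of lingering in the dying state:
jail -c jid=101 name=svc1 path=/jails/svc1 persist

# 2. Watch prison0.pr_uref on the live kernel:
kgdb /boot/kernel/kernel /dev/mem

(kgdb) print prison0.pr_uref    # baseline, with no jails running
# ... run one jail stop/start scenario, then re-check:
(kgdb) print prison0.pr_uref    # a net drop marks the leaking scenario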