Mark Johnston
2017-Jul-24 01:44 UTC
stable/11 debugging kernel unable to produce crashdump again
On Sun, Jul 23, 2017 at 04:26:45PM +0700, Eugene Grosbein wrote:> On 14.01.2017 18:40, Eugene Grosbein wrote: > > > >> I suspect that this is because we only stop the scheduler upon a panic > >> if SMP is configured. Can you retest with the patch below applied? > >> > >> Index: sys/kern/kern_shutdown.c > >> ==================================================================> >> --- sys/kern/kern_shutdown.c (revision 312082) > >> +++ sys/kern/kern_shutdown.c (working copy) > >> @@ -713,6 +713,7 @@ > >> CPU_CLR(PCPU_GET(cpuid), &other_cpus); > >> stop_cpus_hard(other_cpus); > >> } > >> +#endif > >> > >> /* > >> * Ensure that the scheduler is stopped while panicking, even if panic > >> @@ -719,7 +720,6 @@ > >> * has been entered from kdb. > >> */ > >> td->td_stopsched = 1; > >> -#endif > >> > >> bootopt = RB_AUTOBOOT; > >> newpanic = 0; > >> > >> > > > > Indeed, my router is uniprocessor system and your patch really solves the problem. > > Now kernel generates crashdump just fine in case of panic. Please commit the fix, thanks! > > Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump:Is this amd64 GENERIC, or something else?> > - "call doadump" from DDB prompt works just fine; > - "shutdown -r now" reboots the system without problems; > - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs just afer showing uptime > instead of continuing with crashdump generation; same if "real" panic occurs. > > Same for debug.minidump set to 1 or 0. How do I debug this?I'm not able to reproduce the problem in bhyve using r321401. Looking at the code, the culprits might be cngrab(), or one of the shutdown_post_sync eventhandlers. Since you're apparently able to see the console output at the time of the panic, I guess it's probably the latter. Could you try your test with the patch below applied? It'll print a bunch of "entering post_sync"/"leaving post_sync" messages with addresses that can be resolved using kgdb. That'll help determine where we're getting stuck. Index: sys/sys/eventhandler.h ==================================================================--- sys/sys/eventhandler.h (revision 321401) +++ sys/sys/eventhandler.h (working copy) @@ -85,7 +85,11 @@ _t = (struct eventhandler_entry_ ## name *)_ep; \ CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \ (void *)_t->eh_func); \ + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \ + printf("entering post_sync %p\n", (void *)_t->eh_func); \ _t->eh_func(_ep->ee_arg , ## __VA_ARGS__); \ + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \ + printf("leaving post_sync %p\n", (void *)_t->eh_func); \ EHL_LOCK((list)); \ } \ } \
Eugene Grosbein
2017-Jul-24 07:11 UTC
stable/11 debugging kernel unable to produce crashdump again
On 24.07.2017 08:44, Mark Johnston wrote:>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump: > > Is this amd64 GENERIC, or something else?Custom kernel, amd64.> >> >> - "call doadump" from DDB prompt works just fine; >> - "shutdown -r now" reboots the system without problems; >> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs just afer showing uptime >> instead of continuing with crashdump generation; same if "real" panic occurs. >> >> Same for debug.minidump set to 1 or 0. How do I debug this? > > I'm not able to reproduce the problem in bhyve using r321401. Looking > at the code, the culprits might be cngrab(), or one of the > shutdown_post_sync eventhandlers. Since you're apparently able to see > the console output at the time of the panic, I guess it's probably the > latter. Could you try your test with the patch below applied? It'll > print a bunch of "entering post_sync"/"leaving post_sync" messages with > addresses that can be resolved using kgdb. That'll help determine where > we're getting stuck. > > Index: sys/sys/eventhandler.h > ==================================================================> --- sys/sys/eventhandler.h (revision 321401) > +++ sys/sys/eventhandler.h (working copy) > @@ -85,7 +85,11 @@ > _t = (struct eventhandler_entry_ ## name *)_ep; \ > CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \ > (void *)_t->eh_func); \ > + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \ > + printf("entering post_sync %p\n", (void *)_t->eh_func); \ > _t->eh_func(_ep->ee_arg , ## __VA_ARGS__); \ > + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \ > + printf("leaving post_sync %p\n", (void *)_t->eh_func); \ > EHL_LOCK((list)); \ > } \ > } \ >I'l try this evening.
Eugene Grosbein
2017-Jul-24 17:03 UTC
stable/11 debugging kernel unable to produce crashdump again
CCing mav@ as graid expert. On 24.07.2017 08:44, Mark Johnston wrote:>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump: >> >> - "call doadump" from DDB prompt works just fine; >> - "shutdown -r now" reboots the system without problems; >> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs just afer showing uptime >> instead of continuing with crashdump generation; same if "real" panic occurs. >> >> Same for debug.minidump set to 1 or 0. How do I debug this? > > I'm not able to reproduce the problem in bhyve using r321401. Looking > at the code, the culprits might be cngrab(), or one of the > shutdown_post_sync eventhandlers. Since you're apparently able to see > the console output at the time of the panic, I guess it's probably the > latter. Could you try your test with the patch below applied? It'll > print a bunch of "entering post_sync"/"leaving post_sync" messages with > addresses that can be resolved using kgdb. That'll help determine where > we're getting stuck. > > Index: sys/sys/eventhandler.h > ==================================================================> --- sys/sys/eventhandler.h (revision 321401) > +++ sys/sys/eventhandler.h (working copy) > @@ -85,7 +85,11 @@ > _t = (struct eventhandler_entry_ ## name *)_ep; \ > CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \ > (void *)_t->eh_func); \ > + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \ > + printf("entering post_sync %p\n", (void *)_t->eh_func); \ > _t->eh_func(_ep->ee_arg , ## __VA_ARGS__); \ > + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \ > + printf("leaving post_sync %p\n", (void *)_t->eh_func); \ > EHL_LOCK((list)); \ > } \ > } \ >Thanks, this helped: $ addr2line -f -e kernel.debug 0xffffffff80919c00 g_raid_shutdown_post_sync /home/src/sys/geom/raid/g_raid.c:2458 That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just before crashdump generation but works just fine during normal system shutdown. I should note my graid's RAID1 is running in degraded state currently due to dead SSD module that does not respond. Here is part of boot log: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080) ahcich5: Poll timeout on slot 2 port 0 ahcich5: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 80 serr 00000000 cmd 0000c217 (aprobe2:ahcich5:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe2:ahcich5:0:0:0): CAM status: Command timeout (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config ahcich5: Poll timeout on slot 3 port 0 ahcich5: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd 80 serr 00000000 cmd 0000c317 (aprobe2:ahcich5:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe2:ahcich5:0:0:0): CAM status: Command timeout (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted [skip] Trying to mount root from ufs:/dev/raid/r0s4a [rw,noatime]... Root mount waiting for: GRAID-Intel Root mount waiting for: GRAID-Intel Root mount waiting for: GRAID-Intel Root mount waiting for: GRAID-Intel Root mount waiting for: GRAID-Intel GEOM_RAID: Intel-c291fe96: Force array start due to timeout. GEOM_RAID: Intel-c291fe96: Disk ada0 state changed from NONE to ACTIVE. GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from NONE to STALE. GEOM_RAID: Intel-c291fe96: Array started. GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from STALE to ACTIVE. GEOM_RAID: Intel-c291fe96: Volume r0 state changed from STARTING to DEGRADED. GEOM_RAID: Intel-c291fe96: Provider raid/r0 for volume r0 created.