thr3ads.net - freebsd stable - stable/11 debugging kernel unable to produce crashdump again [Jul 2017]

If this information is useful, please help other people find it:
Share via:

Mark Johnston

2017-Jul-24 01:44 UTC

stable/11 debugging kernel unable to produce crashdump again

On Sun, Jul 23, 2017 at 04:26:45PM +0700, Eugene Grosbein
wrote:> On 14.01.2017 18:40, Eugene Grosbein wrote:
> > 
> >> I suspect that this is because we only stop the scheduler upon a
panic
> >> if SMP is configured. Can you retest with the patch below applied?
> >>
> >> Index: sys/kern/kern_shutdown.c
> >>
==================================================================> >>
--- sys/kern/kern_shutdown.c	(revision 312082)
> >> +++ sys/kern/kern_shutdown.c	(working copy)
> >> @@ -713,6 +713,7 @@
> >>  		CPU_CLR(PCPU_GET(cpuid), &other_cpus);
> >>  		stop_cpus_hard(other_cpus);
> >>  	}
> >> +#endif
> >>  
> >>  	/*
> >>  	 * Ensure that the scheduler is stopped while panicking, even if
panic
> >> @@ -719,7 +720,6 @@
> >>  	 * has been entered from kdb.
> >>  	 */
> >>  	td->td_stopsched = 1;
> >> -#endif
> >>  
> >>  	bootopt = RB_AUTOBOOT;
> >>  	newpanic = 0;
> >>
> >>
> > 
> > Indeed, my router is uniprocessor system and your patch really solves
the problem.
> > Now kernel generates crashdump just fine in case of panic. Please
commit the fix, thanks!
> 
> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump:
Is this amd64 GENERIC, or something else?
> 
> - "call doadump" from DDB prompt works just fine;
> - "shutdown -r now" reboots the system without problems;
> - "sysctl debug.kdb.panic=1" triggers a panic just fine but
system hangs just afer showing uptime
> instead of continuing with crashdump generation; same if "real"
panic occurs.
> 
> Same for debug.minidump set to 1 or 0. How do I debug this?
I'm not able to reproduce the problem in bhyve using r321401. Looking
at the code, the culprits might be cngrab(), or one of the
shutdown_post_sync eventhandlers. Since you're apparently able to see
the console output at the time of the panic, I guess it's probably the
latter. Could you try your test with the patch below applied? It'll
print a bunch of "entering post_sync"/"leaving post_sync"
messages with
addresses that can be resolved using kgdb. That'll help determine where
we're getting stuck.

Index: sys/sys/eventhandler.h
==================================================================---
sys/sys/eventhandler.h	(revision 321401)
+++ sys/sys/eventhandler.h	(working copy)
@@ -85,7 +85,11 @@
 			_t = (struct eventhandler_entry_ ## name *)_ep;	\
 			CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
  			    (void *)_t->eh_func);			\
+			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
+				printf("entering post_sync %p\n", (void *)_t->eh_func); \
 			_t->eh_func(_ep->ee_arg , ## __VA_ARGS__);	\
+			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
+				printf("leaving post_sync %p\n", (void *)_t->eh_func); \
 			EHL_LOCK((list));				\
 		}							\
 	}								\

Eugene Grosbein

2017-Jul-24 07:11 UTC

head link

stable/11 debugging kernel unable to produce crashdump again

On 24.07.2017 08:44, Mark Johnston wrote:
>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing
crashdump:
> 
> Is this amd64 GENERIC, or something else?
Custom kernel, amd64.
> 
>>
>> - "call doadump" from DDB prompt works just fine;
>> - "shutdown -r now" reboots the system without problems;
>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but
system hangs just afer showing uptime
>> instead of continuing with crashdump generation; same if
"real" panic occurs.
>>
>> Same for debug.minidump set to 1 or 0. How do I debug this?
> 
> I'm not able to reproduce the problem in bhyve using r321401. Looking
> at the code, the culprits might be cngrab(), or one of the
> shutdown_post_sync eventhandlers. Since you're apparently able to see
> the console output at the time of the panic, I guess it's probably the
> latter. Could you try your test with the patch below applied? It'll
> print a bunch of "entering post_sync"/"leaving
post_sync" messages with
> addresses that can be resolved using kgdb. That'll help determine where
> we're getting stuck.
> 
> Index: sys/sys/eventhandler.h
> ==================================================================> ---
sys/sys/eventhandler.h	(revision 321401)
> +++ sys/sys/eventhandler.h	(working copy)
> @@ -85,7 +85,11 @@
>  			_t = (struct eventhandler_entry_ ## name *)_ep;	\
>  			CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
>   			    (void *)_t->eh_func);			\
> +			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> +				printf("entering post_sync %p\n", (void *)_t->eh_func); \
>  			_t->eh_func(_ep->ee_arg , ## __VA_ARGS__);	\
> +			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> +				printf("leaving post_sync %p\n", (void *)_t->eh_func); \
>  			EHL_LOCK((list));				\
>  		}							\
>  	}								\
> 
I'l try this evening.

Eugene Grosbein

2017-Jul-24 17:03 UTC

head link

stable/11 debugging kernel unable to produce crashdump again

CCing mav@ as graid expert.

On 24.07.2017 08:44, Mark Johnston wrote:
>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing
crashdump:
>>
>> - "call doadump" from DDB prompt works just fine;
>> - "shutdown -r now" reboots the system without problems;
>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but
system hangs just afer showing uptime
>> instead of continuing with crashdump generation; same if
"real" panic occurs.
>>
>> Same for debug.minidump set to 1 or 0. How do I debug this?
> 
> I'm not able to reproduce the problem in bhyve using r321401. Looking
> at the code, the culprits might be cngrab(), or one of the
> shutdown_post_sync eventhandlers. Since you're apparently able to see
> the console output at the time of the panic, I guess it's probably the
> latter. Could you try your test with the patch below applied? It'll
> print a bunch of "entering post_sync"/"leaving
post_sync" messages with
> addresses that can be resolved using kgdb. That'll help determine where
> we're getting stuck.
> 
> Index: sys/sys/eventhandler.h
> ==================================================================> ---
sys/sys/eventhandler.h	(revision 321401)
> +++ sys/sys/eventhandler.h	(working copy)
> @@ -85,7 +85,11 @@
>  			_t = (struct eventhandler_entry_ ## name *)_ep;	\
>  			CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
>   			    (void *)_t->eh_func);			\
> +			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> +				printf("entering post_sync %p\n", (void *)_t->eh_func); \
>  			_t->eh_func(_ep->ee_arg , ## __VA_ARGS__);	\
> +			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> +				printf("leaving post_sync %p\n", (void *)_t->eh_func); \
>  			EHL_LOCK((list));				\
>  		}							\
>  	}								\
> 
Thanks, this helped:

$ addr2line -f -e kernel.debug 0xffffffff80919c00
g_raid_shutdown_post_sync
/home/src/sys/geom/raid/g_raid.c:2458

That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just
before
crashdump generation but works just fine during normal system shutdown.

I should note my graid's RAID1 is running in degraded state currently
due to dead SSD module that does not respond. Here is part of boot log:

ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich5: Poll timeout on slot 2 port 0
ahcich5: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 80 serr 00000000
cmd 0000c217
(aprobe2:ahcich5:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00
00
(aprobe2:ahcich5:0:0:0): CAM status: Command timeout
(aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
ahcich5: Poll timeout on slot 3 port 0
ahcich5: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd 80 serr 00000000
cmd 0000c317
(aprobe2:ahcich5:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe2:ahcich5:0:0:0): CAM status: Command timeout
(aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
[skip]
Trying to mount root from ufs:/dev/raid/r0s4a [rw,noatime]...
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
GEOM_RAID: Intel-c291fe96: Force array start due to timeout.
GEOM_RAID: Intel-c291fe96: Disk ada0 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from NONE to STALE.
GEOM_RAID: Intel-c291fe96: Array started.
GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from STALE to ACTIVE.
GEOM_RAID: Intel-c291fe96: Volume r0 state changed from STARTING to DEGRADED.
GEOM_RAID: Intel-c291fe96: Provider raid/r0 for volume r0 created.

freebsd stable - Jul 2017 - stable/11 debugging kernel unable to produce crashdump again

stable/11 debugging kernel unable to produce crashdump again

stable/11 debugging kernel unable to produce crashdump again

stable/11 debugging kernel unable to produce crashdump again