Mark Martinec
2017-Jul-20  13:45 UTC
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
2017-07-20 02:03, Mark Johnston wrote:
> One thing to try at this point would be to disable EARLY_AP_STARTUP in
> the kernel config. That is, take a configuration with which you're able
> to reproduce the hang during boot, and remove "options
> EARLY_AP_STARTUP".

Done. And it avoids the problem altogether! Thanks.
Tried a reboot several times and it succeeds every time.

Here is all that I had in a config file for building a kernel,
i.e. I took away the 'options DDB' which also seemingly avoided
the problem:

  include GENERIC
  ident NELI
  nooptions EARLY_AP_STARTUP

> This feature has a fairly large impact on the bootup process and has
> had a few problems that manifested as hangs during boot. There was at
> least one other case where an innocuous change to the kernel
> configuration "fixed" the problem by introducing some second-order
> effect (causing kernel threads to be scheduled in a different
> order, for instance).

> Regardless of whether the suggestion above makes a difference, it would
> be helpful to see verbose dmesgs from both a clean boot and a boot that
> hangs. If disabling EARLY_AP_STARTUP helps, then we can try adding some
> assertions that will cause the system to panic when the hang occurs,
> making it easier to see what's going on.

Hmmm.
I have now saved a couple of versions of /var/run/dmesg.boot
(in boot_verbose mode) when EARLY_AP_STARTUP is disabled and
the boot is successful. However, I don't know how to capture
such log when booting hangs, as I have no serial interface
and the boot never completes. All I have is a screen photo
of the last state when a hang occurs (showing ada disks
successfully attached, followed immediately by the attempt
to attach a da disk, which hangs).

  Mark
Mark Johnston
2017-Jul-24  02:15 UTC
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
On Thu, Jul 20, 2017 at 03:45:39PM +0200, Mark Martinec wrote:
> 2017-07-20 02:03, Mark Johnston wrote:
> > One thing to try at this point would be to disable EARLY_AP_STARTUP in
> > the kernel config. That is, take a configuration with which you're able
> > to reproduce the hang during boot, and remove "options
> > EARLY_AP_STARTUP".
>
> Done. And it avoids the problem altogether! Thanks.
> Tried a reboot several times and it succeeds every time.

Thanks. Sorry for the delayed follow-up.

>
> Here is all that I had in a config file for building a kernel,
> i.e. I took away the 'options DDB' which also seemingly avoided
> the problem:
>   include GENERIC
>   ident NELI
>   nooptions EARLY_AP_STARTUP

Could you try re-enabling EARLY_AP_STARTUP, applying the patch at the
end of this email, and see if the message "sleeping before eventtimer
init" appears in the boot output? If it does, it'll be followed by a
backtrace that might be useful for tracking down the hang. It might
produce false positives, but we'll see.

>
> > This feature has a fairly large impact on the bootup process and has
> > had a few problems that manifested as hangs during boot. There was at
> > least one other case where an innocuous change to the kernel
> > configuration "fixed" the problem by introducing some second-order
> > effect (causing kernel threads to be scheduled in a different
> > order, for instance).
>
> > Regardless of whether the suggestion above makes a difference, it would
> > be helpful to see verbose dmesgs from both a clean boot and a boot that
> > hangs. If disabling EARLY_AP_STARTUP helps, then we can try adding some
> > assertions that will cause the system to panic when the hang occurs,
> > making it easier to see what's going on.
>
> Hmmm.
> I have now saved a couple of versions of /var/run/dmesg.boot
> (in boot_verbose mode) when EARLY_AP_STARTUP is disabled and
> the boot is successful. However, I don't know how to capture
> such log when booting hangs, as I have no serial interface
> and the boot never completes. All I have is a screen photo
> of the last state when a hang occurs (showing ada disks
> successfully attached, followed immediately by the attempt
> to attach a da disk, which hangs).

Ok, let's not worry about this for now.

Index: sys/kern/kern_clock.c
===================================================================
--- sys/kern/kern_clock.c	(revision 321401)
+++ sys/kern/kern_clock.c	(working copy)
@@ -385,6 +385,8 @@
 static int devpoll_run = 0;
 #endif
 
+bool inited_clocks = false;
+
 /*
  * Initialize clock frequencies and start both clocks running.
  */
@@ -412,6 +414,8 @@
 #ifdef SW_WATCHDOG
 	EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
 #endif
+
+	inited_clocks = true;
 }
 
 /*
Index: sys/kern/kern_synch.c
===================================================================
--- sys/kern/kern_synch.c	(revision 321401)
+++ sys/kern/kern_synch.c	(working copy)
@@ -298,6 +298,8 @@
 	return (rval);
 }
 
+extern bool inited_clocks;
+
 /*
  * pause() delays the calling thread by the given number of system ticks.
  * During cold bootup, pause() uses the DELAY() function instead of
@@ -330,6 +332,10 @@
 		DELAY(sbt);
 		return (0);
 	}
+	if (cold && !inited_clocks) {
+		printf("%s: sleeping before eventtimer init\n", curthread->td_name);
+		kdb_backtrace();
+	}
 	return (_sleep(&pause_wchan[curcpu], NULL, 0, wmesg, sbt, pr, flags));
 }
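
For readers following the thread, the idea behind the diagnostic patch can be sketched in userland C. This is only an illustration of the check, not kernel code: every name prefixed with sim_ is hypothetical and stands in for the kernel's "cold" flag, the patch's inited_clocks flag, and pause(). It assumes nothing about the rest of pause() beyond what the diff shows.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; not the real FreeBSD symbols. */
static bool sim_cold = true;            /* models the kernel's "cold" flag */
static bool sim_inited_clocks = false;  /* models the patch's inited_clocks */
static int sim_early_sleeps = 0;        /* number of early sleeps reported */

/* Models the end of clock initialization, where the patch sets its flag. */
void
sim_initclocks(void)
{
	sim_inited_clocks = true;
}

/* Models the point at which the system stops being "cold". */
void
sim_boot_done(void)
{
	sim_cold = false;
}

/* Returns how many early sleeps have been reported so far. */
int
sim_early_sleep_count(void)
{
	return (sim_early_sleeps);
}

/*
 * Models the check the patch adds to pause(): if a thread asks to sleep
 * while the system is still cold and the clocks have not been started,
 * report it. Returns true when a report was made.
 */
bool
sim_pause(const char *thread_name)
{
	if (sim_cold && !sim_inited_clocks) {
		printf("%s: sleeping before eventtimer init\n", thread_name);
		sim_early_sleeps++;
		return (true);
	}
	return (false);
}
```

Calling sim_pause() before sim_initclocks() produces the report, and calling it afterwards does not. The real patch additionally calls kdb_backtrace() so that the caller of pause() can be identified; the sketch only prints the message and counts the event.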
Mark Martinec
2018-May-24  17:12 UTC
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
Just a short report to a thread I started when 11.1 came out.

This machine would stall in a busy loop while attaching disks
during boot. Rebuilding a kernel with EARLY_AP_STARTUP disabled
avoided the problem. This was a situation through the whole
11.1 life cycle (i.e. patch releases did not help).

Today I have upgraded this host to 11.2-BETA2, and it is
no longer necessary to disable EARLY_AP_STARTUP.

Good, thanks!

  Mark

> 2017-07-20 02:03, Mark Johnston wrote:
>> One thing to try at this point would be to disable EARLY_AP_STARTUP in
>> the kernel config. That is, take a configuration with which you're able
>> to reproduce the hang during boot, and remove "options
>> EARLY_AP_STARTUP".
>
> 2017-07-20 15:45, Mark Martinec wrote:
> Done. And it avoids the problem altogether! Thanks.
> Tried a reboot several times and it succeeds every time.
>
> Here is all that I had in a config file for building a kernel,
> i.e. I took away the 'options DDB' which also seemingly avoided
> the problem:
>   include GENERIC
>   ident NELI
>   nooptions EARLY_AP_STARTUP
>
>> This feature has a fairly large impact on the bootup process and has
>> had a few problems that manifested as hangs during boot. There was at
>> least one other case where an innocuous change to the kernel
>> configuration "fixed" the problem by introducing some second-order
>> effect (causing kernel threads to be scheduled in a different
>> order, for instance).
[...]