Mark Johnston
2017-Jul-20 00:03 UTC
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
On Thu, Jul 20, 2017 at 01:46:33AM +0200, Mark Martinec wrote:
> More news on the matter. As reported yesterday the locally built
> kernel with options INVARIANTS and DDB works fine and somehow avoids
> the trouble at attaching the da (mps) disks on an LSI controller, so
> today I wanted to get back to a reproducible hang - and sure enough,
> reverting to the generic kernel as distributed brings back the hang.
>
> So I tried rebuilding the kernel while experimenting with options
> like DDB and INVARIANTS.
>
> A locally built GENERIC kernel behaves the same as the original
> kernel from the distribution (as installed by freebsd-upgrade),
> so no surprises there. It hangs trying to attach the first of the
> da disks (after first successfully attaching all the ada disks).
> The alt ctrl esc is unable to enter debugger when the hang occurs
> (possibly due to an unresponsive USB keyboard at that time),
> even though the debug.kdb.break_to_debugger was set to 1 at a
> loader prompt. It needs loader "Safe mode" to be able to boot.
>
> Next, a locally built kernel with DDB and INVARIANTS works well
> (the remaining options come from an included GENERIC).
>
> Now the funny part: a locally built kernel with just the DDB
> option (and the rest included from GENERIC) *also* works well.
> Somehow the DDB option makes a difference, even though kernel
> debugger is never activated.

One thing to try at this point would be to disable EARLY_AP_STARTUP in
the kernel config. That is, take a configuration with which you're able
to reproduce the hang during boot, and remove "options
EARLY_AP_STARTUP". This feature has a fairly large impact on the bootup
process and has had a few problems that manifested as hangs during boot.
There was at least one other case where an innocuous change to the
kernel configuration "fixed" the problem by introducing some
second-order effect (causing kernel threads to be scheduled in a
different order, for instance).

Regardless of whether the suggestion above makes a difference, it would
be helpful to see verbose dmesgs from both a clean boot and a boot that
hangs. If disabling EARLY_AP_STARTUP helps, then we can try adding some
assertions that will cause the system to panic when the hang occurs,
making it easier to see what's going on.

>
> To re-assert: at the time of a hang the CPU fan starts revving up,
> and the USB keyboard is unresponsive (<scroll> does not enter scroll
> mode, caps lock and num lock do not toggle their LED indicators,
> alt ctrl esc do not activate kernel debugger. Loader "Safe mode"
> avoids the problem (presumably by disabling SMP).
>
> Meanwhile I have successfully upgraded two other similar
> hosts from 11.0 to 11.1-RC3, no surprises there (but they do not
> have the same disk controller).
>
> Not sure what to try next.
>
> Mark
Mark Martinec
2017-Jul-20 13:45 UTC
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
2017-07-20 02:03, Mark Johnston wrote:
> One thing to try at this point would be to disable EARLY_AP_STARTUP in
> the kernel config. That is, take a configuration with which you're able
> to reproduce the hang during boot, and remove "options
> EARLY_AP_STARTUP".

Done. And it avoids the problem altogether! Thanks.
Tried a reboot several times and it succeeds every time.

Here is all that I had in a config file for building a kernel,
i.e. I took away the 'options DDB' which also seemingly avoided
the problem:

  include GENERIC
  ident NELI
  nooptions EARLY_AP_STARTUP

> This feature has a fairly large impact on the bootup process and has
> had a few problems that manifested as hangs during boot. There was at
> least one other case where an innocuous change to the kernel
> configuration "fixed" the problem by introducing some second-order
> effect (causing kernel threads to be scheduled in a different
> order, for instance).

> Regardless of whether the suggestion above makes a difference, it would
> be helpful to see verbose dmesgs from both a clean boot and a boot that
> hangs. If disabling EARLY_AP_STARTUP helps, then we can try adding some
> assertions that will cause the system to panic when the hang occurs,
> making it easier to see what's going on.

Hmmm. I have now saved a couple of versions of /var/run/dmesg.boot
(in boot_verbose mode) when EARLY_AP_STARTUP is disabled and the
boot is successful. However, I don't know how to capture such log
when booting hangs, as I have no serial interface and the boot
never completes. All I have is a screen photo of the last state
when a hang occurs (showing ada disks successfully attached,
followed immediately by the attempt to attach a da disk, which hangs).

Mark