Mark Martinec
2017-Jul-19 23:46 UTC
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
More news on the matter. As reported yesterday, the locally built
kernel with options INVARIANTS and DDB works fine and somehow avoids
the trouble at attaching the da (mps) disks on an LSI controller, so
today I wanted to get back to a reproducible hang - and sure enough,
reverting to the generic kernel as distributed brings back the hang.

So I tried rebuilding the kernel while experimenting with options
like DDB and INVARIANTS.

A locally built GENERIC kernel behaves the same as the original
kernel from the distribution (as installed by freebsd-update),
so no surprises there. It hangs trying to attach the first of the
da disks (after first successfully attaching all the ada disks).
Alt ctrl esc is unable to enter the debugger when the hang occurs
(possibly due to an unresponsive USB keyboard at that time),
even though debug.kdb.break_to_debugger was set to 1 at the
loader prompt. It needs loader "Safe mode" to be able to boot.

Next, a locally built kernel with DDB and INVARIANTS works well
(the remaining options come from an included GENERIC).

Now the funny part: a locally built kernel with just the DDB
option (and the rest included from GENERIC) *also* works well.
Somehow the DDB option makes a difference, even though the kernel
debugger is never activated.

To reiterate: at the time of a hang the CPU fan starts revving up,
and the USB keyboard is unresponsive (<scroll> does not enter scroll
mode, caps lock and num lock do not toggle their LED indicators,
alt ctrl esc does not activate the kernel debugger). Loader "Safe mode"
avoids the problem (presumably by disabling SMP).

Meanwhile I have successfully upgraded two other similar
hosts from 11.0 to 11.1-RC3, no surprises there (but they do not
have the same disk controller).

Not sure what to try next.

  Mark


2017-07-19 01:18, Mark Martinec wrote:
> 2017-07-18 01:24, Mark Johnston wrote:
>> Are you able to break into the debugger at this point?
>> Try setting
>> debug.kdb.break_to_debugger=1 and debug.kdb.alt_break_to_debugger=1 at
>> the loader prompt, and hit the break key, or the key sequence
>> <CR> ~ ctrl-b once the hang occurs. At the debugger prompt, try
>> "bt" and "show allpcpu" to start.
>
> Thank you for a prompt and good suggestion! I spent an afternoon
> fiddling with the machine, with mixed results. Your suggestion to
> break into debugger did not work, there was no reaction to <break>
> or to <CR> ~ ctrl-b.
>
> So I embarked on rebuilding the RC3 kernel with
>   options KDB
>   options DDB
>   options BREAK_TO_DEBUGGER
>   options ALT_BREAK_TO_DEBUGGER
>   options INVARIANTS
>   options INVARIANT_SUPPORT
>   options WITNESS
>   options WITNESS_SKIPSPIN
> but then I realized the <debug> key is mapped-to by: alt ctrl <esc>,
> which now does break into debugger - but not so early where the
> holdup occurs.
>
> The WITNESS produced some LOR warnings, but that is probably ok.
> I came across a trace just before the problem area, but it flows
> by so fast on a vt console and only the last 40 or so lines
> remain on the screen (I have a photo), which do not look like
> revealing much. Unfortunately this machine does not have a serial
> interface.
>
> So in my last attempt I rebuilt a kernel with INVARIANTS but
> without WITNESS - and now I cannot reproduce the problem, with
> or without a "safe mode". What is interesting here that now
> the da0..da3 disks are attached first, and only then the ada
> disks - and even within the group of disks on the same
> controller their order has been shuffled - no idea what could
> have caused it - and it may have avoided the problem by doing so.
>
> Will play some more with this tomorrow...
>
>   Mark
>
>
>> On Tue, Jul 18, 2017 at 01:01:16AM +0200, Mark Martinec wrote:
>>> Upgrading 11.0-RELEASE-p11 to 11.1-RC3 using the usual freebsd-update
>>> upgrade method I ended up with a system which gets stuck while
>>> trying to attach the second set of disks.
>>> This happened already after the first phase of
>>> the upgrade procedure (installing and re-booting with a new kernel).
>>>
>>> The first set of disks (ada0 .. ada2) are attached successfully, also a
>>> cd0, but then when the first of the set of four (a regular spinning disk)
>>> on an LSI controller is to be attached, the boot procedure just gets
>>> stuck there:
>>>
>>> kernel: ada1: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes)
>>> kernel: ada1: Command Queueing enabled
>>> kernel: ada1: 305245MB (625142448 512 byte sectors)
>>> kernel: ada2 at ahcich6 bus 0 scbus8 target 0 lun 0
>>> kernel: ada2: <OCZ-VERTEX3 2.25> ATA8-ACS SATA 3.x device
>>> kernel: ada2: Serial Number OCZ-O1L6RF591R09Z5C8
>>> kernel: ada2: 300.000MB/s transfers (SATA 2.x, PIO4, PIO 8192bytes)
>>> kernel: ada2: Command Queueing enabled
>>> kernel: ada2: 114473MB (234441648 512 byte sectors)
>>> kernel: ada2: quirks=0x1<4K>
>>> kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0
>>>
>>> (stuck here, keyboard not responding, fans raising their pitch,
>>> presumably CPU is spinning)
> [...]
Mark Johnston
2017-Jul-20 00:03 UTC
The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
On Thu, Jul 20, 2017 at 01:46:33AM +0200, Mark Martinec wrote:
> More news on the matter. As reported yesterday, the locally built
> kernel with options INVARIANTS and DDB works fine and somehow avoids
> the trouble at attaching the da (mps) disks on an LSI controller, so
> today I wanted to get back to a reproducible hang - and sure enough,
> reverting to the generic kernel as distributed brings back the hang.
>
> So I tried rebuilding the kernel while experimenting with options
> like DDB and INVARIANTS.
>
> A locally built GENERIC kernel behaves the same as the original
> kernel from the distribution (as installed by freebsd-update),
> so no surprises there. It hangs trying to attach the first of the
> da disks (after first successfully attaching all the ada disks).
> Alt ctrl esc is unable to enter the debugger when the hang occurs
> (possibly due to an unresponsive USB keyboard at that time),
> even though debug.kdb.break_to_debugger was set to 1 at the
> loader prompt. It needs loader "Safe mode" to be able to boot.
>
> Next, a locally built kernel with DDB and INVARIANTS works well
> (the remaining options come from an included GENERIC).
>
> Now the funny part: a locally built kernel with just the DDB
> option (and the rest included from GENERIC) *also* works well.
> Somehow the DDB option makes a difference, even though the kernel
> debugger is never activated.

One thing to try at this point would be to disable EARLY_AP_STARTUP in
the kernel config. That is, take a configuration with which you're able
to reproduce the hang during boot, and remove "options EARLY_AP_STARTUP".
This feature has a fairly large impact on the bootup process and has had
a few problems that manifested as hangs during boot. There was at least
one other case where an innocuous change to the kernel configuration
"fixed" the problem by introducing some second-order effect (causing
kernel threads to be scheduled in a different order, for instance).

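As a rough sketch of the suggestion above: assuming the test kernel is
described by a custom config file that includes GENERIC, the option can
be dropped with the nooptions directive. The file name NOEARLYAP and the
amd64 path are assumptions made for illustration, not details from this
thread.

```
# sys/amd64/conf/NOEARLYAP  (hypothetical name; architecture assumed)
include     GENERIC
ident       NOEARLYAP
nooptions   EARLY_AP_STARTUP    # drop the option inherited from GENERIC
```

Such a kernel would typically be built and installed from /usr/src with
"make buildkernel KERNCONF=NOEARLYAP" followed by
"make installkernel KERNCONF=NOEARLYAP".
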
Regardless of whether the suggestion above makes a difference, it would
be helpful to see verbose dmesgs from both a clean boot and a boot that
hangs. If disabling EARLY_AP_STARTUP helps, then we can try adding some
assertions that will cause the system to panic when the hang occurs,
making it easier to see what's going on.

> To reiterate: at the time of a hang the CPU fan starts revving up,
> and the USB keyboard is unresponsive (<scroll> does not enter scroll
> mode, caps lock and num lock do not toggle their LED indicators,
> alt ctrl esc does not activate the kernel debugger). Loader "Safe mode"
> avoids the problem (presumably by disabling SMP).
>
> Meanwhile I have successfully upgraded two other similar
> hosts from 11.0 to 11.1-RC3, no surprises there (but they do not
> have the same disk controller).
>
> Not sure what to try next.
>
>   Mark
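
For reference, a verbose boot as requested above can be selected either
once from the loader prompt or persistently in /boot/loader.conf. This is
a minimal sketch. The lines starting with "#" are annotations only and
are not meant to be typed at the loader prompt, and the two debugger
tunables are the ones mentioned earlier in this thread. Capturing the
output of the hanging boot on a machine without a serial console remains
a separate problem.

```
# One-off, typed at the loader "OK" prompt:
set debug.kdb.break_to_debugger=1
set debug.kdb.alt_break_to_debugger=1
boot -v

# Persistent, as a line in /boot/loader.conf:
boot_verbose="YES"
```

After a boot that completes, the kernel messages are normally also
available in /var/run/dmesg.boot.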