Terry Kennedy
2019-Sep-01 04:06 UTC
mpr causing a boot hang sometime after r348368 - NUMA related?
TL;DR - mpr controller becomes increasingly likely to hang boot when on the 2nd CPU as FreeBSD 12.0-STABLE moves forward.

I have a Dell PowerEdge R730 (configuration details available if needed) with a PERC H730 mini (mrsas driver) and a "12Gbps external HBA", Dell part number T93GD (mpr driver). An external Dell LTO4 drive is attached to the external HBA and is the only thing connected to it.

r348368 boots normally, and the HBA and tape are recognized as:

mpr0: <Avago Technologies (LSI) SAS3008> port 0x8000-0x80ff mem 0xc9100000-0xc910ffff,0xc8000000-0xc80fffff irq 64 at device 0.0 numa-domain 1 on pci17
mpr0: Firmware: 16.00.04.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1 )
sa0 at mpr0 bus 0 scbus14 target 7 lun 0

The next revision I tried was r350268. That boots most of the time, but sometimes hangs with various messages, not in any particular order, such as (forgive any typos, I could only get these as screen grabs):

mpr_config_get_dpm_pg0: request for page completed with error 60
mpr0: Out of chain frames, consider increasing hw.mpr.max_chains
(probe0:mpr0:0:7:0): Down reving Protocol Version from 4 to 0?
mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60)
mpr0: Reinit success
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config

This all happens whether or not the external tape drive is plugged into the system (unplugged at the system end, so no dangling cables). The problem goes away (with unacceptable loss of performance) if I boot in safe mode. Setting hw.mpr.disable_msi=1 and hw.mpr.disable_msix=1 has no effect.

r350970 behaves in much the same way, working sometimes but needing safe mode to boot successfully every time.
r351637 seems never to boot normally; in safe mode it boots 100% of the time.

Dell has replaced the controller and the problem persists. Since it still happens with the tape drive disconnected, I didn't have them replace the drive and cable.

The one thing I noted when Dell had the chassis open was that the slot this card is in is labeled "CPU 2", which would seem to be confirmed by the "numa-domain 1" in the working dmesg output. Unfortunately, all of the low-profile slots in this chassis are on CPU 2, and my card (and the Dell spare) is a low-profile-only part. I had the tech put the card in one of the full-height CPU 1 slots (which involved removing the card bracket and installing it "naked", which he wasn't comfortable with). Lo and behold, it boots when the card is in numa-domain 0:

mpr0: <Avago Technologies (LSI) SAS3008> port 0x2000-0x20ff mem 0x93600000-0x9360ffff,0x92500000-0x925fffff irq 32 at device 0.0 numa-domain 0 on pci4
mpr0: Firmware: 16.00.04.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1 )
sa0 at mpr0 bus 0 scbus2 target 7 lun 0

I was able to do 4 consecutive working boots before the tech got antsy and wanted to either put the card back in a low-profile slot or start the meter for billable time.

Based on this, it seems to be a timing-related issue that appears when the mpr card is on the 2nd CPU and SMP is enabled.

Any suggestions for further diagnostic information, other things to try, or (preferably) "here, try this patch"?

Terry Kennedy http://www.glaver.org New York, NY USA