Mark Martinec
2018-Dec-27 15:10 UTC
mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting
2018-12-26 22:26, Terry Kennedy wrote:
> The earlier LSI P20 releases were pretty flakey in some cases - try
> flashing 20.00.07.00.

Indeed. I upgraded the LSI SAS2308 firmware from 20.00.02.00 to
20.00.07.00 a week ago, left it running for a while with 11.2,
then upgraded again to 12.0, and the controller is stable now,
even with the new mps driver that came with 12.0.

To recap:

- mps driver from FreeBSD 11.2 and earlier is stable with SAS2308
  firmware 20.00.02.00 _and_ 20.00.07.00

- mps driver from FreeBSD 12.0 causes frequent controller resets
  with SAS2308 firmware 20.00.02.00 (and ZFS can't cope with that),
  but is stable with 20.00.07.00.

Mark


2018-12-17 16:52, Mark Martinec wrote:
> One of our servers that was upgraded from 11.2 to 12.0 (to RC2
> initially, then to RC3 and lastly to 12.0-RELEASE) is suffering
> severe instability of a disk controller, which resets itself a
> couple of times a day, usually associated with high disk usage
> (like poudriere builds or zfs scrub or nightly file system scans).
> The same setup was rock-solid under 11.2 (and still/again is).
>
> The disk controller is LSI SAS2308. It has four disks attached as
> JBODs, one pair of SSDs and one pair of hard disks, each pair
> forming its own zpool. A controller reset can occur regardless of
> which pair is in heavy use.
>
> The following can be found in logs, just before the machine becomes
> unusable (although not always logged, as disks may be dropped before
> syslog has a chance of writing anything):
>
> xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
> xxx kernel: [2382] mps0: Reinitializing controller
> xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
> xxx kernel: [2383] mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
> xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack
>
> The IOC Fault location is always the same.
> Apparently the disk controller resets, all disk devices are dropped
> and ZFS finds itself with no disks. The machine still responds to
> ping, and if logged in during the event and running
> zpool status -v 1, zfs reports loss of all devices for each pool:
>
>   pool: data0
>  state: UNAVAIL
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run 'zpool clear'.
>    see: http://illumos.org/msg/ZFS-8000-HC
>   scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         data0                     UNAVAIL      0     0     0
>           mirror-0                UNAVAIL      0    24     0
>             2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
>             16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC
>
> (and similar for the other pool)
>
> At this point the machine is unusable and needs to be hard-reset.
>
> My guess is that after the controller resets, disk devices come up
> again (according to the report seen on the console, stating
> 'periph destroyed' first, then listing full info on each disk) -
> but zfs ignores them.
>
> I don't see any mention of changes to the mps driver in the 12.0
> release notes, although diff-ing its sources between 11.2 and 12.0
> shows plenty of nontrivial changes.
>
> After suffering this instability for some time, I finally downgraded
> the OS to 11.2, and things are back to normal again!
>
> This downgrade path was nontrivial, as I had foolishly upgraded pool
> features to what comes with 12.0, so downgrading involved
> dismantling both zfs mirror pools, recreating the pools without the
> two new features, and copying data with zfs send/receive, all while
> the machine hung during some of these operations. Not something for
> the faint of heart. I know, foolish of me to upgrade pools after
> just one day of uptime with 12.0.
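
For anyone who wants to catch this state from a script, the zpool
status output quoted above can be checked mechanically. What follows is
only a minimal sketch in Python, not part of the original report: the
function name unavailable_pools is made up for illustration, and the
sample text is a trimmed copy of the output shown above. It assumes the
usual "pool:" / "state:" headers and a "was <path>" suffix on removed
devices, and has not been checked against other zpool status layouts.

```python
import re

# Trimmed copy of the 'zpool status' output quoted in this thread.
SAMPLE = """\
  pool: data0
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
config:

        NAME                      STATE     READ WRITE CKSUM
        data0                     UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0    24     0
            2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
            16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC
"""

POOL_RE = re.compile(r"\s*pool:\s+(\S+)")
STATE_RE = re.compile(r"\s*state:\s+(\S+)")
# A device row in REMOVED state, optionally followed by "was <old path>".
REMOVED_RE = re.compile(r"\s*(\S+)\s+REMOVED\s+\d+\s+\d+\s+\d+(?:\s+was\s+(\S+))?")


def unavailable_pools(status_text):
    """Map each UNAVAIL pool to the list of its REMOVED devices."""
    result = {}
    pool = None
    for line in status_text.splitlines():
        m = POOL_RE.match(line)
        if m:
            pool = m.group(1)
            continue
        m = STATE_RE.match(line)
        if m and pool is not None:
            if m.group(1) == "UNAVAIL":
                result[pool] = []
            continue
        m = REMOVED_RE.match(line)
        if m and pool in result:
            # Prefer the former device path; fall back to the numeric GUID.
            result[pool].append(m.group(2) or m.group(1))
    return result


if __name__ == "__main__":
    for name, devices in unavailable_pools(SAMPLE).items():
        print(name, "is UNAVAIL; removed devices:", ", ".join(devices))
```

On a live system the text would come from running zpool status rather
than from a constant. Once the disks are gone the machine may not be
able to start new processes at all, so a check like this is only useful
if it is already running when the reset happens.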
>
> Some info on the controller:
>
> kernel: mps0: <Avago Technologies (LSI) SAS2308> port 0xf000-0xf0ff
>   mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 64
>   at device 0.0 numa-domain 1 on pci11
> kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
>
> mpsutil shows:
>
> mps0 Adapter:
>        Board Name: LSI2308-IT
>    Board Assembly:
>         Chip Name: LSISAS2308
>     Chip Revision: ALL
>     BIOS Revision: 7.39.00.00
> Firmware Revision: 20.00.02.00
>   Integrated RAID: no
>
>
> So, what has changed in the mps driver for this to be happening?
> Would it be possible to take mps driver sources from 11.2, transplant
> them to 12.0, recompile, and use that? Could the new mps driver be
> using some new feature of the controller and hitting a firmware bug?
> I have resisted upgrading the SAS2308 firmware and its BIOS, as it is
> working very well under 11.2.
>
> Anyone else seen problems with the mps driver and LSI SAS2308
> controller?
>
> (btw, on another machine the mps driver with LSI SAS2004 is working
> just fine under 12.0)
>
> Mark
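
To check whether another machine is in the same situation, the kernel
messages quoted above can be scanned for the reset signature and the
reported firmware version. The Python below is a rough sketch under
stated assumptions, not a tested tool: the names summarize and
older_than_known_good are made up for illustration, the regular
expressions are written only against the log lines shown in this
thread, and treating every firmware below 20.00.07.00 as suspect is an
extrapolation from the single bad version (20.00.02.00) reported here.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this thread.
FAULT_RE = re.compile(r"(mps\d+): IOC Fault (0x[0-9a-fA-F]+), Resetting")
VERSION_RE = re.compile(r"(mps\d+): Firmware: ([\d.]+), Driver: (\S+)")

# Firmware reported as stable with the 12.0 mps driver in this thread.
KNOWN_GOOD = (20, 0, 7, 0)

SAMPLE_LOG = [
    "xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting",
    "xxx kernel: [2382] mps0: Reinitializing controller",
    "xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd",
    "xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack",
]


def summarize(log_lines):
    """Count IOC fault codes and record firmware/driver per controller."""
    faults = Counter()
    versions = {}
    for line in log_lines:
        m = FAULT_RE.search(line)
        if m:
            faults[m.group(2)] += 1
            continue
        m = VERSION_RE.search(line)
        if m:
            versions[m.group(1)] = (m.group(2), m.group(3))
    return faults, versions


def older_than_known_good(firmware):
    """True if a dotted firmware string sorts below 20.00.07.00.

    Assumption: versions compare field by field as integers.
    """
    return tuple(int(part) for part in firmware.split(".")) < KNOWN_GOOD


if __name__ == "__main__":
    faults, versions = summarize(SAMPLE_LOG)
    for code, count in faults.items():
        print("IOC fault", code, "seen", count, "time(s)")
    for ctrl, (fw, drv) in versions.items():
        note = "older than 20.00.07.00" if older_than_known_good(fw) else "ok"
        print(ctrl, "firmware", fw, "driver", drv, "->", note)
```

Fed with the contents of /var/log/messages instead of the sample list,
this would show whether the fault code is always the same and which
firmware was running at the time. As noted above, not every reset gets
logged, so a clean result does not prove the controller never reset.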