Andriy Gapon
2019-Apr-10 13:45 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 vs. 20)
On 10/04/2019 04:09, Karl Denninger wrote:
> Specifically, I *explicitly* OFFLINE the disk in question, which is a
> controlled operation and *should* result in a cache flush out of the ZFS
> code into the drive before it is OFFLINE'd.
>
> This should result in the "last written" TXG that the remaining online
> members have, and the one in the offline member, being consistent.
>
> Then I "camcontrol standby" the involved drive, which forces a writeback
> cache flush and a spindown; in other words, re-ordered or not, the
> on-platter data *should* be consistent with what the system thinks
> happened before I yank the physical device.

This may not be enough for a specific [RAID] controller and a specific
configuration. It should be enough for a dumb HBA. But, for example,
mrsas(4) can simply ignore the synchronize cache command (meaning neither
the on-board cache is flushed nor the command is propagated to the disk).
So, if you use an advanced controller, it would make sense to use its own
management tool to offline a disk before pulling it.

I do not preclude the possibility of an issue in ZFS. But it's not the
only possibility either.

-- 
Andriy Gapon
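For a dumb HBA, the flush-then-pull sequence discussed above comes down to
two commands. A minimal sketch (the pool name "tank" and device name "ada1"
are placeholders, not taken from this thread):

    # Detach the mirror member; ZFS flushes outstanding writes to it
    # and records the offline state before the vdev is closed.
    zpool offline tank ada1

    # Ask the drive to flush its writeback cache and spin down, so the
    # on-platter data is consistent before the device is pulled.
    camcontrol standby ada1

On a controller whose driver may ignore the synchronize cache command (as
mrsas(4) can), the camcontrol step cannot be trusted; per the note above,
the controller's own management tool should be used to offline the disk
instead.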
Karl Denninger
2019-Apr-10 14:38 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 vs. 20)
On 4/10/2019 08:45, Andriy Gapon wrote:
> On 10/04/2019 04:09, Karl Denninger wrote:
>> Specifically, I *explicitly* OFFLINE the disk in question, which is a
>> controlled operation and *should* result in a cache flush out of the ZFS
>> code into the drive before it is OFFLINE'd.
>>
>> This should result in the "last written" TXG that the remaining online
>> members have, and the one in the offline member, being consistent.
>>
>> Then I "camcontrol standby" the involved drive, which forces a writeback
>> cache flush and a spindown; in other words, re-ordered or not, the
>> on-platter data *should* be consistent with what the system thinks
>> happened before I yank the physical device.
>
> This may not be enough for a specific [RAID] controller and a specific
> configuration. It should be enough for a dumb HBA. But, for example,
> mrsas(4) can simply ignore the synchronize cache command (meaning neither
> the on-board cache is flushed nor the command is propagated to the disk).
> So, if you use an advanced controller, it would make sense to use its own
> management tool to offline a disk before pulling it.
>
> I do not preclude the possibility of an issue in ZFS. But it's not the
> only possibility either.

In this specific case the adapter in question is...

mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

Which is indeed a "dumb" HBA (in IT mode), and Zeephod says he connects his
drives via dumb on-MoBo direct SATA connections.

What I don't know (yet) is whether the update to firmware 20.00.07.00 in
the HBA has fixed it.

The 11.2 and 12.0 revs of FreeBSD changed timing quite materially in the
mps driver through some mechanism. Prior to 11.2 I ran with a Lenovo SAS
expander connected to SATA disks without any problems at all, even across
actual disk failures through the years. But in 11.2 and 12.0 the same setup
resulted in spurious retries out of the CAM layer that allegedly came from
timeouts on individual units (which looked very much like a lost command
sent to the disk), but only on mirrored volume sets -- yet there were no
errors reported by the drive itself, nor did either of my RaidZ2 pools (one
spinning rust, one SSD) experience problems of any sort.

Flashing the HBA forward to 20.00.07.00 with the expander in resulted in
the *driver* (mps) taking disconnects and resets instead of the targets,
which in turn caused random drive fault events across all of the pools.
For obvious reasons that got backed out *fast*.

Without the expander, 19.00.00.00 has been stable over the last few months
*except* for this circumstance: an intentionally OFFLINE'd disk in a mirror
that is brought back online after some reasonably long period of time (days
to a week) resilvers successfully, but then a small number of checksum
errors -- always on the drive that was OFFLINE'd, never on the one(s) not
taken OFFLINE -- appear and are corrected when a scrub is subsequently
performed.

I am now on 20.00.07.00 and so far -- no problems. But I've yet to do the
backup disk swap on 20.00.07.00 (scheduled for late this week or Monday),
so I do not know whether the 20.00.07.00 roll-forward addresses the scrub
issue or not.
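For concreteness, the swap-and-verify cycle described above reduces to
roughly the following sequence, assuming a placeholder pool "backup" and
device "ada1" (the actual names are not given in the thread):

    # Before pulling the disk: controlled offline, then writeback-cache
    # flush and spindown.
    zpool offline backup ada1
    camcontrol standby ada1

    # Days later, after re-inserting the disk: bring it back online,
    # which starts a resilver against the remaining mirror member(s).
    zpool online backup ada1

    # Once the resilver completes, read and verify every block.  This
    # is the step that has been surfacing a handful of checksum errors
    # on the previously OFFLINE'd member.
    zpool scrub backup
    zpool status -v backup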
I have no reason to believe it is involved, but given the previously "iffy"
nature of 11.2 and 12.0 on 19.0 with the expander, it very well might be
due to what appear to be timing changes in the driver architecture.

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/