Karl Denninger
2019-Apr-09 19:01 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
I've run into something often -- and repeatably -- enough since updating to 12-STABLE that I suspect there may be a code problem lurking in the ZFS stack, or in the driver and firmware compatibility with various HBAs based on the LSI/Avago devices.

The scenario is this -- I have data sets that are RaidZ2 that are my "normal" working set; one is comprised of SSD volumes and one of spinning rust volumes. These all are normal and scrubs never show problems. I've had physical failures with them over the years (although none since moving to 12-STABLE as of yet) and have never had trouble with resilvers or other misbehavior.

I also have a "backup" pool that is a 3-member mirror, to which the volatile filesystems (that is, the zfs filesystems not set read-only) are sent with zfs send. Call them backup-i, backup-e1 and backup-e2. All disks in these pools are geli-encrypted, running on top of a freebsd-zfs partition inside a GPT partition table, using -s 4096 (4k) geli "sectors".

Two of the backup mirror members are always in the machine; backup-i (the base internal drive) is never removed. The third is in a bank vault. Every week the vault drive is exchanged with the other, so that the "first" member is never removed from the host, but the other two (-e1 and -e2) alternate. If the building burns I have a full copy of all the volatile data in the vault. (I also have mirrored copies, 2 each, of all the datasets that are operationally read-only in the vault too; those get updated quarterly if there are changes to the operationally read-only portion of the data store.) The drive in the vault is swapped weekly, so a problem should be detected almost immediately, before it can bugger me.

Before removing the disk intended to go to the vault I "offline" it and spin it down (camcontrol standby), which issues a standby-immediate to the drive, ensuring that its cache is flushed and the spindle spun down, and then pull it. I go exchange them at the bank, insert the other one, and "zpool online ..." it, which automatically resilvers it. The disk resilvers and all is well -- no errors.

Or is it all ok?

If I run a scrub on the pool as soon as the resilver completes, the disk I just inserted will *invariably* have a few checksum errors on it that the scrub fixes. It's not a large number, anywhere from a couple dozen to a hundred or so, but it's not zero -- and it damn well should be, as the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S IN USE AREA was examined, compared, and blocks not on the "new member" or changed copied over. The "-i" disk (the one that is never pulled) NEVER is the one with the checksum errors on it -- it's ALWAYS the one I just inserted and which was resilvered to.

If I zpool clear the errors and scrub again, all is fine -- no errors. If I scrub again before pulling the disk the next time to do the swap, all is fine as well. I swap the two, resilver, and I'll get a few more errors on the next scrub, ALWAYS on the disk I just put in.

Smartctl shows NO errors on the disk. No ECC, no reallocated sectors, no interface errors, no resets, nothing. Smartd is running and never posts any real-time complaints, other than the expected one a minute or two after I yank the drive to take it to the bank. There are no CAM-related errors printing on the console either.
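For concreteness, the weekly swap sequence is approximately the following (the pool name, gpt labels, and da device below are illustrative, not the actual names on this box):

    # Take the outgoing vault disk out of the mirror and spin it down
    zpool offline backup gpt/backup-e1.eli
    geli detach gpt/backup-e1.eli
    camcontrol standby da5        # standby-immediate: flush cache, stop spindle

    # ...swap drives at the bank, insert the returning disk...

    # Re-attach the returning disk; the online automatically starts a resilver
    geli attach gpt/backup-e2     # keyfile/passphrase options omitted here
    zpool online backup gpt/backup-e2.eli

    # Once the resilver reports completion, scrub -- this is the point at
    # which the checksum errors show up on the just-inserted disk
    zpool status backup
    zpool scrub backup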
So ZFS says there's a *silent* data error (bad checksum; never a read or write error) in a handful of blocks, but the disk says there have been no errors, the driver does not report any errors, and there have been no power failures, as the disk was in a bank vault -- so it COULDN'T have had a write-back cache corruption event or similar occur.

I never had trouble with this under 11.1 or before, and have been using this paradigm for something on the order of five years running on this specific machine without incident. Now I'm seeing it repeatedly and *reliably* under 12.0-STABLE. I swapped the first disk that did it, thinking it was physically defective -- the replacement did it on the next swap. In fact I've yet to record a swap-out on 12-STABLE that *hasn't* done this, and yet it NEVER happened under 11.1. At the same time I can run scrubs until the cows come home on the multiple RaidZ2 packs on the same controller and never get any checksum errors on any of them.

The firmware in the card was 19.00.00.00 -- again, this firmware *has been stable for years.* I have just rolled the firmware on the card forward to 20.00.07.00, which is the "latest" available. I had previously not moved to 20.x because earlier versions had known issues (some severe and potentially fatal to data integrity) and 19 had been working without problem -- I thus had no reason to move to 20.00.07.00. But there apparently are some fairly significant timing differences between the driver code in 11.1 and 11.2/12.0, as I discovered when the SAS expander I used to have in these boxes started returning timeout errors that were false. Again -- this same configuration was completely stable under 11.1 and previous over a period of years.

With 20.00.07.00 I have yet to have this situation recur -- so far -- but I have limited time with 20.00.07.00, and as such my confidence that the issue is in fact resolved by the card firmware change is only modest at this point. Over the next month or so, if it doesn't happen again, my confidence will of course improve.

Checksum errors on ZFS volumes are extraordinarily uncool for the obvious reason -- they imply the disk thinks the data is fine (since it is not recording any errors on the interface or at the drive level) BUT ZFS thinks the data off that particular record was corrupt, as the checksum fails. Silent corruption is the worst sort in that it can hide for months or even years before being discovered, long after your redundant copies have been re-used or overwritten.

Assuming I do not see a recurrence with the 20.00.07.00 firmware, I would suggest that an entry be added to UPDATING, the Release Notes, or the Errata stating that, for 12.x forward, card firmware revisions prior to 20.00.07.00 carry *strong* cautions, and that those with these HBAs be strongly urged to flash the card forward to 20.00.07.00 before upgrading or installing. If you get a surprise of this sort and have no second copy that is not impacted, you could find yourself severely hosed.

--
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
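For anyone checking their own systems before upgrading, the HBA firmware and driver versions and the per-device checksum counters can be inspected roughly as follows (mps0 and the pool name are examples; depending on the card it may attach via mpr(4) rather than mps(4)):

    # Firmware and driver versions as reported at attach time
    dmesg | grep -i firmware

    # The mps(4)/mpr(4) drivers also expose the firmware version via sysctl
    sysctl dev.mps.0.firmware_version

    # Per-device READ / WRITE / CKSUM counters for the backup pool
    zpool status -v backup

    # Clear the counters once a scrub has repaired the affected blocks
    zpool clear backup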
Andriy Gapon
2019-Apr-09 20:04 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On 09/04/2019 22:01, Karl Denninger wrote:
> the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S
> IN USE AREA was examined, compared, and blocks not on the "new member"
> or changed copied over.

I think that that's not entirely correct. ZFS maintains something called DTL, a dirty-time log, for a missing / offlined / removed device. When the device re-appears and gets resilvered, ZFS walks only those blocks that were born within the TXG range(s) when the device was missing.

In any case, I do not have an explanation for what you are seeing.

--
Andriy Gapon
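A minimal illustration of the distinction, using example pool and provider names: the resilver triggered by the online only touches blocks born while the disk was out of the pool, so the first full read of the re-inserted disk is the scrub that follows it.

    # The online resilvers only the blocks born in the TXG range(s)
    # recorded in the DTL while this disk was offline
    zpool online backup gpt/backup-e2.eli

    # A scrub reads and verifies every allocated copy on every vdev, so
    # latent bad blocks outside the DTL ranges can only surface here
    zpool scrub backup
    zpool status -v backup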