Karl Denninger
2019-Apr-20 14:39 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On 4/13/2019 06:00, Karl Denninger wrote:
> On 4/11/2019 13:57, Karl Denninger wrote:
>> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger <karl at denninger.net> wrote:
>>>
>>>> In this specific case the adapter in question is...
>>>>
>>>> mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
>>>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
>>>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>>>> mps0: IOCCapabilities:
>>>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>>>>
>>>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>>>> his drives via dumb on-MoBo direct SATA connections.
>>>>
>>> Maybe I'm in good company.  My current setup has 8 of the disks connected to:
>>>
>>> mps0: <Avago Technologies (LSI) SAS2308> port 0xb000-0xb0ff mem
>>> 0xfe240000-0xfe24ffff,0xfe200000-0xfe23ffff irq 32 at device 0.0 on pci6
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>>>
>>> ... just with a cable that breaks out each of the 2 connectors into 4
>>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>>> cache/log) connected to ports on...
>>>
>>> - ahci0: <ASMedia ASM1062 AHCI SATA controller> port
>>>   0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
>>>   0xfe900000-0xfe9001ff irq 44 at device 0.0 on pci2
>>> - ahci2: <Marvell 88SE9230 AHCI SATA controller> port
>>>   0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
>>>   0xfe610000-0xfe6107ff irq 40 at device 0.0 on pci7
>>> - ahci3: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port
>>>   0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
>>>   0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>>
>>> ... each drive connected to a single port.
>>>
>>> I can actually reproduce this at will.  Because I have 16 drives, when one
>>> fails, I need to find it.  I pull the SATA cable for a drive and determine
>>> whether it's the drive in question; if not, I reconnect it, "ONLINE" it and
>>> wait for the resilver to stop... usually only a minute or two.
>>>
>>> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
>>> whether a drive is behind the SAS controller or the SATA controllers, so
>>> I'm only ever looking among 8) ... then I "REPLACE" the problem drive.
>>> More often than not, a scrub will then find a few problems.  In fact, it
>>> appears that the most recent scrub is an example:
>>>
>>> [1:7:306]dgilbert at vr:~> zpool status
>>>   pool: vr1
>>>  state: ONLINE
>>>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03 2019
>>> config:
>>>
>>>         NAME              STATE     READ WRITE CKSUM
>>>         vr1               ONLINE       0     0     0
>>>           raidz2-0        ONLINE       0     0     0
>>>             gpt/v1-d0     ONLINE       0     0     0
>>>             gpt/v1-d1     ONLINE       0     0     0
>>>             gpt/v1-d2     ONLINE       0     0     0
>>>             gpt/v1-d3     ONLINE       0     0     0
>>>             gpt/v1-d4     ONLINE       0     0     0
>>>             gpt/v1-d5     ONLINE       0     0     0
>>>             gpt/v1-d6     ONLINE       0     0     0
>>>             gpt/v1-d7     ONLINE       0     0     0
>>>           raidz2-2        ONLINE       0     0     0
>>>             gpt/v1-e0c    ONLINE       0     0     0
>>>             gpt/v1-e1b    ONLINE       0     0     0
>>>             gpt/v1-e2b    ONLINE       0     0     0
>>>             gpt/v1-e3b    ONLINE       0     0     0
>>>             gpt/v1-e4b    ONLINE       0     0     0
>>>             gpt/v1-e5a    ONLINE       0     0     0
>>>             gpt/v1-e6a    ONLINE       0     0     0
>>>             gpt/v1-e7c    ONLINE       0     0     0
>>>         logs
>>>           gpt/vr1log      ONLINE       0     0     0
>>>         cache
>>>           gpt/vr1cache    ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>>> drives that I had trial-removed (and not on the one replaced).
>>>
>> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
>> after a scrub, comes up with the checksum errors.  It does *not* flag
>> any errors during the resilver, and the drives *not* taken offline do not
>> (ever) show checksum errors either.
>>
>> Interestingly enough you have 19.00.00.00 firmware on your card as well
>> -- which is what was on mine.
>>
>> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
>> does it when I do the next swap of the backup set.
> Verrrrrryyyyy interesting.
>
> This drive was last written/read under 19.00.00.00.  Yesterday I swapped
> it back in.  Note that right now I am running:
>
> mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>
> And, after the scrub completed overnight....
>
> [karl at NewFS ~]$ zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub repaired 4K in 0 days 06:30:55 with 0 errors on Sat Apr 13 01:42:04 2019
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             2650799076683778414   OFFLINE      0     0     0  was /dev/gpt/backup62-1.eli
>             gpt/backup62-2.eli    ONLINE       0     0     1
>
> errors: No known data errors
>
> The OTHER interesting data point is that the resilver *also* posted one
> checksum error, which I cleared before doing the scrub.  Both were on the
> 62-2 device.
>
> That would be one block in both cases.  Based on 19.00.00.00 I expected
> several (maybe a half-dozen) checksum errors during the scrub but *zero*
> during the resilver.
>
> The unit which was put *into* the vault and is now offline was written
> and scrubbed under 20.00.07.00.  The behavior change certainly implies
> that there are some differences, and again, none of these OFFLINE-state
> situations are uncontrolled -- in each case the drive is taken offline
> intentionally, the geli provider is detached, and then the unit has
> "camcontrol standby" executed against it before it is yanked, so in
> theory at least there should be no way for an unflushed but write-cached
> block to be lost or damaged.
>
> I smell a rat, but it may well be in the 19.00.00.00 firmware on the card...

I can confirm that 20.00.07.00 does *not* stop this.

The previous write/scrub on this device was on 20.00.07.00.  It was
swapped back in from the vault yesterday, resilvered without incident,
but a scrub says....

root at NewFS:/home/karl # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                       STATE     READ WRITE CKSUM
        backup                     DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            gpt/backup61.eli       ONLINE       0     0     0
            gpt/backup62-1.eli     ONLINE       0     0    47
            13282812295755460479   OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

So this is firmware-invariant (at least between 19.00.00.00 and
20.00.07.00); the issue persists.

Again, in my instance these devices are never removed "unsolicited", so
there can't be (or at least shouldn't be able to be) unflushed data in the
device or kernel cache.  The procedure is and remains:

zpool offline .....
geli detach .....
camcontrol standby ...

Wait a few seconds for the spindle to spin down.

Remove disk.

Then of course on the other side, after insertion and the kernel has
reported "finding" the device:

geli attach ...
zpool online ....

Wait...

If this is a boogered TXG that's held in the metadata for the
"offline"'d device (maybe "off by one"?), that's potentially bad in that
if there is an unknown failure in the other mirror component, the
resilver will complete but data has been irrevocably destroyed.

Granted, this is a very low probability scenario (the area where the bad
checksums are has to be where the corruption hits, and it has to happen
between the resilver and access to that data).  Those are long odds, but
nonetheless a window of "you're hosed" does appear to exist.

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
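For readers following along, the swap cycle described above works out to
roughly the commands below.  This is only a sketch: the pool name and GELI
label come from the thread, while the da6 CAM device number and the final
scrub step are illustrative.

    # take one side of the mirror out of service before pulling the disk
    zpool offline backup gpt/backup62-2.eli
    geli detach gpt/backup62-2.eli
    camcontrol standby da6      # da6 is illustrative; use the disk's actual CAM device
    # wait a few seconds for spin-down, then remove the disk

    # later, after re-insertion, once the kernel reports the device:
    geli attach gpt/backup62-2  # prompts for the passphrase unless a keyfile is configured
    zpool online backup gpt/backup62-2.eli
    # let the resilver finish, then:
    zpool scrub backup
    zpool status -v backup      # the checksum errors discussed above show up here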
Steven Hartland
2019-Apr-20 15:50 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Have you eliminated geli as a possible source?

I've just set up an old server which has an LSI 2008 running old FW
(11.0), so I was going to have a go at reproducing this.

Apart from the disconnect steps below, is there anything else needed,
e.g. a read / write workload during the disconnect?

mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem
0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

    Regards
    Steve

On 20/04/2019 15:39, Karl Denninger wrote:
> I can confirm that 20.00.07.00 does *not* stop this.
>
> The previous write/scrub on this device was on 20.00.07.00.  It was
> swapped back in from the vault yesterday, resilvered without incident,
> but a scrub says....
>
> root at NewFS:/home/karl # zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         backup                     DEGRADED     0     0     0
>           mirror-0                 DEGRADED     0     0     0
>             gpt/backup61.eli       ONLINE       0     0     0
>             gpt/backup62-1.eli     ONLINE       0     0    47
>             13282812295755460479   OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> So this is firmware-invariant (at least between 19.00.00.00 and
> 20.00.07.00); the issue persists.
>
> Again, in my instance these devices are never removed "unsolicited", so
> there can't be (or at least shouldn't be able to be) unflushed data in the
> device or kernel cache.  The procedure is and remains:
>
> zpool offline .....
> geli detach .....
> camcontrol standby ...
>
> Wait a few seconds for the spindle to spin down.
>
> Remove disk.
>
> Then of course on the other side, after insertion and the kernel has
> reported "finding" the device:
>
> geli attach ...
> zpool online ....
>
> Wait...
>
> If this is a boogered TXG that's held in the metadata for the
> "offline"'d device (maybe "off by one"?), that's potentially bad in that
> if there is an unknown failure in the other mirror component, the
> resilver will complete but data has been irrevocably destroyed.
>
> Granted, this is a very low probability scenario (the area where the bad
> checksums are has to be where the corruption hits, and it has to happen
> between the resilver and access to that data).  Those are long odds, but
> nonetheless a window of "you're hosed" does appear to exist.
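One way to exercise that cycle on a scratch machine, under a write load,
might look like the sketch below.  Everything here is hypothetical: the
pool name "test", the labels gpt/t0 and gpt/t1, and the dd job that stands
in for "writes during the disconnect"; the sleep merely substitutes for the
physical pull and re-insert.

    # scratch two-way mirror on GELI-backed labels; geli init destroys whatever is on them
    geli init -s 4096 gpt/t0 && geli attach gpt/t0
    geli init -s 4096 gpt/t1 && geli attach gpt/t1
    zpool create test mirror gpt/t0.eli gpt/t1.eli

    # background write load while one member is cycled out and back in
    dd if=/dev/urandom of=/test/fill bs=1m count=4096 &

    zpool offline test gpt/t1.eli
    geli detach gpt/t1.eli
    sleep 60                        # stands in for the physical pull / re-insert
    geli attach gpt/t1
    zpool online test gpt/t1.eli    # resilver runs

    wait                            # let the dd finish
    zpool scrub test
    # once the scrub completes:
    zpool status -v test            # look for CKSUM counts on the cycled member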