Karl Denninger
2019-Apr-13 11:00 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On 4/11/2019 13:57, Karl Denninger wrote:
> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger <karl at denninger.net> wrote:
>>
>>> In this specific case the adapter in question is...
>>>
>>> mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
>>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
>>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>>>
>>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>>> his drives via dumb on-MoBo direct SATA connections.
>>>
>> Maybe I'm in good company.  My current setup has 8 of the disks connected
>> to:
>>
>> mps0: <Avago Technologies (LSI) SAS2308> port 0xb000-0xb0ff mem
>> 0xfe240000-0xfe24ffff,0xfe200000-0xfe23ffff irq 32 at device 0.0 on pci6
>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>>
>> ... just with a cable that breaks out each of the 2 connectors into 4
>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>> cache/log) connected to ports on...
>>
>> - ahci0: <ASMedia ASM1062 AHCI SATA controller> port
>>   0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
>>   0xfe900000-0xfe9001ff irq 44 at device 0.0 on pci2
>> - ahci2: <Marvell 88SE9230 AHCI SATA controller> port
>>   0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
>>   0xfe610000-0xfe6107ff irq 40 at device 0.0 on pci7
>> - ahci3: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port
>>   0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
>>   0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>
>> ... each drive connected to a single port.
>>
>> I can actually reproduce this at will.  Because I have 16 drives, when one
>> fails, I need to find it.  I pull the SATA cable for a drive and determine
>> whether it's the drive in question; if not, I reconnect it, "ONLINE" it and
>> wait for the resilver to stop... usually only a minute or two.
>>
>> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
>> whether a drive is on the SAS controller or the SATA controllers... so I'm
>> only ever looking among 8) ... then I "REPLACE" the problem drive.  More
>> often than not, a scrub will then find a few problems.  In fact, it appears
>> that the most recent scrub is an example:
>>
>> [1:7:306]dgilbert@vr:~> zpool status
>>   pool: vr1
>>  state: ONLINE
>>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03 2019
>> config:
>>
>>         NAME              STATE     READ WRITE CKSUM
>>         vr1               ONLINE       0     0     0
>>           raidz2-0        ONLINE       0     0     0
>>             gpt/v1-d0     ONLINE       0     0     0
>>             gpt/v1-d1     ONLINE       0     0     0
>>             gpt/v1-d2     ONLINE       0     0     0
>>             gpt/v1-d3     ONLINE       0     0     0
>>             gpt/v1-d4     ONLINE       0     0     0
>>             gpt/v1-d5     ONLINE       0     0     0
>>             gpt/v1-d6     ONLINE       0     0     0
>>             gpt/v1-d7     ONLINE       0     0     0
>>           raidz2-2        ONLINE       0     0     0
>>             gpt/v1-e0c    ONLINE       0     0     0
>>             gpt/v1-e1b    ONLINE       0     0     0
>>             gpt/v1-e2b    ONLINE       0     0     0
>>             gpt/v1-e3b    ONLINE       0     0     0
>>             gpt/v1-e4b    ONLINE       0     0     0
>>             gpt/v1-e5a    ONLINE       0     0     0
>>             gpt/v1-e6a    ONLINE       0     0     0
>>             gpt/v1-e7c    ONLINE       0     0     0
>>         logs
>>           gpt/vr1log      ONLINE       0     0     0
>>         cache
>>           gpt/vr1cache    ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>> drives that I had trial-removed (and not on the one replaced).
>
> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
> after a scrub, comes up with the checksum errors.  It does *not* flag any
> errors during the resilver, and the drives *not* taken offline do not
> (ever) show checksum errors either.
>
> Interestingly enough you have 19.00.00.00 firmware on your card as well
> -- which is what was on mine.
>
> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
> does it when I do the next swap of the backup set.

Verrrrrryyyyy interesting.

This drive was last written/read under 19.00.00.00.  Yesterday I swapped it
back in.  Note that right now I am running:

mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

And, after the scrub completed overnight...

[karl@NewFS ~]$ zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 4K in 0 days 06:30:55 with 0 errors on Sat Apr 13 01:42:04 2019
config:

        NAME                     STATE     READ WRITE CKSUM
        backup                   DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            gpt/backup61.eli     ONLINE       0     0     0
            2650799076683778414  OFFLINE      0     0     0  was /dev/gpt/backup62-1.eli
            gpt/backup62-2.eli   ONLINE       0     0     1

errors: No known data errors

The OTHER interesting data point is that the resilver *also* posted one
checksum error, which I cleared before doing the scrub.  Both were on the
62-2 device.

That would be one block in both cases.  The pattern under 19.00.00.00 was
several (maybe a half-dozen) checksum errors during the scrub but *zero*
during the resilver.

The unit which was put *into* the vault and is now offline was written and
scrubbed under 20.00.07.00.  The behavior change certainly implies that
there are some differences, and again, none of these OFFLINE-state
situations are uncontrolled -- in each case the drive is taken offline
intentionally, the geli provider is detached, and then the unit has
"camcontrol standby" executed against it before it is yanked, so in theory
at least there should be no way for an unflushed but write-cached block to
be lost or damaged.

I smell a rat but it may well be in the 19.00.00.00 firmware on the card...

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
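[Editorial sketch, not part of the original post: the controlled swap cycle
described above amounts to roughly the following shell sequence.  The pool
name "backup", the GPT label "gpt/backup62-2" and the CAM device "da5" are
placeholders, and geli keyfile/passphrase handling is omitted.]

# Take the outgoing mirror member out of service cleanly.
zpool offline backup gpt/backup62-2.eli    # stop ZFS from issuing I/O to it
geli detach gpt/backup62-2                 # flush and drop the encrypted provider
camcontrol standby da5                     # spin the drive down before pulling it

# After physically swapping in the other member and the kernel reports it:
geli attach gpt/backup62-2                 # prompts for the passphrase unless a key is supplied
zpool online backup gpt/backup62-2.eli     # triggers the (incremental) resilver
zpool status backup                        # watch the resilver, then scrub to verify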
Karl Denninger
2019-Apr-20 14:39 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On 4/13/2019 06:00, Karl Denninger wrote:
>> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
>> does it when I do the next swap of the backup set.
>
> I smell a rat but it may well be in the 19.00.00.00 firmware on the card...

I can confirm that 20.00.07.00 does *not* stop this.

The previous write/scrub on this device was on 20.00.07.00.  It was swapped
back in from the vault yesterday, resilvered without incident, but a scrub
says...

root@NewFS:/home/karl # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0    47
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

So this is firmware-invariant (at least between 19.00.00.00 and
20.00.07.00); the issue persists.

Again, in my instance these devices are never removed "unsolicited", so
there can't be (or at least shouldn't be able to be) unflushed data in the
device or kernel cache.  The procedure is and remains:

zpool offline .....
geli detach .....
camcontrol standby ...

Wait a few seconds for the spindle to spin down.

Remove disk.

Then of course on the other side, after insertion and once the kernel has
reported "finding" the device:

geli attach ...
zpool online ....

Wait...

If this is a boogered TXG that's held in the metadata for the "offline"'d
device (maybe "off by one"?), that's potentially bad, in that if there is
an unknown failure in the other mirror component the resilver will complete
but data has been irrevocably destroyed.

Granted, this is a very low-probability scenario (the area where the bad
checksums are has to be where the corruption hits, and it has to happen
between the resilver and access to that data.)  Those are long odds, but
nonetheless a window of "you're hosed" does appear to exist.

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
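[Editorial sketch, not part of the original post: since the latent CKSUM
errors in this thread show up only on a scrub and not during the resilver,
one way to narrow the window described above is to never treat the resilver
alone as sufficient -- force a scrub after every re-attach and only rotate
the other mirror member out once the pool comes back clean.  The pool and
label names are the same placeholders as before, and the status-string
matching is crude and purely illustrative.]

# Bring the returned member back and wait for the resilver to finish.
zpool online backup gpt/backup62-2.eli
while zpool status backup | grep -q "resilver in progress"; do sleep 60; done

# Scrub before trusting the mirror again; this is the step that surfaces
# the checksum errors the resilver does not report.
zpool scrub backup
while zpool status backup | grep -q "scrub in progress"; do sleep 60; done

# Only offline the other member if the pool reports healthy here.
zpool status -x backup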