Zaphod Beeblebrox
2019-Apr-11 18:52 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger <karl at denninger.net> wrote:

> In this specific case the adapter in question is...
>
> mps0: <Avago Technologies (LSI) SAS2116> port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>
> Which is indeed a "dumb" HBA (in IT mode), and Zeephod says he connects
> his drives via dumb on-MoBo direct SATA connections.

Maybe I'm in good company. My current setup has 8 of the disks connected to:

mps0: <Avago Technologies (LSI) SAS2308> port 0xb000-0xb0ff mem 0xfe240000-0xfe24ffff,0xfe200000-0xfe23ffff irq 32 at device 0.0 on pci6
mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>

... just with a cable that breaks out each of the 2 connectors into 4 SATA-style connectors, and the other 8 disks (plus boot disks and SSD cache/log) connected to ports on...

- ahci0: <ASMedia ASM1062 AHCI SATA controller> port 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem 0xfe900000-0xfe9001ff irq 44 at device 0.0 on pci2
- ahci2: <Marvell 88SE9230 AHCI SATA controller> port 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem 0xfe610000-0xfe6107ff irq 40 at device 0.0 on pci7
- ahci3: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0

... each drive connected to a single port.

I can actually reproduce this at will. Because I have 16 drives, when one fails, I need to find it. I pull the SATA cable for a drive, determine if it's the drive in question, and if not, reconnect it, "ONLINE" it and wait for the resilver to stop... usually only a minute or two.

... if I do this 4 to 6 odd times to find a drive (I can tell, in general, whether a drive is on the SAS controller or the SATA controllers... so I'm only looking among 8, ever) ... then I "REPLACE" the problem drive. More often than not, a scrub will then find a few problems. In fact, it appears that the most recent scrub is an example:

[1:7:306]dgilbert at vr:~> zpool status
  pool: vr1
 state: ONLINE
  scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr 1 23:12:03 2019
config:

        NAME              STATE     READ WRITE CKSUM
        vr1               ONLINE       0     0     0
          raidz2-0        ONLINE       0     0     0
            gpt/v1-d0     ONLINE       0     0     0
            gpt/v1-d1     ONLINE       0     0     0
            gpt/v1-d2     ONLINE       0     0     0
            gpt/v1-d3     ONLINE       0     0     0
            gpt/v1-d4     ONLINE       0     0     0
            gpt/v1-d5     ONLINE       0     0     0
            gpt/v1-d6     ONLINE       0     0     0
            gpt/v1-d7     ONLINE       0     0     0
          raidz2-2        ONLINE       0     0     0
            gpt/v1-e0c    ONLINE       0     0     0
            gpt/v1-e1b    ONLINE       0     0     0
            gpt/v1-e2b    ONLINE       0     0     0
            gpt/v1-e3b    ONLINE       0     0     0
            gpt/v1-e4b    ONLINE       0     0     0
            gpt/v1-e5a    ONLINE       0     0     0
            gpt/v1-e6a    ONLINE       0     0     0
            gpt/v1-e7c    ONLINE       0     0     0
        logs
          gpt/vr1log      ONLINE       0     0     0
        cache
          gpt/vr1cache    ONLINE       0     0     0

errors: No known data errors

... it doesn't say it now, but there were 5 CKSUM errors on one of the drives that I had trial-removed (and not on the one replaced).
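
Since the counters had already been cleared by the time the status above was captured, it may help to save the `zpool status` text right after each scrub and check it mechanically. What follows is only a minimal sketch, not something from the original report: the function name `devices_with_errors` and the sample text are hypothetical, and the parsing assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above with plain integer counters.

```python
import re

# Matches a config row such as "gpt/v1-d0  ONLINE  0  0  0".
# Assumption: counters are plain integers. zpool can abbreviate large
# counts (for example "1.2K"), which this sketch does not handle.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header row, "logs"/"cache" labels, blank lines
        name, _state, read, write, cksum = m.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


# Hypothetical sample: the nonzero counter is illustrative only and is
# NOT taken from the pool in this thread.
SAMPLE = """\
  pool: vr1
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        vr1            ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/v1-d0  ONLINE       0     0     0
            gpt/v1-d1  ONLINE       0     0     5
"""

if __name__ == "__main__":
    for name, (r, w, c) in devices_with_errors(SAMPLE).items():
        print(f"{name}: read={r} write={w} cksum={c}")
```

Comparing the flagged names against the list of drives that were trial-removed would show whether the pattern described here (errors only on drives that were pulled and re-onlined) holds for a given scrub.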
Karl Denninger
2019-Apr-11 18:57 UTC
Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On 4/11/2019 13:52, Zaphod Beeblebrox wrote:

> [...]
> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
> drives that I had trial-removed (and not on the one replaced).

That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that, after a scrub, comes up with the checksum errors. It does *not* flag any errors during the resilver, and the drives *not* taken offline do not (ever) show checksum errors either.

Interestingly enough, you have 19.00.00.00 firmware on your card as well -- which is what was on mine.

I have flashed my card forward to 20.00.07.00 -- we'll see if it still does it when I do the next swap of the backup set.
-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
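
Because the comparison in this thread hinges on which mps firmware each card is running, a small sketch for pulling that out of saved boot messages may be useful. It is an illustration only: the helper names `mps_versions` and `firmware_tuple` are hypothetical, and the pattern assumes the `mpsN: Firmware: ..., Driver: ...` line format quoted in the posts above.

```python
import re

# Matches lines like "mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd".
# Assumption: the line format is exactly as quoted in this thread.
FW = re.compile(r"^(mps\d+): Firmware: ([\d.]+), Driver: (\S+)")


def mps_versions(dmesg_text):
    """Return {controller: (firmware, driver)} found in boot-message text."""
    found = {}
    for line in dmesg_text.splitlines():
        m = FW.match(line.strip())
        if m:
            found[m.group(1)] = (m.group(2), m.group(3))
    return found


def firmware_tuple(version):
    """Turn "19.00.00.00" into (19, 0, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    # The two firmware lines quoted in this thread.
    before = "mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd"
    after = "mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd"
    old = mps_versions(before)["mps0"][0]
    new = mps_versions(after)["mps0"][0]
    print(old, "<", new, "is", firmware_tuple(old) < firmware_tuple(new))
```

Recording this alongside each scrub result would make it easier to tell later whether the checksum errors stop after the move from firmware 19 to 20.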