Kai Gallasch
2014-Dec-09 08:34 UTC
10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
On Thu, 06 Nov 2014 01:20:47 +0000, Steven Hartland <killing at multiplay.co.uk> wrote:

> Try recabling and re-seating, if it still happens try to identify if
> its the disk or backplane by moving it in the chassis. We had a
> machine here recently where it was backplane issue and simply
> replacing it fixed the issue.

Steven.

Over the last few weeks I took some time to isolate the cause of the AHCI timeouts with the two Samsung SSDs.

Just for the record, my original post in the FreeBSD mailing list archive:

http://lists.freebsd.org/pipermail/freebsd-stable/2014-November/080914.html

I tried the following to get rid of the AHCI timeouts, but no luck, they still show up :-/

Hardware:

- Replaced all four SATA cables with cables from an identical spare server
- Replaced all four SATA cables with certified SATA3 cables
- Replaced the 2.5" -> 3.5" drive converters with ones from another manufacturer
- Replaced the drive backplane of the server
- Connected the two SSDs directly to the SATA connectors on the mainboard
- Experimentally put an LSI 9212-4i4e PCIe SATA/SAS controller into the server and connected the SATA cables to it
- Same as before, but using the certified SATA3 cables
- Same as before, but this time connecting the two SSDs directly to the 9212-4i4e
- Same as before, connecting the two SSDs directly to the 9212-4i4e, but this time with the original SATA cables

BIOS:

- Temporarily disabled Power Management
- Tried disabling the "Enable Hot Plug" option

The difference between using the SATA connectors of the mainboard and using the LSI 9212-4i4e is that the LSI controller seems to be pickier about CRC errors on the SATA bus, and bus problems show up even without starting a ZFS scrub. When doing a scrub using the LSI controller there are plenty of timeouts, and in one test one of the SSDs even disappeared from the SATA bus.

Throughout all of this testing, the two Hitachi non-SSD SATA drives did not show any problems at all, although they were also connected to the mainboard or the LSI controller.

So I now think the whole problem centers around the Samsung 850 PRO 512GB SSDs. Too bad I do not have the budget to just buy two Intel (or other) SSDs of similar size and see if the timeouts disappear.

I wonder if this is a firmware issue with the drive, or just some misguided fancy energy-saving feature of this particular drive model causing the whole trouble.

Both drives have serial numbers not far apart, and smartctl claims there are no errors on the SSDs.

Any ideas (left)?

Regards,
Kai.

-- 
PGP-KeyID = 0xE401B671927D4A5C

I am not a robot.
Ganael LAPLANCHE
2014-Dec-09 09:04 UTC
10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
On Tue, 9 Dec 2014 09:34:05 +0100, Kai Gallasch wrote:

Hi Kai,

> Any ideas (left) ?

There is a PR for AHCI timeouts:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195349

I don't know if it is related to your problem, but maybe you can try the suggested workaround?

Regards,

-- 
Ganael LAPLANCHE <ganael.laplanche at martymac.org>
http://www.martymac.org | http://contribs.martymac.org
FreeBSD: martymac <martymac at FreeBSD.org>, http://www.FreeBSD.org
Steven Hartland
2014-Dec-09 09:25 UTC
10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
On 09/12/2014 08:34, Kai Gallasch wrote:
> On Thu, 06 Nov 2014 01:20:47 +0000,
> Steven Hartland <killing at multiplay.co.uk> wrote:
>
>> Try recabling and re-seating, if it still happens try to identify if
>> its the disk or backplane by moving it in the chassis. We had a
>> machine here recently where it was backplane issue and simply
>> replacing it fixed the issue.
> Steven.
>
> In the last weeks I took some time to single out the reason for the AHCI
> timeouts with the two Samsung SSD drives.
>
> Just for the record, my original post on the FreeBSD mailing
> list archive:
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2014-November/080914.html
>
>
> I changed / tried the following to get rid of the AHCI timeouts, but no
> chance, they still show :-/
>
> Hardware:
>
> - Changed all four SATA cables with cables of an identical spare server
> - Changed all four SATA cables with certified SATA3 cables
> - Replaced the 2.5" -> 3.5" drive converters with ones of another
>   manufacturer
> - Replaced the drive backplane of the server
> - Directly hooking the two SSDs up to the SATA connectors on the
>   mainboard
> - Experimentally put an LSI 9212-4i4e PCIe SATA/SAS Controller into the
>   server and connected the SATA cables to it.
> - Same as before, but using the certified SATA3 cables
> - Same as before, but this time connecting the two SSDs directly to the
>   9212-4i4e
> - Same as before, connecting the two SSD directly to the 9212-4i4e, but
>   this time with the original SATA cables
>
>
> BIOS:
> - Temporarily disabled Power Management
> - Tried disabling "Enable Hot Plug" Option
>
>
> The difference between using the SATA connectors of the mainboard and
> using the LSI 9212-4i4e is, that the LSI controller seems to be more
> picky about CRC errors on the SATA bus and bus problems even show
> without starting a zfs scrub. When doing a scrub using the LSI
> controller there are plenty of timeouts and in one test, one of the SSD
> drives even disappeared from the SATA bus.
>
> Of course all the time during testing the two Hitachi non-SSD SATA
> drives did not show any problems at all - although also connected to
> the mainboard or the LSI controller during the testing.
>
> So I now think the whole problem centers around the Samsung 850 PRO
> 512GB SSDs. Too bad I do not have the budget to just buy two Intel (or
> other) SSDs of similar size and see if the timeouts disappear..
>
> I wonder if this is a firmware issue with the drive or just some
> misguided fancy energy saving feature of this particular drive
> model causing the whole trouble.
>
> Both drives have serial numbers not far apart and smartctl claims there
> are no errors on the SSDs.
>
> Any ideas (left) ?
>

Have you tried dropping the speed on the AHCI controller, e.g.

hint.ahcich.0.sata_rev="2"

I can't say I've used Samsung 850 Pros, but we do have plenty of 840s attached to LSI controllers in service here and have never had an issue.

The other issue you might have is a bad MB, where the problem is actually occurring in memory. Given that you have replaced the controller and cables, this might be your issue, so try replacing the memory, MB, CPU etc. The last time I had really bad corruption issues, the problem turned out to be a dodgy Intel CPU.

Also, someone posted on the list recently that they had constant CKSUM errors from ZFS; it turned out to be their power causing the issue, and running the server off a sine-wave-based UPS made the problem go away.

Regards
Steve
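
A minimal sketch of the speed-limit suggestion above, assuming the two SSDs attach to AHCI channels 0 and 1. The channel numbers are an assumption made for illustration; the boot messages show which ahcichN each disk actually sits on. Per ahci(4), a sata_rev value of 1, 2 or 3 caps the link at 1.5, 3 or 6 Gbps respectively, so "2" limits the channel to SATA revision 2. The hints are loader tunables, so they would go into /boot/loader.conf and take effect at the next boot:

```
# /boot/loader.conf
# Channel numbers 0 and 1 are assumed here; adjust them to the ahcichN
# that each SSD really attaches to.
# Cap each channel at SATA revision 2 (3 Gbps).
hint.ahcich.0.sata_rev="2"
hint.ahcich.1.sata_rev="2"
```

After rebooting, the negotiated transfer rate reported in each disk's attach line in dmesg should reflect the lower speed. If the timeouts then stop during a scrub, that would point at link-level trouble at 6 Gbps rather than at ZFS or the controller driver.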