On 18/01/2016 12:09, Chris Murphy wrote:
> What is the result for each drive?
>
> smartctl -l scterc <dev>
>
> Chris Murphy
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> https://lists.centos.org/mailman/listinfo/centos

SCT Error Recovery Control command not supported
That's strange, I expected the SMART test to show some issues.
Personally, I'm still not confident in that drive. Can you check
cabling? Another possibility is that there is a cable that has
vibrated into a marginal state. Probably a long shot, but if it's
easy to get physical access to the machine, and you can afford the
downtime to shut it down, open up the chassis and re-seat the drive
and cables.

Every now and then I have PCIe cards that work fine for years, then
suddenly disappear after a reboot. I re-seat them and they go back to
being fine for years. So I believe vibration does sometimes play a
role in mysterious problems that creep up from time to time.

On Mon, Jan 18, 2016 at 5:39 AM, Alessandro Baggi
<alessandro.baggi at gmail.com> wrote:
> On 18/01/2016 12:09, Chris Murphy wrote:
>>
>> What is the result for each drive?
>>
>> smartctl -l scterc <dev>
>>
>> Chris Murphy
>
> SCT Error Recovery Control command not supported
On 01/18/2016 07:47 AM, Matt Garman wrote:
> Another possibility is that there is a cable that has
> vibrated into a marginal state.

That wouldn't explain the SMART data reporting pending sectors.
According to spec, a drive might not reallocate a sector after a read
error if it's later able to read that sector successfully. That's
probably what happened here.

Drives are consumable items in computing. They have to be replaced
eventually. Read errors are often an early sign of failure. The drive
may continue to work for a while before it fails. The only question
is: is the value of whatever amount of time it has left greater than
the cost of replacing it?
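To make the pending-versus-reallocated distinction concrete, here is a minimal sketch that pulls the two relevant raw counters out of `smartctl -A` output. The function name `raw_values` is made up for illustration, and `SAMPLE` is an invented fragment in the usual attribute-table layout, not output from the drive discussed in this thread. It assumes the drive reports the attributes under the names smartctl normally uses (`Reallocated_Sector_Ct`, `Current_Pending_Sector`) and that the raw value is a plain integer.

```python
# Invented sample in the layout printed by `smartctl -A <dev>`;
# the numbers are illustrative only, not taken from this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""

WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def raw_values(smartctl_output, names=WATCHED):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns;
        # the header row and any other text are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            found[fields[1]] = int(fields[9])
    return found


if __name__ == "__main__":
    print(raw_values(SAMPLE))
```

A non-zero pending count next to a zero reallocated count matches the situation described above: the drive has seen sectors it could not read but has not remapped them yet.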
Not new: I can remember seeing DEC engineers cleaning up the contacts
on memory boards for a VAX 11/782 with a pencil eraser c.1985. It's
still a pretty standard first fix to reseat a card or connector.

On 18/01/16 15:47, Matt Garman wrote:
> That's strange, I expected the SMART test to show some issues.
> Personally, I'm still not confident in that drive. Can you check
> cabling? Another possibility is that there is a cable that has
> vibrated into a marginal state. Probably a long shot, but if it's
> easy to get physical access to the machine, and you can afford the
> downtime to shut it down, open up the chassis and re-seat the
> drive and cables.
>
> Every now and then I have PCIe cards that work fine for years,
> then suddenly disappear after a reboot. I re-seat them and they go
> back to being fine for years. So I believe vibration does
> sometimes play a role in mysterious problems that creep up from
> time to time.
>
> On Mon, Jan 18, 2016 at 5:39 AM, Alessandro Baggi
> <alessandro.baggi at gmail.com> wrote:
>> On 18/01/2016 12:09, Chris Murphy wrote:
>>>
>>> What is the result for each drive?
>>>
>>> smartctl -l scterc <dev>
>>>
>>> Chris Murphy
>>
>> SCT Error Recovery Control command not supported
On 18/01/2016 16:47, Matt Garman wrote:
> That's strange, I expected the SMART test to show some issues.
> Personally, I'm still not confident in that drive. Can you check
> cabling? Another possibility is that there is a cable that has
> vibrated into a marginal state. Probably a long shot, but if it's
> easy to get physical access to the machine, and you can afford the
> downtime to shut it down, open up the chassis and re-seat the drive
> and cables.
>
> Every now and then I have PCIe cards that work fine for years, then
> suddenly disappear after a reboot. I re-seat them and they go back to
> being fine for years. So I believe vibration does sometimes play a
> role in mysterious problems that creep up from time to time.
>
> On Mon, Jan 18, 2016 at 5:39 AM, Alessandro Baggi
> <alessandro.baggi at gmail.com> wrote:
>> On 18/01/2016 12:09, Chris Murphy wrote:
>>>
>>> What is the result for each drive?
>>>
>>> smartctl -l scterc <dev>
>>>
>>> Chris Murphy
>>
>> SCT Error Recovery Control command not supported

This is a notebook.
On Mon, Jan 18, 2016, 4:39 AM Alessandro Baggi
<alessandro.baggi at gmail.com> wrote:
> On 18/01/2016 12:09, Chris Murphy wrote:
> > What is the result for each drive?
> >
> > smartctl -l scterc <dev>
> >
> > Chris Murphy
>
> SCT Error Recovery Control command not supported

The drive is disqualified unless your use case can tolerate the
possibly very high error recovery time for these drives.

Do a search for Red Hat documentation on the SCSI Command Timer. By
default this is 30 seconds. You'll have to raise this to 120 or maybe
even 180, depending on the maximum time the drive attempts to recover.
The SCSI Command Timer is a kernel setting per block device.
Basically, when the timer expires the kernel gives up and resets the
link to the drive, because while the drive is in deep recovery it
doesn't respond to anything.

Chris Murphy
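As a rough illustration of the per-device timer described above: on Linux the value is exposed through sysfs at `/sys/block/<dev>/device/timeout`, in seconds. The sketch below reads it and raises it if it is lower than the requested value. The helper names (`timeout_file`, `read_timeout`, `raise_timeout`) and the `sysfs_root` parameter are invented for this example; `sda` and 180 are placeholders. Writing to the real file needs root, and the change does not survive a reboot.

```python
from pathlib import Path


def timeout_file(dev, sysfs_root="/sys"):
    """Path of the SCSI command timer for a block device such as 'sda'."""
    return Path(sysfs_root) / "block" / dev / "device" / "timeout"


def read_timeout(dev, sysfs_root="/sys"):
    """Current command timer in seconds."""
    return int(timeout_file(dev, sysfs_root).read_text().strip())


def raise_timeout(dev, seconds=180, sysfs_root="/sys"):
    """Raise the timer to `seconds` if it is currently lower.

    Never lowers an existing value. Returns the value in effect afterwards.
    """
    current = read_timeout(dev, sysfs_root)
    if seconds > current:
        timeout_file(dev, sysfs_root).write_text(f"{seconds}\n")
        return seconds
    return current


if __name__ == "__main__":
    # 'sda' is a placeholder device name; run as root on a real system.
    print(raise_timeout("sda", 180))
```

The `sysfs_root` argument exists only so the functions can be tried against a scratch directory instead of the live `/sys` tree.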
Chris Murphy wrote:
> On Mon, Jan 18, 2016, 4:39 AM Alessandro Baggi
> <alessandro.baggi at gmail.com> wrote:
>> On 18/01/2016 12:09, Chris Murphy wrote:
>> > What is the result for each drive?
>> >
>> > smartctl -l scterc <dev>
>> >
>> SCT Error Recovery Control command not supported
>>
> The drive is disqualified unless your use case can tolerate the
> possibly very high error recovery time for these drives.
>
> Do a search for Red Hat documentation on the SCSI Command Timer. By
> default this is 30 seconds. You'll have to raise this to 120 or maybe
> even 180 depending on the maximum time the drive attempts to recover.
> The SCSI Command Timer is a kernel setting per block device. Basically
> it's giving up, and resetting the link to the drive because while the
> drive is in deep recovery it doesn't respond to anything.

Replace the drive. Yesterday.

mark