thr3ads.net - CentOS - [CentOS] HDD badblocks [Jan 2016]

Alessandro Baggi

2016-Jan-18 11:39 UTC

[CentOS] HDD badblocks

On 18/01/2016 12:09, Chris Murphy wrote:
> What is the result for each drive?
>
> smartctl -l scterc <dev>
>
>
> Chris Murphy
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> https://lists.centos.org/mailman/listinfo/centos

SCT Error Recovery Control command not supported

Matt Garman

2016-Jan-18 15:47 UTC

That's strange, I expected the SMART test to show some issues.
Personally, I'm still not confident in that drive.  Can you check
cabling?  Another possibility is that there is a cable that has
vibrated into a marginal state.  Probably a long shot, but if it's
easy to get physical access to the machine, and you can afford the
downtime to shut it down, open up the chassis and re-seat the drive
and cables.

Every now and then I have PCIe cards that work fine for years, then
suddenly disappear after a reboot.  I re-seat them and they go back to
being fine for years.  So I believe vibration does sometimes play a
role in mysterious problems that creep up from time to time.

Gordon Messmer

2016-Jan-18 17:19 UTC

On 01/18/2016 07:47 AM, Matt Garman wrote:
> Another possibility is that there is a cable that has
> vibrated into a marginal state.

That wouldn't explain the SMART data reporting pending sectors.

According to spec, a drive might not reallocate a sector after a read
error if it's later able to read that sector successfully.  That's
probably what happened here.

Drives are consumable items in computing.  They have to be replaced 
eventually.  Read errors are often an early sign of failure.  The drive 
may continue to work for a while before it fails.  The only question is: 
is the value of whatever amount of time it has left greater than the 
cost of replacing it?
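
For anyone following along: the pending-sector count referred to above is
SMART attribute 197 (Current_Pending_Sector), and attribute 5
(Reallocated_Sector_Ct) counts sectors the drive has actually remapped.
Below is a minimal sketch for pulling those raw values out of `smartctl -A`
output.  The helper name `smart_raw_value` is made up for illustration, and
it assumes the usual ATA attribute table layout, with the attribute ID in
the first column and the raw value in the tenth.

```shell
#!/bin/sh
# Hypothetical helper: print the RAW_VALUE of one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute ID.
# Assumes the standard ATA attribute table, where column 1 is the ID
# and column 10 is the raw value.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Usage (needs root and smartmontools; /dev/sda is only an example):
#   smartctl -A /dev/sda | smart_raw_value 197   # Current_Pending_Sector
#   smartctl -A /dev/sda | smart_raw_value 5     # Reallocated_Sector_Ct
```

A pending count that drops back to zero while the reallocated count stays
unchanged would be consistent with the behaviour described above.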

J Martin Rushton

2016-Jan-18 23:34 UTC

Not new: I can remember seeing DEC engineers cleaning up the contacts
on memory boards for a VAX 11/782 with a pencil eraser c.1985.  It's
still a pretty standard first fix to reseat a card or connector.

On 18/01/16 15:47, Matt Garman wrote:
> That's strange, I expected the SMART test to show some issues.
> Personally, I'm still not confident in that drive.  Can you check 
> cabling?  Another possibility is that there is a cable that has 
> vibrated into a marginal state.  Probably a long shot, but if it's 
> easy to get physical access to the machine, and you can afford the 
> downtime to shut it down, open up the chassis and re-seat the
> drive and cables.
> 
> Every now and then I have PCIe cards that work fine for years,
> then suddenly disappear after a reboot.  I re-seat them and they go
> back to being fine for years.  So I believe vibration does
> sometimes play a role in mysterious problems that creep up from
> time to time.

Alessandro Baggi

2016-Jan-19 14:06 UTC

On 18/01/2016 16:47, Matt Garman wrote:
> That's strange, I expected the SMART test to show some issues.
> Personally, I'm still not confident in that drive.  Can you check
> cabling?  Another possibility is that there is a cable that has
> vibrated into a marginal state.  Probably a long shot, but if it's
> easy to get physical access to the machine, and you can afford the
> downtime to shut it down, open up the chassis and re-seat the drive
> and cables.
>
> Every now and then I have PCIe cards that work fine for years, then
> suddenly disappear after a reboot.  I re-seat them and they go back to
> being fine for years.  So I believe vibration does sometimes play a
> role in mysterious problems that creep up from time to time.

This is a notebook.

Chris Murphy

2016-Jan-19 22:28 UTC

On Mon, Jan 18, 2016, 4:39 AM Alessandro Baggi <alessandro.baggi at gmail.com> wrote:
> On 18/01/2016 12:09, Chris Murphy wrote:
> > What is the result for each drive?
> >
> > smartctl -l scterc <dev>
> >
> SCT Error Recovery Control command not supported

The drive is disqualified unless your use case can tolerate the possibly
very high error recovery time for these drives.

Do a search for Red Hat documentation on the SCSI Command Timer. By default
this is 30 seconds. You'll have to raise this to 120 or maybe even 180
depending on the maximum time the drive attempts to recover. The SCSI
Command Timer is a kernel setting per block device. Basically it's giving
up, and resetting the link to the drive because while the drive is in deep
recovery it doesn't respond to anything.

Chris Murphy
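
A rough sketch of the adjustment described above, not a tested recipe.  On
Linux the per-device SCSI command timer is exposed in sysfs as
/sys/block/<dev>/device/timeout, in seconds.  The helper names and the
SYSFS_ROOT override below are made up for illustration; the override only
exists so the functions can be tried against a scratch directory first.

```shell
#!/bin/sh
# Sketch: read and raise the SCSI command timer for one block device.
# SYSFS_ROOT is a hypothetical override for dry runs; on a real system
# it is left unset and the path resolves under /sys.

scsi_timeout_file() {
    printf '%s/block/%s/device/timeout\n' "${SYSFS_ROOT:-/sys}" "$1"
}

get_scsi_timeout() {
    cat "$(scsi_timeout_file "$1")"
}

# Raise the timer to $2 seconds, but never lower an already larger value.
raise_scsi_timeout() {
    dev=$1 want=$2
    cur=$(get_scsi_timeout "$dev") || return 1
    if [ "$cur" -lt "$want" ]; then
        echo "$want" > "$(scsi_timeout_file "$dev")"
    fi
}

# Usage (as root; sda and 180 are example values):
#   raise_scsi_timeout sda 180
#   get_scsi_timeout sda
```

A value written this way does not survive a reboot, so it would have to be
reapplied at boot, for example from a udev rule or a startup script.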

m.roth at 5-cent.us

2016-Jan-19 22:30 UTC

Chris Murphy wrote:
> On Mon, Jan 18, 2016, 4:39 AM Alessandro Baggi <alessandro.baggi at gmail.com> wrote:
>> On 18/01/2016 12:09, Chris Murphy wrote:
>> > What is the result for each drive?
>> >
>> > smartctl -l scterc <dev>
>> >
>> SCT Error Recovery Control command not supported
>>
> The drive is disqualified unless your use case can tolerate the possibly
> very high error recovery time for these drives.
>
> Do a search for Red Hat documentation on the SCSI Command Timer. By default
> this is 30 seconds. You'll have to raise this to 120 or maybe even 180
> depending on the maximum time the drive attempts to recover. The SCSI
> Command Timer is a kernel setting per block device. Basically it's giving
> up, and resetting the link to the drive because while the drive is in deep
> recovery it doesn't respond to anything.

Replace the drive. Yesterday.

         mark
