Valeri Galtsev
2015-Jan-07 16:43 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On Wed, January 7, 2015 10:33 am, Les Mikesell wrote:
> On Wed, Jan 7, 2015 at 9:52 AM, Gordon Messmer <gordon.messmer at gmail.com> wrote:
>>
>> Every regular file's directory entry on your system is a hard link. There's
>> nothing particular about links (files) that make a filesystem fragile.
>
> Agreed, although when there are millions, the fsck fixing it is somewhat
> slow.
>
>>> It is mostly on aging hardware, so it
>>> is possible that there are underlying controller issues. I also see
>>> some rare cases on similar machines where a filesystem will go
>>> read-only with some scsi errors logged, but didn't look for that yet
>>> in this case.
>>
>> It's probably a similar cause in all cases. I don't know how many times
>> I've seen you on this list defending running old / obsolete hardware.
>> Corruption and failure are more or less what I'd expect if your
>> hardware is junk.
>
> Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of
> the line in their day (before the M2/3/4 versions). They have
> Adaptec raid controllers,

I never had Adaptec in _my_ list of good RAID hardware... But certainly I
cannot be the one to offer judgement on hardware I avoid to the best of my
ability. If you can afford it, I would do the test: replace the Adaptec with
something else (in my list it would be 3ware, LSI, or Areca), leaving the
rest of the hardware as it is, and see if the problems continue. I do
realize that there is more to it than just pulling one card and sticking
another in its place (that's why I said if you can "afford" it, meaning in a
more general sense, not just monetary).

Valeri

> SAS drives, mostly configured as RAID1
> mirrors. I realize that hardware isn't perfect and this is not
> happening on a large percentage of them. But I don't see anything
> that looks like scsi errors in this log, and I'm surprised that after
> running apparently error-free there would be problems detected after a
> software reboot.
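[Editor's note: Gordon's point above — that every regular file's directory entry is itself a hard link, so having many links is not inherently fragile — can be demonstrated with a short sketch. This is an illustrative example using only the Python standard library; the file names are made up.]

```python
import os
import tempfile

# Scratch directory so the demo leaves nothing behind.
with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "file.txt")
    with open(original, "w") as f:
        f.write("data")

    # A freshly created file already has one hard link: its directory entry.
    n_initial = os.stat(original).st_nlink

    # A second hard link is just another directory entry pointing at the
    # same inode; both names are equally "the file".
    alias = os.path.join(d, "alias.txt")
    os.link(original, alias)
    n_after_link = os.stat(original).st_nlink
    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino

    # Unlinking one name only decrements the link count; the data survives
    # until the last directory entry is gone.
    os.remove(original)
    n_after_remove = os.stat(alias).st_nlink

print(n_initial, n_after_link, same_inode, n_after_remove)
```

The link count is what fsck has to cross-check against the actual directory entries, which is why a filesystem with millions of links takes a long time to verify even though the structure itself is sound.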
> I think the newer M2 and later models went to a different RAID
> controller, though. Maybe there was a reason.
>
> --
> Les Mikesell
> lesmikesell at gmail.com
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++
Les Mikesell
2015-Jan-07 16:54 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On Wed, Jan 7, 2015 at 10:43 AM, Valeri Galtsev <galtsev at kicp.uchicago.edu> wrote:
>>
>> Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of
>> the line in their day (before the M2/3/4 versions). They have
>> Adaptec raid controllers,
>
> I never had Adaptec in _my_ list of good RAID hardware... But certainly I
> cannot be the one to offer judgement on hardware I avoid to the best of
> my ability. If you can afford it, I would do the test: replace the Adaptec
> with something else (in my list it would be 3ware, LSI, or Areca), leaving
> the rest of the hardware as it is, and see if the problems continue. I do
> realize that there is more to it than just pulling one card and sticking
> another in its place (that's why I said if you can "afford" it, meaning in
> a more general sense, not just monetary).

It's not something that happens repeatably, or that I could judge
better/worse after replacing something. Maybe 3 times a year across a few
hundred machines, and generally not repeating on the same ones. But if
there is anything in common, it is on very 'active' filesystems.

--
Les Mikesell
lesmikesell at gmail.com
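[Editor's note: Les's rate — roughly 3 incidents a year across "a few hundred" machines — works out to about a 1% chance per machine per year, which is why it almost never repeats on the same box. A back-of-envelope sketch; the machine count of 300 is an assumed round number, not stated in the thread.]

```python
import math

# Back-of-envelope: with N machines and R incidents per year spread
# roughly uniformly, the per-machine incident rate is R / N.
machines = 300            # assumed round number for "a few hundred"
incidents_per_year = 3

per_machine_rate = incidents_per_year / machines
print(f"per-machine rate: {per_machine_rate:.3%} per year")

# Modelling incidents as independent (Poisson) events, the chance that
# any one particular box is hit at least once in a given year:
p_hit = 1 - math.exp(-per_machine_rate)
print(f"chance a given box is hit in a year: {p_hit:.2%}")
```

At a rate this low, swapping a controller on one machine and watching for recurrence tells you almost nothing, which is the practical obstacle to Valeri's suggested test.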
Valeri Galtsev
2015-Jan-07 17:10 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On Wed, January 7, 2015 10:54 am, Les Mikesell wrote:
> On Wed, Jan 7, 2015 at 10:43 AM, Valeri Galtsev
> <galtsev at kicp.uchicago.edu> wrote:
>>>
>>> Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of
>>> the line in their day (before the M2/3/4 versions). They have
>>> Adaptec raid controllers,
>>
>> I never had Adaptec in _my_ list of good RAID hardware... But certainly
>> I cannot be the one to offer judgement on hardware I avoid to the best
>> of my ability. If you can afford it, I would do the test: replace the
>> Adaptec with something else (in my list it would be 3ware, LSI, or
>> Areca), leaving the rest of the hardware as it is, and see if the
>> problems continue. I do realize that there is more to it than just
>> pulling one card and sticking another in its place (that's why I said
>> if you can "afford" it, meaning in a more general sense, not just
>> monetary).
>
> It's not something that happens repeatably, or that I could judge
> better/worse after replacing something. Maybe 3 times a year across a
> few hundred machines, and generally not repeating on the same ones. But
> if there is anything in common, it is on very 'active' filesystems.

Too bad... Reminds me of one of my 32-node clusters in which a node
crashed about once a month - always a different node, so on average any
given node ran 32 months before crashing ;-( Too bad for troubleshooting.
Only after 6 months did I pinpoint a particular brand of RAM that had been
mixed into each node - when I got rid of it, the trouble ended... I would
bet on the Adaptec cards in your case... though ideally I shouldn't be
offering judgement on hardware of a brand I almost never use. Good luck!

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++
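[Editor's note: Valeri's 32-node anecdote is the usual pooled-failure arithmetic. If the cluster as a whole crashes once a month and the crash lands on a random node each time, any individual node only crashes about every 32 months — which is why no per-node pattern emerged and the shared component (the RAM brand) was the faster lead. A minimal sketch of the arithmetic:]

```python
# Pooled-failure arithmetic: one crash per month across the whole
# cluster, landing on a (roughly) random node each time.
nodes = 32
cluster_crashes_per_month = 1.0

# Mean time between crashes for any single node, in months: far too
# long for a per-node pattern to show up while troubleshooting.
node_mtbc = nodes / cluster_crashes_per_month
print(f"expected months between crashes on any one node: {node_mtbc:.0f}")
```

The same reasoning applies to Les's fleet: when failures are rare and spread across many machines, correlating on a component common to all of them (here, the Adaptec controllers) is more informative than watching any single box.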