Gordon Messmer
2015-Jan-07 15:52 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On 01/07/2015 05:53 AM, Les Mikesell wrote:
> Yes - the unattended fsck fails.

In that case, there should be logs indicating the cause of the error when it was detected by the kernel. There's probably something wrong with your controller or other hardware.

> Personally, I'd prefer for the default run to use '-y' in the first
> place. It's not like I'm more likely than fsck to know how to fix it
> and it is very inconvenient on remote machines. The recent case was
> an opennms system updating a lot of rrd files, but I've also seen it
> on backuppc archives with lots of files and lots of hard links.

Every regular file's directory entry on your system is a hard link. There's nothing particular about links (files) that makes a filesystem fragile.

> It is mostly on aging hardware, so it is possible that there are
> underlying controller issues. I also see some rare cases on similar
> machines where a filesystem will go read-only with some scsi errors
> logged, but didn't look for that yet in this case.

It's probably a similar cause in all cases. I don't know how many times I've seen you on this list defending running old / obsolete hardware. Corruption and failure are more or less what I'd expect if your hardware is junk.
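To illustrate the hard-link point: a minimal sketch, assuming Python 3 on a POSIX filesystem that supports hard links. The function name link_counts and the file names are made up for the example. It shows that a freshly created file already has a link count of 1, that adding a second name for the same inode raises the count to 2, and that removing the original name leaves the data reachable through the other one.

```python
import os
import tempfile


def link_counts():
    """Create a file, add a second hard link, and report st_nlink at each step."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "first")    # hypothetical file names
        second = os.path.join(d, "second")

        with open(first, "w") as f:
            f.write("data\n")

        # A plain file is already one hard link: its directory entry.
        before = os.stat(first).st_nlink

        # Add a second directory entry pointing at the same inode.
        os.link(first, second)
        after = os.stat(first).st_nlink
        same_inode = os.stat(first).st_ino == os.stat(second).st_ino

        # Removing the original name only drops one link; the data stays.
        os.unlink(first)
        remaining = os.stat(second).st_nlink
        with open(second) as f:
            content = f.read()

    return before, after, same_inode, remaining, content


if __name__ == "__main__":
    print(link_counts())
```

Neither name is special after os.link() returns, which is why an archive with many hard links is, structurally, just a filesystem with many directory entries.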
Les Mikesell
2015-Jan-07 16:33 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On Wed, Jan 7, 2015 at 9:52 AM, Gordon Messmer <gordon.messmer at gmail.com> wrote:
> Every regular file's directory entry on your system is a hard link.
> There's nothing particular about links (files) that makes a
> filesystem fragile.

Agreed, although when there are millions, the fsck fixing it is somewhat slow.

>> It is mostly on aging hardware, so it is possible that there are
>> underlying controller issues. I also see some rare cases on similar
>> machines where a filesystem will go read-only with some scsi errors
>> logged, but didn't look for that yet in this case.
>
> It's probably a similar cause in all cases. I don't know how many
> times I've seen you on this list defending running old / obsolete
> hardware. Corruption and failure are more or less what I'd expect if
> your hardware is junk.

Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of the line in their day (before the M2/3/4 versions). They have Adaptec RAID controllers and SAS drives, mostly configured as RAID1 mirrors. I realize that hardware isn't perfect, and this is not happening on a large percentage of them. But I don't see anything that looks like scsi errors in this log, and I'm surprised that after running apparently error-free there would be problems detected after a software reboot.

I think the newer M2 and later models went to a different RAID controller, though. Maybe there was a reason.

--
Les Mikesell
lesmikesell at gmail.com
Valeri Galtsev
2015-Jan-07 16:43 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On Wed, January 7, 2015 10:33 am, Les Mikesell wrote:
> On Wed, Jan 7, 2015 at 9:52 AM, Gordon Messmer
> <gordon.messmer at gmail.com> wrote:
>> Every regular file's directory entry on your system is a hard link.
>> There's nothing particular about links (files) that makes a
>> filesystem fragile.
>
> Agreed, although when there are millions, the fsck fixing it is
> somewhat slow.
>
>>> It is mostly on aging hardware, so it is possible that there are
>>> underlying controller issues. I also see some rare cases on similar
>>> machines where a filesystem will go read-only with some scsi errors
>>> logged, but didn't look for that yet in this case.
>>
>> It's probably a similar cause in all cases. I don't know how many
>> times I've seen you on this list defending running old / obsolete
>> hardware. Corruption and failure are more or less what I'd expect if
>> your hardware is junk.
>
> Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of
> the line in their day (before the M2/3/4 versions). They have
> Adaptec RAID controllers,

I never had Adaptec in _my_ list of good RAID hardware... But certainly I cannot be the one to offer judgement on hardware I avoid to the best of my ability. If you can afford it, I would do the test: replace the Adaptec with something else (in my list it would be either 3ware or LSI or Areca), leaving the rest of the hardware as it is, and see if the problems continue. I do realize that there is more to it than just pulling one card and sticking another in its place (that's why I said if you can "afford" it, meaning in a more general sense, not just monetary).

Valeri

> SAS drives, mostly configured as RAID1 mirrors. I realize that
> hardware isn't perfect and this is not happening on a large
> percentage of them. But, I don't see anything that looks like scsi
> errors in this log and I'm surprised that after running apparently
> error-free there would be problems detected after a software reboot.
>
> I think the newer M2 and later models went to a different RAID
> controller, though. Maybe there was a reason.
>
> --
> Les Mikesell
> lesmikesell at gmail.com

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++