Keith Keller
2015-Jan-07 06:10 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On 2015-01-07, Gordon Messmer <gordon.messmer at gmail.com> wrote:
>
> Of course, the other possibility is simply that you've formatted your
> own filesystems, and they have a maximum mount count or a check
> interval.

If Les is having to run fsck manually, as he wrote in his OP, then this
is unlikely to be the cause of the issues he described in that post.
There must be some sort of errors on the filesystem that caused the
unattended fsck to exit nonzero.

--keith

--
kkeller at wombat.san-francisco.ca.us
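The two settings Gordon mentions (maximum mount count and check interval) are stored in the ext2/3/4 superblock and are printed by `tune2fs -l`. As a rough sketch only, the Python below picks those fields out of such a listing and reports whether a periodic check is configured at all. The function names are hypothetical, the sample values are made up for illustration, and it assumes the usual `Name:   value` layout of the listing.

```python
def parse_periodic_check_settings(tune2fs_output: str) -> dict:
    """Pick the mount-count and check-interval fields out of `tune2fs -l` text."""
    wanted = {"Mount count", "Maximum mount count", "Check interval"}
    settings = {}
    for line in tune2fs_output.splitlines():
        # Split on the first colon only; values such as dates contain colons too.
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            settings[key.strip()] = value.strip()
    return settings


def periodic_check_enabled(settings: dict) -> bool:
    """True if either a mount-count limit or a time interval is set."""
    # A maximum mount count of -1 or 0 means "never check based on mounts".
    max_mounts = int(settings.get("Maximum mount count", "-1"))
    # The interval is printed in seconds, followed by a human-readable hint.
    interval_seconds = int(settings.get("Check interval", "0").split()[0])
    return max_mounts > 0 or interval_seconds > 0


if __name__ == "__main__":
    # Illustrative listing, not taken from any machine in this thread.
    sample = (
        "Mount count:              12\n"
        "Maximum mount count:      30\n"
        "Check interval:           15552000 (6 months)\n"
    )
    print(periodic_check_enabled(parse_periodic_check_settings(sample)))
```

A periodic check of this kind normally completes without help, which is consistent with Keith's point: a boot that stops and demands a manual fsck suggests real errors rather than a scheduled check.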
Les Mikesell
2015-Jan-07 13:53 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On Wed, Jan 7, 2015 at 12:10 AM, Keith Keller <kkeller at wombat.san-francisco.ca.us> wrote:
> On 2015-01-07, Gordon Messmer <gordon.messmer at gmail.com> wrote:
>>
>> Of course, the other possibility is simply that you've formatted your
>> own filesystems, and they have a maximum mount count or a check
>> interval.
>
> If Les is having to run fsck manually, as he wrote in his OP, then this
> is unlikely to be the cause of the issues he described in that post.
> There must be some sort of errors on the filesystem that caused the
> unattended fsck to exit nonzero.
>

Yes - the unattended fsck fails. Personally, I'd prefer for the
default run to use '-y' in the first place. It's not like I'm more
likely than fsck to know how to fix it and it is very inconvenient on
remote machines. The recent case was an opennms system updating a
lot of rrd files, but I've also seen it on backuppc archives with lots
of files and lots of hard links. Some of these have been on VMware
ESXi hosts where the physical host wasn't rebooted and the
controller/power not involved at all. Eventually these will be
replaced with CentOS7 systems, probably using XFS but I don't know if
that will be better or worse. It is mostly on aging hardware, so it
is possible that there are underlying controller issues. I also see
some rare cases on similar machines where a filesystem will go
read-only with some scsi errors logged, but didn't look for that yet
in this case.

--
Les Mikesell
lesmikesell at gmail.com
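For reference, the "nonzero exit" discussed above is a bitmask: the fsck(8) manual page documents the exit status as the sum of several condition bits. The following is a minimal sketch of decoding it; the function and table names are hypothetical and the descriptions are paraphrased from the manual page.

```python
# Condition bits documented in fsck(8); the exit status is their sum.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status: int) -> list:
    """Return the list of conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [meaning for bit, meaning in FSCK_EXIT_BITS.items() if status & bit]


if __name__ == "__main__":
    # Status 4 is the case where errors remain and a manual run is needed.
    print(decode_fsck_status(4))
```

An unattended run that cannot safely repair everything leaves bit 4 set, which is the situation where the boot sequence typically stops and asks for a manual fsck.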
Gordon Messmer
2015-Jan-07 15:52 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On 01/07/2015 05:53 AM, Les Mikesell wrote:
>
> Yes - the unattended fsck fails.

In that case, there should be logs indicating the cause of the error
when it was detected by the kernel. There's probably something wrong
with your controller or other hardware.

> Personally, I'd prefer for the
> default run to use '-y' in the first place. It's not like I'm more
> likely than fsck to know how to fix it and it is very inconvenient on
> remote machines. The recent case was an opennms system updating a
> lot of rrd files, but I've also seen it on backuppc archives with lots
> of files and lots of hard links.

Every regular file's directory entry on your system is a hard link.
There's nothing particular about links (files) that makes a filesystem
fragile.

> It is mostly on aging hardware, so it
> is possible that there are underlying controller issues. I also see
> some rare cases on similar machines where a filesystem will go
> read-only with some scsi errors logged, but didn't look for that yet
> in this case.

It's probably a similar cause in all cases. I don't know how many
times I've seen you on this list defending running old or obsolete
hardware. Corruption and failure are more or less what I'd expect if
your hardware is junk.
Steve Clark
2015-Jan-07 16:32 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On 01/07/2015 08:53 AM, Les Mikesell wrote:
> On Wed, Jan 7, 2015 at 12:10 AM, Keith Keller
> <kkeller at wombat.san-francisco.ca.us> wrote:
>> On 2015-01-07, Gordon Messmer <gordon.messmer at gmail.com> wrote:
>>> Of course, the other possibility is simply that you've formatted your
>>> own filesystems, and they have a maximum mount count or a check
>>> interval.
>> If Les is having to run fsck manually, as he wrote in his OP, then this
>> is unlikely to be the cause of the issues he described in that post.
>> There must be some sort of errors on the filesystem that caused the
>> unattended fsck to exit nonzero.
>>
> Yes - the unattended fsck fails. Personally, I'd prefer for the
> default run to use '-y' in the first place. It's not like I'm more
> likely than fsck to know how to fix it and it is very inconvenient on
> remote machines. The recent case was an opennms system updating a
> lot of rrd files, but I've also seen it on backuppc archives with lots
> of files and lots of hard links. Some of these have been on VMware
> ESXi hosts where the physical host wasn't rebooted and the
> controller/power not involved at all. Eventually these will be
> replaced with CentOS7 systems, probably using XFS but I don't know if
> that will be better or worse. It is mostly on aging hardware, so it
> is possible that there are underlying controller issues. I also see
> some rare cases on similar machines where a filesystem will go
> read-only with some scsi errors logged, but didn't look for that yet
> in this case.
>

I know that I have seen it take 10 or 15 minutes to sync a 7200 rpm
3 TB WD drive that had over 2 million rrd files being updated by ntopng
when the system had 32GB of RAM. The system is an Intel(R) Core(TM)
i7-3770 CPU @ 3.40GHz, but one CPU will be in a constant I/O wait state
until the sync finishes. I have never tried shutting it down while it
was syncing, though.
--
Stephen Clark
*NetWolves Managed Services, LLC.*
Director of Technology
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark at netwolves.com
http://www.netwolves.com
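A long sync like the one Steve describes is presumably the kernel writing out a large backlog of dirty page cache. On Linux that backlog is visible as the `Dirty` and `Writeback` fields of `/proc/meminfo`, reported in kB. The sketch below reads those two fields so the backlog can be checked before a reboot; the function names are hypothetical, and this only observes the backlog, it does not change flush behaviour or answer the timeout question itself.

```python
def pending_writeback_kib(meminfo_text: str) -> dict:
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        # Exact match, so related fields such as WritebackTmp are ignored.
        if sep and name in ("Dirty", "Writeback"):
            fields[name] = int(rest.split()[0])
    return fields


def read_pending_writeback(path: str = "/proc/meminfo") -> dict:
    """Read the live counters; only meaningful on a Linux system."""
    with open(path) as f:
        return pending_writeback_kib(f.read())


if __name__ == "__main__":
    print(read_pending_writeback())
```

Polling these values while `sync` runs gives a rough view of how much data is still waiting to reach the disk.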
Les Mikesell
2015-Jan-07 17:03 UTC
[CentOS] reboot - is there a timeout on filesystem flush?
On Wed, Jan 7, 2015 at 10:15 AM, <m.roth at 5-cent.us> wrote:
>>>
>> Yes - the unattended fsck fails. Personally, I'd prefer for the
>> default run to use '-y' in the first place. It's not like I'm more
>> likely than fsck to know how to fix it and it is very inconvenient on
>> remote machines. The recent case was an opennms system updating a
> <snip>
>
> In some ways, I prefer the fsck run by reboot to fail - that way, I see
> it, and it most probably tells me that it's time to look at replacing the
> disk.

Seems random to me - not repeating on the same box, and rare enough
that it is hard to make any generalization except that it is painful to
talk some remote helper through the recovery process - usually
involving emailing some cell phone photos of the console to figure out
which partition has the problem.

--
Les Mikesell
lesmikesell at gmail.com