thr3ads.net - CentOS - [CentOS] reboot - is there a timeout on filesystem flush? [Jan 2015]


Gary Greene

2015-Jan-07 19:37 UTC

[CentOS] reboot - is there a timeout on filesystem flush?

> On Jan 6, 2015, at 9:23 PM, Gordon Messmer <gordon.messmer at gmail.com> wrote:
> 
> On 01/06/2015 04:37 PM, Gary Greene wrote:
>> This has been discussed to death on various lists, including the
>> LKML...
>> 
>> Almost every controller and drive out there now lies about what is
>> and isn't flushed to disk, making it nigh on impossible for the
>> Kernel to reliably know 100% of the time that the data HAS been
>> flushed to disk. This is part of the reason why it is always a Good
>> Idea(tm) to have some sort of pause in the shut down to ensure that it
>> IS flushed.
> 
> That's pretty much entirely irrelevant to the original question.
> 
> (Feel free to correct me if I'm wrong in the following)
> 
> A filesystem has three states: Clean, Dirty, and Dirty with errors.
> 
> When a filesystem is unmounted, the cache is flushed and it is marked clean
> last.  This is the expected state when a filesystem is mounted.
> 
> Once a filesystem is mounted read/write, then it is marked dirty.  If a
> filesystem is dirty when it is mounted, then it wasn't unmounted properly.
> In the case of a journaled filesystem, typically the journal will be replayed
> and the filesystem will then be mounted.
> 
> The last case, dirty with errors, indicates that the kernel found invalid
> data while the filesystem was mounted, and recorded that fact in the filesystem
> metadata.  This will normally be the only condition that will force an fsck on
> boot.  It will also normally result in logs being generated when the errors are
> encountered.  If your filesystems are force-checked on boot, then the logs
> should usually tell you why.  It's not a matter of a timeout or some device
> not flushing its cache.
> 
> Of course, the other possibility is simply that you've formatted your
> own filesystems, and they have a maximum mount count or a check interval.  Use
> 'tune2fs -l' to check those two values.  If either of them is set, then
> there is no problem with your system.  It is behaving as designed, and forcing a
> periodic check because that is the default behavior.
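
For anyone who wants to script the 'tune2fs -l' check above, here is a minimal sketch in Python. It only parses text that has already been captured from 'tune2fs -l'. The function names and the SAMPLE text are made up for illustration (SAMPLE is hand-written, not output from a real system), and the sketch assumes the usual "Maximum mount count:" and "Check interval:" field labels.

```python
# Hand-written illustrative excerpt, NOT captured from a real system.
SAMPLE = """
Filesystem state:         clean
Mount count:              12
Maximum mount count:      30
Check interval:           15552000 (6 months)
"""

def parse_tune2fs(text):
    """Split 'Label: value' lines from `tune2fs -l` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon (banner, blank lines)
            fields[key.strip()] = value.strip()
    return fields

def periodic_check_enabled(fields):
    """True if a mount-count or time-based forced check appears to be set."""
    # A maximum mount count of 0 or -1 means mount counting is disregarded.
    max_mounts = int(fields.get("Maximum mount count", "-1"))
    # The interval is printed in seconds, followed by a human-readable note.
    interval = int(fields.get("Check interval", "0").split()[0])
    return max_mounts > 0 or interval > 0

if __name__ == "__main__":
    info = parse_tune2fs(SAMPLE)
    print("state:", info.get("Filesystem state"))
    print("periodic check enabled:", periodic_check_enabled(info))
```

If this reports that a periodic check is enabled, an fsck at boot is most likely the scheduled one rather than a sign of a failed flush.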
Problem is, Gordon, the layer I'm talking about is _below_ the logical layer
that filesystems live at, in the block layer, at the mercy of drivers and
firmware that the kernel has zero control over. While in a perfect world the
controller would do strictly only what the Kernel tells it, that just hasn't
been true for a while, given the large caches that drives and controllers now have.

In most cases this should never trigger; however, with some buggy drivers, or
controllers that have buggy firmware, writes to disk can be seriously delayed,
and the data may never make it to the platter.
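
To make the layering concrete, here is a small Python sketch of the most an application can do from its side. The helper name write_durably is hypothetical. Note the limitation discussed above: os.fsync() only asks the kernel to push the data to the device, and a controller or drive that misreports its cache state can still hold the data in its own cache afterwards.

```python
import os

def write_durably(path, data):
    """Write bytes to path and ask the kernel to flush them to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to write the file to the device
    # Whether the device then commits the data to stable storage is up to the
    # driver and firmware; nothing at this level can verify that.
```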

--
Gary L. Greene, Jr.
Sr. Systems Administrator
IT Operations
Minerva Networks, Inc.
Cell: +1 (650) 704-6633

Les Mikesell

2015-Jan-07 19:47 UTC

[CentOS] reboot - is there a timeout on filesystem flush?

On Wed, Jan 7, 2015 at 1:37 PM, Gary Greene <ggreene at minervanetworks.com> wrote:
>
> Problem is, Gordon, the layer I'm talking about is _below_ the logical
> layer that filesystems live at, in the block layer, at the mercy of drivers and
> firmware that the kernel has zero control over. While in a perfect world the
> controller would do strictly only what the Kernel tells it, that just hasn't
> been true for a while, given the large caches that drives and controllers now have.
>
> In most cases this should never trigger; however, with some buggy drivers, or
> controllers that have buggy firmware, writes to disk can be seriously delayed,
> and the data may never make it to the platter.
>
I'd have to shut one down and get into the BIOS config to see, but I
think these default to write-through if they aren't battery backed -
caching may not even be an option.  This one might have a battery
going bad, though.

I see a bunch of entries like:
ioatdma 0000:00:08.0: Channel halted, chanerr = 2
ioatdma 0000:00:08.0: Channel halted, chanerr = 0
in the logs and one of these:
hrtimer: interrupt took 258633 ns

Not sure what those mean.  We do have considerably more systems
running Windows than Linux on this hardware and I don't think anyone
has noticed a systemic problem there.

-- 
   Les Mikesell
     lesmikesell at gmail.com

Charles Polisher

2015-Jan-19 22:53 UTC

[CentOS] reboot - is there a timeout on filesystem flush?

On Jan 07, 2015 at 01:47:53PM -0600, Les Mikesell wrote:
>
> I see a bunch of entries like:
> ioatdma 0000:00:08.0: Channel halted, chanerr = 2
> ioatdma 0000:00:08.0: Channel halted, chanerr = 0
> in the logs and one of these:
> hrtimer: interrupt took 258633 ns
> 
> Not sure what those mean.   We do have considerably more systems
> running windows than linux on this hardware and I don't think anyone
> has noticed a systemic problem there.
Was this resolved? The ioatdma messages are from ioat_dma.c, a
driver for Intel's I/OAT DMA engine typically used on high-end
server hardware to accelerate network I/O. chanerr = 2 might be
an issue with the DMA channel being in a suspended state when
the driver isn't expecting it to be. Maybe a network driver bug.
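
If it helps to see how often each chanerr value shows up, here is a rough Python sketch that tallies the "Channel halted" lines quoted above. The function name and the regular expression are my own, written against the message format shown in this thread only. It counts occurrences and makes no attempt to interpret what a given chanerr value means.

```python
import re
from collections import Counter

# Matches lines such as:
#   ioatdma 0000:00:08.0: Channel halted, chanerr = 2
HALTED = re.compile(r"ioatdma (\S+): Channel halted, chanerr = (\w+)")

def count_chanerr(lines):
    """Count 'Channel halted' messages per (PCI device, chanerr value)."""
    counts = Counter()
    for line in lines:
        match = HALTED.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

if __name__ == "__main__":
    log = [
        "ioatdma 0000:00:08.0: Channel halted, chanerr = 2",
        "ioatdma 0000:00:08.0: Channel halted, chanerr = 0",
        "hrtimer: interrupt took 258633 ns",
    ]
    for (device, chanerr), n in sorted(count_chanerr(log).items()):
        print(device, "chanerr", chanerr, "seen", n, "time(s)")
```

Feeding it the relevant log file line by line would show whether the halts cluster on one device or one error value.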
