ahlist
2007-Mar-19 21:15 UTC
rebooting more often to stop fsck problems and total disk loss
Hi, I run several hundred servers that are used heavily (webhosting, etc.) all day long. Quite often we'll have a server that either needs a really long fsck (10 hours - 200 gig drive) or an fsck that eventually results in everything going to lost+found (pretty much a total loss). Would rebooting these servers monthly (or some other frequency) stop this? Is it correct to visualize this as small errors compounding over time, so that more frequent reboots would allow quick fscks to fix the errors before they become huge? (OS is redhat 7.3 and el3) Thanks for any input!
Andreas Dilger
2007-Mar-19 21:27 UTC
rebooting more often to stop fsck problems and total disk loss
On Mar 19, 2007 17:15 -0400, ahlist wrote:
> Quite often we'll have a server that either needs a really long fsck
> (10 hours - 200 gig drive) or an fsck that eventually results in
> everything going to lost+found (pretty much a total loss).

Strange. We get 1TB/hr fscks these days unless the filesystem is completely corrupted and has a lot of duplicate blocks.

> Would rebooting these servers monthly (or some other frequency) stop this?

What is also important is that when you do an fsck you run it with "-f" to actually check the filesystem instead of just the superblock. e2fsck will only do a full check if the kernel detected disk corruption, OR if the "last checked" time is > 6 months ago, or X mounts (where 20 < X < 40) have happened since the last check time. See tune2fs(8) for details.

> Is it correct to visualize this as small errors compounding over time
> thus more frequent reboots would allow quick fscks to fix the errors
> before they become huge?

That is definitely true. If the bitmaps get corrupted, then this will spread corruption throughout the filesystem.

> (OS is redhat 7.3 and el3)

I would instead suggest updating to a newer kernel (e.g. RHEL4 2.6.9) as this has fixed a LOT of bugs in ext3. Also, make sure you are using the newest e2fsck available, as some bugs have been fixed there also.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
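The conditions listed above for when a boot-time e2fsck does a full check (filesystem not clean, mount count reached, or check interval elapsed) can be sketched roughly as follows. This is a simplified model for illustration, not e2fsck's actual code. The sample text imitates the kind of output `tune2fs -l` prints, with made-up values, and the field labels and the helper names `parse_fields` and `full_check_due` are assumptions of this sketch.

```python
from datetime import datetime, timedelta

# Illustrative excerpt in the style of `tune2fs -l <device>` output.
# The values are invented and the field labels are assumed for this sketch.
SAMPLE = """\
Filesystem state:         clean
Mount count:              12
Maximum mount count:      37
Last checked:             Mon Sep 18 21:15:00 2006
Check interval:           15552000 (6 months)
"""


def parse_fields(text):
    """Split 'Label:   value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def full_check_due(fields, now):
    """Return the reasons (possibly none) a full check would be expected."""
    reasons = []
    if fields["Filesystem state"] != "clean":
        reasons.append("filesystem not marked clean")

    # A maximum of 0 or -1 is treated here as "mount-count checking disabled".
    max_mounts = int(fields["Maximum mount count"])
    if max_mounts > 0 and int(fields["Mount count"]) >= max_mounts:
        reasons.append("mount count reached maximum")

    # The interval is given in seconds; 0 is treated as "disabled".
    interval = int(fields["Check interval"].split()[0])
    last = datetime.strptime(fields["Last checked"], "%a %b %d %H:%M:%S %Y")
    if interval > 0 and now - last >= timedelta(seconds=interval):
        reasons.append("check interval elapsed")
    return reasons


if __name__ == "__main__":
    print(full_check_due(parse_fields(SAMPLE), datetime(2007, 3, 19, 21, 15)))
```

If none of these conditions holds, a reboot alone would not be expected to trigger a full check, which is why running e2fsck with "-f" on an unmounted filesystem matters. The thresholds themselves can be changed with the `-c` and `-i` options described in tune2fs(8).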