Hi all,
We have a server with a 580GB ext3 file system on it. Until recently
we ran around 15 virtual servers from this file system. It was fine
for at least a few months, then the file system started to become
inaccessible periodically, with the hangs getting more frequent as
time went on. Eventually we wouldn't even get through a 15-hour
period without having to reboot the server.
When I/O got blocked, all processes accessing files under
/var/lib/vservers (its mount point) would get stuck waiting for I/O
to complete ("D" state), and I couldn't find any way to revive the
file system apart from rebooting the server. I tried sending various
signals (TERM and KILL) to some of the kernel threads, but that
didn't help at all. The "kjournald" process also got stuck in the
"D" state.
The server is running kernel 2.6.22.19 with the Linux-VServer patch
vs2.2.0.7, DRBD 8.2.6, and the Areca RAID driver updated to
1.20.0X.15-80603, which was the latest available from Areca at the
time. The OS is Debian etch.
As part of troubleshooting the problem I took DRBD out of the mix,
tried updating the RAID driver in the kernel, replaced the RAID card
with another one with slightly later firmware (and replaced the
power supply with a known-good one at the same time), and disabled
the swap space. None of that helped.
What did help was copying the files from the existing file system to
a newly formatted ext3 file system. The new file system is only
around 320GB, but is otherwise set up the same as the existing one
(both are hardware RAID-6, on the same host, same controller, same
physical disks, etc.).
When the file system became inaccessible there were no notices from
the kernel about any issue at all. We have a serial console on this
server and nothing was captured on it when this happened, nor was
there anything in the system logs (which should have remained
writable the whole time, as they are not on the broken file system).
I used 'dd' to check whether I could read from the underlying device
files that the file system was on (/dev/sdc1 and /dev/drbd1), and
there was no problem doing that. I didn't test writes to these
devices, since I don't know of any safe way to do so, but using the
SysRq feature an emergency sync would not complete, nor would an
emergency umount, so I assume writes were out of the question. Doing
an 'ls' on /var/lib/vservers just left me with yet another process
stuck in the "D" state.
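For reference, the read test and the SysRq attempts were roughly
along these lines (block size and count picked arbitrarily):

  # read-only test of the underlying devices
  dd if=/dev/sdc1 of=/dev/null bs=1M count=4096
  dd if=/dev/drbd1 of=/dev/null bs=1M count=4096

  # emergency sync ('s') and emergency remount read-only ('u')
  echo s > /proc/sysrq-trigger
  echo u > /proc/sysrq-trigger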
A forced fsck of the file system (using a fresh build of e2fsprogs
1.41.3 with the matching libraries) gave no hint of any problems.
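For the record, that check was of this form (using the e2fsck built
from the 1.41.3 source, with the file system unmounted):

  # forced check even though the file system is marked clean
  e2fsck -f /dev/sdc1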
The root file system is ext3 as well, and there were no problems
reading from or writing to it while the file system on
/var/lib/vservers was inaccessible. The root file system is also on
the same RAID card, physical disks, etc.
One reason I've not moved to a newer kernel yet is that there isn't
a stable Linux-VServer patch for anything newer than 2.6.22.19, so
I'm stuck with that kernel until there is. I made a start on
backporting the ext3 code from 2.6.26.5 to 2.6.22.19, but it's not
something I trust myself to get right, so I'd rather avoid that
approach unless it turns out to be the only option.
So my questions are:
Are there any further diagnostics I can perform on the old file system
to try and track down the problem? If so, what are they?
Is this a known bug/problem with ext3 or something related to it?
Is it likely that one of the three or so deadlocks that have been
fixed in kernels since 2.6.22.19 would have cured this problem, or
would those deadlocks have taken down the whole box rather than just
affecting the one file system?
Or could it even be this bug:
http://bugzilla.kernel.org/show_bug.cgi?id=10882 (the softlockup
part)? I think not, though, because I was able to copy everything
off that file system onto a new one without any lockups or other
complaints from the kernel.
Thanks.
--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support at obsidian.com.au