Liebster, Daniel
2003-May-22 17:14 UTC
Tuning system response degradation under heavy ext3/2 activity.
Hello. I'm looking for assistance or pointers for the following problem.

OS: RHAS2.1 enterprise16
HW: HP ProLiant, 2 CPUs, 6GB RAM, internal RAID1 + RAID5 (4 x 10K 72GB)

When we run any kind of process (especially tar, for some reason) that creates heavy disk activity, the machine becomes very slow (e.g. it takes 30-45 seconds to get a reply from ls at the console, or a minute to log in).

I experimented with data=journal, and that made the box more responsive to other new processes, but naturally the I/O rate fell through the floor. I also experimented with elvtune and got much improved results on our tests (mainly scripts of dd, cp, and diff), but then apps like tar suffered (elvtune -r 32 -w 32 gave the best numbers on the test scripts).

As best we can tell, the processing subsystem of the server is way too fast for this relatively slow RAID card. During the tests, I/O %util is always 100%, and the await and svc times hit 6 digits.

Is there anything else we can tune (such as scheduler behavior, etc.) to mitigate this?

Thanks,
Dan
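P.S. For reference, those figures come from iostat's extended output, gathered roughly as below. This assumes sysstat's iostat, and the device name is only illustrative (on an HP Smart Array the volumes typically show up under /dev/cciss/, so substitute whatever your RAID5 volume actually is):

    iostat -x 5                           # watch %util, await, avgqu-sz and svctm per device
    elvtune /dev/cciss/c0d1               # show the current elevator read/write latency settings
    elvtune -r 32 -w 32 /dev/cciss/c0d1   # the setting that gave the best numbers on our test scripts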
Liebster, Daniel
2003-May-22 18:35 UTC
RE: Tuning system response degradation under heavy ext3/2 activity.
Followup correction... The average wait time in iostat is 6 figures, the average queue is 5 figures, and the svc time is 2-3 figures.
Mike Fedyk
2003-May-22 22:52 UTC
Re: Tuning system response degradation under heavy ext3/2 activity.
On Thu, May 22, 2003 at 01:14:22PM -0400, Liebster, Daniel wrote:
> Is there anything else we can tune (such as scheduler behavior, etc) to
> mitigate this?

You can try a larger journal (with data=journal), change the default write timer from 5 to 30 seconds (like ext2), use RAID10 instead of RAID5 (though you'll have less disk space available), try the Orlov allocator patch, or try ReiserFS, which takes more measures to keep data close together on disk. Googling each of these will turn up more details.
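For the first two, a rough sketch only (the device name /dev/cciss/c0d1p1 and the /data mount point are placeholders for your RAID5 volume, and the journal resize needs a reasonably recent e2fsprogs):

    umount /data
    tune2fs -O ^has_journal /dev/cciss/c0d1p1   # remove the existing journal (fs must be clean)
    tune2fs -J size=128 /dev/cciss/c0d1p1       # recreate it at roughly 128MB
    mount -t ext3 -o data=journal,commit=30 /dev/cciss/c0d1p1 /data

commit=30 stretches ext3's journal flush interval from the default 5 seconds to 30, which is the "write timer" change mentioned above.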
Andrew Morton
2003-May-23 21:43 UTC
Re: Tuning system response degradation under heavy ext3/2 activity.
"Liebster, Daniel" <Daniel.Liebster@adeccona.com> wrote:> > When we run any kind of process (especially tar for some reason) that > creates heavy disk activity the machine becomes Very Slow, (e.g. takes > 30-45 seconds to get a reply from ls at the console, or a minute to log > in.)Linux has traditionally been very bad at handling concurrent disk read and write traffic - it allows the write stream to starve the reads. Fixes for this are still under development in 2.5, alas. I'd expect that you would see much better behaviour from Andrea Arcangeli's kernels - he has done quite a bit of work on this problem and when I did some not-extensive testing on it a couple of months ago it behaved well. The stalls which you are seeing do appear to be unusually large and obtrusive. There may be other factors at play here, such as (perhaps) over-aggressive queueing in the controllers or really poor bandwidth. What writeout rate do you actually see? time (dd if=/dev/zero of=foo bs=1M count=512 ; sync) ?