We are using OCFS2 on Red Hat Enterprise Linux to share regular UNIX filesystems between three web servers, and this seems to work reasonably well except that we frequently experience high load situations on our servers (all at the same time).

The underlying SAN is an HP EVA 4100, and the SAN diagnostics show that the SAN itself is coping fairly easily: CPU load on the controllers rarely rises above 5% and usually sits at around 1%. The number of read requests per second is usually around 500-600, although this does peak occasionally at 2000-3000. On one occasion the number of requests per second went over 12,000 with an average of 52MB/s transferred; the SAN coped with this with an average latency of 0.1ms. Write requests rarely go higher than 20-30 per second but have been known to hit 2500 during busy periods.

Because we are web-serving, the result is lots of read requests for small files for the majority of the time. However, there are periods in the day when we update many thousands of images, and it is at times like this, when we are doing a relatively high volume of writes, that the load shoots up. We recently had a situation where the 1 minute load average hit 650. Normally when we hit high load situations, the load seems to be higher on the nodes that aren't writing to the SAN than on the one that is.

Running 'iostat -x' at 5-second intervals frequently shows output similar to that below:

Time: 08:05:48 AM
Device:    rrqm/s  wrqm/s     r/s     w/s  rsec/s  wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda         46.40  255.80  250.40   28.00 2373.60 2267.60    16.67     1.44    5.16   3.24  90.26

Time: 08:05:53 AM
Device:    rrqm/s  wrqm/s     r/s     w/s  rsec/s  wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda         18.47  104.22  221.49  232.93 1902.41 2699.40    10.13     2.68    5.78   2.00  90.82

Time: 08:05:58 AM
Device:    rrqm/s  wrqm/s     r/s     w/s  rsec/s  wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda         28.60   44.20   47.40    2.40  613.60   98.00    14.29     5.40  104.24  20.09 100.04

Time: 08:06:03 AM
Device:    rrqm/s  wrqm/s     r/s     w/s  rsec/s  wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda         38.80   10.20  149.80    3.00 1517.20  373.40    12.37     3.71   25.98   6.34  96.86

This situation goes on for several minutes, or tens of minutes, before things calm down again. The number of reads and writes per second doesn't seem unreasonably high to me, especially considering the underlying SAN performance, but the elevated await and svctm figures make it clear that the device is totally saturated. The 1 minute load averages at these times are typically 60-80, so it seems clear to me that we are running out of steam. Oddly, vmstat doesn't indicate blocked processes or processes waiting on I/O during these times. CPU utilization is never more than about 20% (the servers were considerably over-spec'ed, with 4 quad-core Xeon processors and 8GB RAM in each), and there is no swapping taking place either.

BTW, I have mounted the SAN-based filesystems with noatime and this has helped a little; the mount options are shown below. It occurred to me that the inter-node communication needed to negotiate locks might be slowing things down, but the servers are connected via Gigabit NICs and are all on the same subnet, so the network switches can be ruled out. (I've also included below a command that I understand can dump the DLM lock state, in case that's useful.)

My question is: is OCFS2 (or any clustered filesystem, for that matter) suitable for the demands that we are placing on it? Does anyone have experience of using OCFS2 under what I assume to be 'extreme' conditions? Any advice would be greatly appreciated - this is causing me untold grief due to poor response times from the web servers when this is happening.
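For reference, the SAN-based filesystems are mounted along these lines (the device and mount point names here are illustrative rather than our real ones):

    # Example /etc/fstab entry for one of the shared OCFS2 volumes;
    # _netdev delays mounting until the network (and cluster stack) is up
    /dev/mapper/eva_vol1   /var/www/images   ocfs2   _netdev,noatime   0 0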
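And in case it helps anyone point me in the right direction on the locking theory, this is the sort of command I understand can dump the DLM lock state on a mounted OCFS2 volume (again, the device name is illustrative, and I'm assuming the debugfs.ocfs2 in our version of ocfs2-tools supports fs_locks):

    # debugfs.ocfs2 needs the kernel's debugfs mounted first
    mount -t debugfs debugfs /sys/kernel/debug
    # Dump the filesystem lock state for the shared volume
    echo "fs_locks" | debugfs.ocfs2 /dev/mapper/eva_vol1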
On a few occasions, when the load was particularly high, one or other of the web servers has fenced and rebooted itself.

Regards,

Mick