Andreas Kossmann
2010-Mar-03 10:04 UTC
[Ocfs2-users] High load average on Apache Cluster with drbd + ocfs2
Hello all,

I have an environment with 2 Debian 5.0 servers. The kernel is 2.6.26-2-amd64. I have installed drbd-8.0.14 and ocfs2-tools 1.4.1. It is an Active/Active web cluster with Apache. The 2 servers write to the same log files.

In my test environment everything works fine. In the production environment I have the problem that after a few weeks the Apache servers go crazy and get a very high load (>100).

First I thought the problem might be drbd, but I have read about many problems with ocfs2 and Apache load average.

The curious thing is that the load is often very high at times when there are very few requests (e.g. 11:00 PM).

I disconnected the second webserver from the network and checked the filesystem. A few bitmap errors occurred and I repaired them. Then I changed the drbd config so that only webserver 1 is primary and webserver 2 is secondary, so webserver 2 cannot write to the device.

After I connected webserver 2 to the network again and the sync from the primary started, the load on webserver 1 went above 100.

I have also tested the connection to webserver 2 with drbd disconnected. I discovered that the load on webserver 1 also goes a little higher.

Is there any solution for the ocfs2 load problem with Apache? If there is no solution I have to change from active/active to active/passive with ext3 as the filesystem.

Please help me.

Thanks a lot,
Andreas
Joel Becker
2010-Mar-03 10:36 UTC
[Ocfs2-users] High load average on Apache Cluster with drbd + ocfs2
On Wed, Mar 03, 2010 at 11:04:48AM +0100, Andreas Kossmann wrote:
> I have an environment with 2 Debian 5.0 servers.
> Kernel is 2.6.26-2-amd64. I have installed drbd-8.0.14 and ocfs2-tools 1.4.1.
> It is an Active/Active WebCluster with Apache.
> The 2 servers write to the same log files.

It's always better if they can write to separate log files, but I'm going to proceed as if that won't change, because we should handle both.

> First I thought the problem may be drbd, but I have read about many problems with ocfs2 and apache load average.

Doesn't mean it's not drbd ;-) Let's see what I can find out.

> The curious thing is that the load is often very high at times when requests are very few (e.g. 11:00 PM)

The problem may or may not be related to apache's volume of requests.

> I've disconnected the second webserver from the network and checked the filesystem. A few bitmap errors occurred and I repaired them. Then I changed the drbd config so that only webserver 1 is primary and webserver 2 is secondary. So webserver 2 cannot write to the device.

Wait, you checked the filesystem while webserver 1 was still running? The errors you found are dirty state from webserver 1. They would not occur if webserver 1 was cleanly unmounted.

> After I connect webserver 2 to the network again and the sync from the primary starts, the load on webserver 1 goes > 100.

What does it look like after the sync is done? Does performance get better?

> Is there any solution for the ocfs2 load problem with apache?

I don't know if this is drbd keeping the volumes in sync or an ocfs2 issue. Are you sure it is the log files that cause the problem? Can you get me debugfs.ocfs2 output for those files? For example, if your filesystem is on /dev/drbd1, is mounted at /web, and your log file is /web/logs/error_log, then we want the output of "debugfs.ocfs2 -R 'stat /logs/error_log' /dev/drbd1". Please do this for each log file and send it along to me.
Joel -- "Can any of you seriously say the Bill of Rights could get through Congress today? It wouldn't even get out of committee." - F. Lee Bailey Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127
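Joel's request above (one "debugfs.ocfs2 -R 'stat ...'" run per log file) can be scripted. What follows is only a rough sketch, not something posted in the thread: the helper names debugfs_stat_command and collect_stats are made up for illustration, and the device, mount point, and log path are simply the example values from Joel's message. It assumes debugfs.ocfs2 is on the PATH and that the script runs with enough privileges to read the device.

```python
import os
import subprocess


def debugfs_stat_command(device, mount_point, file_path):
    """Build the debugfs.ocfs2 command line for one file (hypothetical helper).

    debugfs.ocfs2 expects the path relative to the root of the filesystem,
    not the mounted path, so the mount point prefix is stripped first.
    """
    rel = os.path.relpath(file_path, mount_point)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{file_path} is not below {mount_point}")
    return ["debugfs.ocfs2", "-R", "stat /" + rel, device]


def collect_stats(device, mount_point, log_files):
    """Run the stat command for every log file and return {path: output}.

    On failure the stderr text is stored instead, so nothing is silently lost.
    """
    reports = {}
    for path in log_files:
        cmd = debugfs_stat_command(device, mount_point, path)
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        reports[path] = result.stdout if result.returncode == 0 else result.stderr
    return reports
```

With the values from the example, collect_stats("/dev/drbd1", "/web", ["/web/logs/error_log"]) would run the same command Joel quotes. Paths containing spaces are not handled in this sketch.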
Brad Plant
2010-Mar-03 10:47 UTC
[Ocfs2-users] High load average on Apache Cluster with drbd + ocfs2
Hi Andreas,

I saw almost exactly what you described when using ocfs2 on web servers. Some time late at night, the load would go through the roof on 1 web server because there were lots of apache processes in the uninterruptible "D" state. If I stopped apache on the problem server, the load dropped, but it went back up as soon as I started it again.

Turns out I'd hit a free space fragmentation problem. While df reported I had heaps of free space (>50% from memory!), I couldn't write (echo >>) to the log files on the problem web server. Note that you'll find you can still create small files and append to small files, but not the larger apache log files.

The fact that it happens late at night was very confusing, but eventually made sense. As the day goes on, the log files get bigger, and bigger pieces of contiguous free space are required to extend the file. Eventually, a contiguous piece of free space cannot be found and your writes will start to fail.

A *partial* fix went into 2.6.33. It's partial because it doesn't fix the free space fragmentation issue but rather allows the problem node to steal some free space from the node that is still ok. All it does is prolong the problem a little, such that writes will start to fail on both nodes at the same time.

Another thing you can do that doesn't require a kernel upgrade is to reduce the number of node slots. The default is 8 (-N to mkfs.ocfs2), so reducing this will free up some *contiguous* free space. Unfortunately this is an offline operation.

This may not be your issue, but it certainly sounds familiar. I recall it was very frustrating trying to diagnose the issue.

Cheers,
Brad
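The two symptoms Brad describes (apache processes stuck in the "D" state, and appends to the large log files failing while df still shows free space) can be checked with a small script. This is a hedged sketch and not something from the thread: the function names are hypothetical, the process name "apache2" is an assumption based on Debian's packaging, and the input is assumed to be the output of "ps -eo stat=,pid=,comm=". The append probe writes a single newline, much like Brad's "echo >>" test, so it does modify the file it checks.

```python
import os


def d_state_pids(ps_output, name="apache2"):
    """Return PIDs of processes called `name` in uninterruptible sleep.

    `ps_output` is assumed to be text from `ps -eo stat=,pid=,comm=`;
    the first character of the stat column is the process state.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        stat, pid, comm = fields
        if stat.startswith("D") and comm.strip() == name:
            pids.append(int(pid))
    return pids


def can_append(path, payload=b"\n"):
    """Try to append `payload` to an existing file; return (ok, reason).

    O_CREAT is deliberately not used, so the probe never creates files.
    fsync forces the write out so that an allocation failure is reported
    here rather than later.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    except OSError as exc:
        return False, exc.strerror or str(exc)
    try:
        os.write(fd, payload)
        os.fsync(fd)
        return True, ""
    except OSError as exc:
        return False, exc.strerror or str(exc)
    finally:
        os.close(fd)
```

If many apache PIDs show up in d_state_pids while can_append fails on the large log files but succeeds on a small file on the same volume, that would match the pattern Brad reports. On a volume that is badly wedged the probe itself may block, so it is best run by hand rather than from cron.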