I have been running a 2-node cluster of web servers handling around 800,000 - 1,200,000 requests on a typical weekday. The systems are each dual quad-core Xeon w/4GB RAM, and generally maintain a fairly low load average. The servers share a device via iSCSI SAN for the OCFS2 volume. They are running RHEL5 U2 (2.6.18-92.1.10.el5) and OCFS2 1.2.9-1 binaries of the matching kernel version.

The cluster has been in a production environment for about 1.5 months now, and just in the past week it has started to have problems. The user experience is an occasional lag of 5 to 15 seconds, after which everything appears normal. Digging deeper into the problem, I narrowed it to an I/O issue: iostat shows near 100% utilization on said device during the lag. Once it clears, the utilization is back down to a consistent 0-5% average. Also, while the lag is happening, a process listing shows the affected processes in the D state.

After some reading of the mailing list archives and much google searching, I found this old thread which seems quite similar:
http://oss.oracle.com/pipermail/ocfs2-users/2007-May/001643.html

And the corresponding bug from the same author:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=882

I see it has been a long time since this bug was last updated, but there seems to have been no [documented] resolution.

I wrote a small script which hits a very simple web page at 10-second intervals. If the server responds in less than 1 second it outputs a dot; if the page takes more than 1 second to load it outputs a line indicating how long it took, which node processed it, and a timestamp. Here is some sample output over the course of about an hour:

................
Real time of 1 for wolf at Thu Jan 22 16:35:32 CST 2009
............................................................
Real time of 6 for cow at Thu Jan 22 16:45:52 CST 2009
.....
Real time of 7 for cow at Thu Jan 22 16:46:59 CST 2009
.................................................................................
Real time of 12 for wolf at Thu Jan 22 17:00:54 CST 2009
...................................
Real time of 3 for wolf at Thu Jan 22 17:06:58 CST 2009
....................
Real time of 13 for cow at Thu Jan 22 17:10:42 CST 2009
...
Real time of 11 for cow at Thu Jan 22 17:11:33 CST 2009
.......................
Real time of 9 for cow at Thu Jan 22 17:15:43 CST 2009
.......................................
Real time of 2 for wolf at Thu Jan 22 17:22:27 CST 2009
..
Real time of 8 for wolf at Thu Jan 22 17:23:05 CST 2009
.............................
Real time of 10 for wolf at Thu Jan 22 17:28:17 CST 2009
....................................
Real time of 14 for wolf at Thu Jan 22 17:34:42 CST 2009

There is no real pattern here, but the output is interesting nonetheless -- it shows the problem is intermittent, but quite frequent.

I tried using "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" to see which processes were hanging on each system during the lags. Like Stephan's output, I see a few processes in the D (uninterruptible sleep) state, which makes sense if they are performing I/O. And also like Stephan, one node shows "-" as the wait channel, and the other shows "ocfs2_wait_for_status_completion" as the function.

node cow:
PID    STAT  COMMAND   WIDE-WCHAN-COLUMN
8511   D     cronolog  ocfs2_wait_for_status_completion
8510   D     cronolog  ocfs2_wait_for_status_completion

node wolf:
PID    STAT  COMMAND   WIDE-WCHAN-COLUMN
9439   D     cronolog  -

In my case, however, it was not the httpd process that was listed, but rather cronolog (which is what I use with apache for automatic daily log rotation). I assume that in his case httpd was writing directly to the log files at the time of the hangs. Also, in my case there was typically only a single process on each system in the D state, only sometimes more.

Another thing I tried was to shut down httpd on one of the nodes to see if the lags went away -- the result was unchanged, still similar lags at similar intervals. This time, the "ps" output on the non-participating node was clean, but the other had multiple processes, and not all web server related:

node wolf:
PID    STAT  COMMAND   WIDE-WCHAN-COLUMN
9436   D     cronolog  -
9438   D     cronolog  -
9439   D     cronolog  -
9448   D     cronolog  -
9457   D     cronolog  -
31201  D     httpd     -
31202  D     httpd     -
31710  D     vsftpd    sync_buffer

So apparently this problem can happen between processes on the same node, not just between nodes.

Watching "top" on each box during the lags only shows CPU time in iowait, no significant user/system CPU.
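For reference, a minimal sketch of the kind of probe script described above (the URL is a placeholder; assumes bash, curl, and that the page itself reports which node served it):

  #!/bin/bash
  # Hit a simple page every 10 seconds; print a dot on fast responses,
  # otherwise report the elapsed seconds, the serving node, and a timestamp.
  URL="http://example.com/probe.html"   # placeholder
  while true; do
      start=$(date +%s)
      node=$(curl -s "$URL")            # assumes the page returns the node name
      end=$(date +%s)
      elapsed=$((end - start))
      if [ "$elapsed" -lt 1 ]; then
          printf '.'
      else
          printf '\nReal time of %s for %s at %s\n' "$elapsed" "$node" "$(date)"
      fi
      sleep 10
  done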
The two issues are different. For starters, the issue in bugzilla #882 was a hang, not a temporary slowdown. And the stack showed it was related to flock(), which ocfs2 1.2 did not support -- well, not cluster-aware flock(). Support for clustered flock() was added around kernel 2.6.25 / ocfs2 1.4.

Are the apache logs also hitting the iSCSI storage? If so, one explanation could be that the log flush is saturating the network. That would cause the I/O utilization to jump, affecting the httpd I/Os.

Sunil

David Johle wrote:
> I have been running a 2 node cluster of web servers handling around
> 800,000 - 1,200,000 requests on a typical weekday.
> [...]
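A quick way to confirm whether the Apache logs land on the iSCSI-backed volume, and to watch that device during a lag window (log paths and device name below are assumptions):

  # Which filesystem do the logs actually live on?
  # (adjust paths to match your CustomLog/ErrorLog settings)
  df -h /var/log/httpd /var/www

  # Watch extended device statistics during a lag; look at the %util
  # column for the iSCSI/multipath device reported by iostat
  iostat -x 5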
I want to throw in my experience here: you WILL have issues if you have reasonably high load with multiple nodes performing writes to log files. It will hang all processes waiting on the write lock. In the case of apache, especially when using prefork, this is a VERY BAD idea. It is highly, highly recommended to move your log files outside of the cluster FS.

We run ocfs2 in production: hundreds of gigs of files, with 5 nodes per cluster, 4 clusters, and sustained throughput of 200 Mbit/s being served by these nodes. All this is on a fibre channel SAN with very fast arrays. Even so, when we stumble across the occasional log file contention and remove it (say, a php file writing to a log file, or apache, etc.), load drops considerably. When we had apache writing log files to the cluster FS, at a fraction of the current load, apache choked and died; it couldn't handle the contention. And the contention escalates in a non-linear fashion.

My recommendations:

1. Move all logs, and any files which are written to concurrently, off to the local system (yes, it makes administration harder, but it makes the systems happier; it was a MUST for our environment, and we've managed to push OCFS2 scaling a lot farther by doing so)
2. Switch away from prefork if you are using it
3. Make absolutely sure you're mounted with noatime
4. If possible, switch away from apache and use lighttpd (we recently did this, and were able to gain another 20-30% out of each box)

Relevant lighttpd settings:

server.max-fds = 2048
server.max-write-idle = 120
server.max-worker = 8
server.network-backend = "linux-sendfile"
server.stat-cache-engine = "simple"
server.max-keep-alive-requests = 10
server.max-keep-alive-idle = 8
server.event-handler = "linux-sysepoll"

Again, we've been using OCFS2 in production for 2 years, through many problems, and I highly recommend moving the log files to local storage, OR piping them to a network-capable syslog (also a possible solution).

Michael
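If local log files really aren't an option, a rough sketch of the syslog-piping alternative mentioned above (facility, tag, and paths are arbitrary choices, not a recommendation of specific values):

  # httpd.conf -- send access and error logs through logger to syslog
  # instead of writing files on the OCFS2 volume
  CustomLog "|/usr/bin/logger -t httpd -p local6.info" combined
  ErrorLog  "|/usr/bin/logger -t httpd -p local6.err"

syslog can then forward to a central loghost, keeping all log writes off the shared volume entirely.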
Bret,

Also, some Intel adapters had a problem that requires a firmware fix (performed via ethtool) using fixeep-82573-dspd.sh.

(I still doubt that is the problem in this case, but it's almost always worth checking whether you have an Intel NIC affected by that bug.)

Michael
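A hedged sketch of how one might identify whether a given interface is one of the 82573-based adapters before running the fix script (the interface name is an assumption; the fixeep script itself does the authoritative check and repair):

  # Identify the NIC model and driver
  lspci | grep -i ethernet
  ethtool -i eth0

  # Dump the EEPROM for inspection
  ethtool -e eth0 | head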
Well, I can confirm that in my case the Intel NICs are not affected by this bug. I checked the EEPROM dump, as well as used the script to verify, and neither indicates a problem.

On these particular servers, the iSCSI target is being accessed over 2 NICs: one is a Broadcom onboard, and one is an Intel PCIe card. I'm using device-mapper for managing the active/active multipathing. All in all I've had great performance with this setup, actually seeing very little overhead despite using a software initiator.

At 05:01 PM 1/27/2009, ocfs2-users-request at oss.oracle.com wrote:
> Also, some Intel adapters had a problem that requires a firmware fix
> (performed via ethtool) using fixeep-82573-dspd.sh
> [...]
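For anyone setting up something similar, a minimal sketch of an active/active dm-multipath configuration of this sort (the values below are illustrative assumptions, not the actual config used here):

  # /etc/multipath.conf -- illustrative only
  defaults {
      user_friendly_names  yes
      path_grouping_policy multibus    # all paths in one group => active/active
      path_checker         tur
      failback             immediate
  }
  # verify the resulting topology with: multipath -ll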
Sunil,

Hmm, I thought you meant the cluster size.

Regards,
Luis

--- On Wed, 1/28/09, Sunil Mushran <sunil.mushran at oracle.com> wrote:

From: Sunil Mushran <sunil.mushran at oracle.com>
Subject: Re: [Ocfs2-users] ocfs2 hangs during webserver usage
To: lfreitas34 at yahoo.com
Cc: ocfs2-users at oss.oracle.com
Date: Wednesday, January 28, 2009, 1:08 AM

Luis Freitas wrote:
> The extent size would play a role in this also, as Sunil pointed out. You
> could check if the extent size is different on your test environment. The
> mkfs tool might have defaulted to a larger extent size if the total size
> of the filesystem is larger. (Sunil, correct me on this if I am wrong.)

The extent size I was referring to is not related to any on-disk size. It depends on the location the write() is issued to. Say you have a 100K file and you issue a write() at offset 1G. If the fs supports sparse files, it will allocate (and init) blocks only around the 1G location, leaving a hole in between. A fs without sparse file support, on the other hand, will need to allocate (and init) the entire space between 100K and 1G.
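A quick way to see that difference on a sparse-file-capable filesystem (GNU dd assumed; the filename is arbitrary):

  # Write ~100K of data, then one byte at a 1G offset.
  dd if=/dev/zero of=sparse_test bs=1K count=100
  dd if=/dev/zero of=sparse_test bs=1 count=1 seek=1G conv=notrunc

  # Compare apparent size vs. blocks actually allocated:
  ls -lh sparse_test   # apparent size: just over 1G
  du -h  sparse_test   # allocated size: far smaller if the fs supports holes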
It's also worth pointing out that even writing to separate log files on the ocfs2 volume isn't ideal. For one, with iSCSI you're probably using only gigabit Ethernet (you could be using 10GigE, but you didn't mention it). Since performance is a concern, I would move what I could off of the ocfs2 volume (including, where possible, the shared codebase) to local storage, and use symlinks where needed. For logging, again, I highly recommend using local storage only.

As I said, we've been using this in production for many webservers for a long time (over 2 years), and we've learned a lot of lessons over those years:

1. Log to local disk only; we've had MANY problems with write contention on log files (this is my number one recommendation, and moving those log files will probably eliminate your problem).
2. Move common code/files to local storage where possible (for instance, we have vBulletin, with the attachments etc. on the ocfs2 volume, but the php files which are common are on local storage).

Michael
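As a rough illustration of that layout (all paths below are assumptions):

  # Code and logs live on local disk; only genuinely shared data
  # (e.g. uploaded attachments) stays on the OCFS2 volume.
  mkdir -p /var/www/site /var/log/httpd
  ln -s /cluster/ocfs2/attachments /var/www/site/attachments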
> I will try changing the logging to use separate files for a day or
> so to at least see what effect it has with regards to this issue... stay tuned.

Okay, that was easy to test, and it failed -- still plenty of lag situations.

As expected, the results agree with another test I did, which was to shut down one of the two nodes to the point of unmounting the shared filesystem. I believe the problem here is that the child processes of Apache locally sharing the logfile is just as bad as sharing it across nodes. Perhaps that is the root cause of the problem, and the fact that additional nodes are affected is just a consequence. Michael Moody suggested not using the prefork MPM, presumably for this reason.

I wonder, does the worker MPM "consolidate" all the writes from the threads within a worker into a single buffered write stream? If so, I can see how having 32 threads per worker would not reduce the amount of writes to the logfile, but could reduce the number of contending I/O requests by a factor of 32. Is this the case?

Way back when we first transitioned to Apache 2.x we looked into using a better MPM, but had some necessary modules installed that were not thread-safe and thus would only work with prefork. I am pretty sure this is still a constraint for us at this point.
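For what it's worth, a quick way to confirm which MPM and which modules the running Apache build actually uses (the binary may be httpd or apache2 depending on the distro):

  # MPM the binary was built with, plus compiled-in modules
  httpd -V | grep -i mpm
  httpd -l

  # Dynamically loaded modules as well (Apache 2.2+)
  httpd -M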
David,

Again, I recommend local storage for the logs. I know I've said it many times, but I've run into these SAME issues, and local was the way to go. For log parsing across nodes, or for tailing them all at the same time, we use Capistrano (a ruby-based project), which easily lets me tail log files across multiple servers (or perform any operation, for that matter) all in one window. Really though, try local storage for your logs.

In regards to the MPM: prefork has benefits, and also severe disadvantages. I'm not certain that the worker MPM buffers the logs, and of course you'll want multiple worker processes (and threads), but it certainly is an improvement over prefork. However, if you have non-threadsafe modules (just curious, which ones?), as you stated, you may have trouble. This is one of the reasons that, where we can, we switched to lighttpd.

In regards to non-threadsafe modules, if you are specifically speaking of handlers (php, ruby, perl, etc.), you could simply migrate those to FastCGI (I recommend mod_fcgid). It's a considerable improvement, and works great with non-threadsafe modules/libraries.

Your best solution, however, is still to use local log files. Again, it's easy to test, and if the lags go away, you'll know.

Michael
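For the cross-node tailing, even a plain ssh loop gets close to what Capistrano is used for here (hostnames and log path are assumptions):

  # Tail the same log on every node into one terminal,
  # prefixing each line with the host it came from.
  for host in cow wolf; do
      ssh -n "$host" tail -f /var/log/httpd/access_log | sed "s/^/[$host] /" &
  done
  wait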