Hi,

We are having a problem with apache+perl hanging.

 1626 ?  Ss    0:00 sendmail: rejecting connections on daemon MTA: load average: 152
 1634 ?  Ss    0:00 sendmail: Queue runner at 01:00:00 for /var/spool/clientmqueue
 1741 ?  Ss    0:00 /usr/sbin/httpd
 1744 ?  S     0:00  \_ /usr/local/sbin/cronolog /site/logssite/access_log.%Y%m%d
21377 ?  S     0:00  \_ /usr/sbin/httpd
23942 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21518 ?  S     0:00  \_ /usr/sbin/httpd
23987 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21552 ?  S     0:00  \_ /usr/sbin/httpd
23873 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21563 ?  S     0:00  \_ /usr/sbin/httpd
23948 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21590 ?  S     0:00  \_ /usr/sbin/httpd
23866 ?  R    39:21  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21596 ?  S     0:00  \_ /usr/sbin/httpd
23929 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi

Process 23866 keeps running while all the others freeze. Running strace on them also blocks.
I'm attaching the locking_state data.

Dmesg:

-----
OCFS2 Node Manager 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build 9e5f332181e8ebfad464946bcc4888af)
OCFS2 DLM 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build e2556a71429f31033b275dff4b5594aa)
OCFS2 DLMFS 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build e2556a71429f31033b275dff4b5594aa)
OCFS2 User DLM kernel interface loaded
o2net: accepted connection from node ws3 (num 19) at 172.16.42.3:7777
o2net: connected to node ws1 (num 0) at 172.16.42.1:7777
o2net: connected to node ws2 (num 1) at 172.16.42.2:7777
OCFS2 1.2.5 Tue Apr 10 12:29:28 EDT 2007 (build 0f745576f5282c9408787369d99ba880)
ocfs2_dlm: Nodes in domain ("C1B50B9082BC4B74A13FF6F34D35B68B"): 0 1 12 19
kjournald starting. Commit interval 5 seconds
ocfs2: Mounting device (3,3) on (node 12, slot 2)
-----

These lockups keep happening from time to time (about 3 times a week).
Today it has already happened twice.

Thanks for any info
Nuno Fernandes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: locking_state.bz2
Type: application/x-bzip2
Size: 162483 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080404/0e2f6483/attachment-0001.bz2

Don't send the raw locking_state info. Instead, send the human-readable
output produced by debugfs.ocfs2:

$ debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/out
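
Once the fs_locks dump exists, a quick first check is whether any lock is flagged busy. The snippet below is only a sketch: it assumes that a busy lock carries the word "Busy" somewhere in its entry, and the function name count_busy_locks is made up for illustration. The exact layout of the fs_locks output can differ between ocfs2-tools versions, so treat the count as a hint and read the dump itself.

```shell
#!/bin/sh
# Sketch only: count lines mentioning "Busy" in a saved fs_locks dump.
# Assumption: each busy lock shows the word "Busy" once in its entry.
count_busy_locks() {
    dump="${1:-/tmp/out}"
    # grep -c prints the number of matching lines (0 when none match).
    grep -c -w 'Busy' "$dump"
}

# Typical use, after running the debugfs.ocfs2 command above:
#   count_busy_locks /tmp/out
```

A result of 0 matches the "no busy locks" case, in which the lock manager is unlikely to be what the processes are waiting on.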

On Friday 04 April 2008 18:40:42 Sunil Mushran wrote:
> debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/out

Hi,

It happened again. I'm attaching the output.

Thanks for any info,
Best regards,
Nuno Fernandes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: out.bz2
Type: application/x-bzip2
Size: 85711 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080404/8337c6c8/attachment-0001.bz2

No busy locks. The hang is not caused by o2dlm.

Do a kernel stack dump:

echo t >/proc/sysrq-trigger

That should tell us where the process is stuck in the kernel.
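
For anyone repeating this, the stack dump ends up in the kernel log, not on the terminal. The sketch below wraps the two steps in small functions. The function names, the SYSRQ_TRIGGER override and the default output path are illustrative assumptions and do not come from this thread.

```shell
#!/bin/sh
# Sketch only: helper names and default paths are illustrative.

# Ask the kernel to log the stack of every task (SysRq "t"). Needs root.
# SYSRQ_TRIGGER can point at another file for a dry run.
request_task_dump() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    echo t > "$trigger"
}

# Save the kernel ring buffer, which should now contain the stack dump.
save_kernel_log() {
    out="${1:-/tmp/sysrq-t.txt}"
    dmesg > "$out"
}

# Typical use, as root:
#   request_task_dump && save_kernel_log /tmp/sysrq-t.txt
```

On a machine with many processes the dump can be larger than the ring buffer. In that case the copy written by syslog is likely to be the more complete one.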

Hello,

The sysrq output is attached. The process list is:

...
17280 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 6373 ?  S     0:00  \_ /usr/sbin/httpd
16561 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 6380 ?  S     0:00  \_ /usr/sbin/httpd
17093 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7649 ?  S     0:00  \_ /usr/sbin/httpd
16566 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7772 ?  S     0:00  \_ /usr/sbin/httpd
17829 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7905 ?  S     0:00  \_ /usr/sbin/httpd
17233 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7924 ?  S     0:00  \_ /usr/sbin/httpd
16477 ?  R   343:20  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7986 ?  S     0:00  \_ /usr/sbin/httpd
16565 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 8007 ?  S     0:00  \_ /usr/sbin/httpd
30015 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 8066 ?  S     0:00  \_ /usr/sbin/httpd
17236 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 8070 ?  S     0:00  \_ /usr/sbin/httpd
16480 ?  D     0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
...

Thanks for any info,
Best regards,
Nuno Fernandes

On Friday 04 April 2008 19:21:05 Sunil Mushran wrote:
> No busy locks. The hang is not because of o2dlm.
>
> Do a kernel stack dump. echo t >/proc/sysrq-trigger
> That should tell us where the process is stuck in the kernel.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.bz2
Type: application/x-bzip2
Size: 13891 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080406/358cdc4c/attachment.bz2

We have a similar situation: our system hangs several times a day.

I still can't figure out exactly what's going wrong. On one node of our
system (which runs Apache plus a web service written in Ruby
(Mongrel/Camping)), the system load keeps rising until the node stops
responding to anything. At that point the process list also shows a lot of
processes in D state.

The weird thing is that we just discovered that rebooting *another* node
(we have 4 in total) fixes the situation. The system load on the node that
initially had the problem suddenly returns to a normal level, and the
processes that were in D state return to their normal states.

Any idea why rebooting another node fixes this situation? And what might
be the cause?

We are running:

Linux test01 2.6.22-14-server #1 SMP Thu Jan 31 23:57:25 UTC 2008 x86_64 GNU/Linux

[   77.688875] OCFS2 Node Manager 1.3.3
[   77.703166] OCFS2 DLM 1.3.3
[   77.710731] OCFS2 DLMFS 1.3.3
[   77.710816] OCFS2 User DLM kernel interface loaded
[   85.870956] OCFS2 1.3.3

Kind regards,
Erik.

> Hello,
>
> yes.. when this situation happens there is always a process spinning
> (running at 100% CPU). We can't kill it even with kill -9

As for the locks, we were having the same problems with Apache reading from
the ocfs2 volume. It hasn't happened in a while.

One thing we found is that the (in our case) PHP processes were all trying
to write to the same log file on the ocfs2 volume. This of course raised
some rather serious locking concerns.

We also turned off atimes (but we did that from the beginning), and we're
using ocfs2-tools 1.3.9.

Michael
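
One possible way to avoid every node appending to a single shared log file is to give each node its own file name. This is only a sketch of that idea and is not something described in this thread: the function name node_log_path, the directory /var/log/myapp and the app-NODE.log naming scheme are all hypothetical.

```shell
#!/bin/sh
# Sketch only: build a log file path that is unique per cluster node,
# so that different nodes do not append to the same file.
# The directory and file name pattern are hypothetical.
node_log_path() {
    dir="${1:-/var/log/myapp}"
    node="${2:-$(hostname)}"
    printf '%s/app-%s.log\n' "$dir" "$node"
}

# Typical use inside a startup script:
#   LOGFILE=$(node_log_path /var/log/myapp)
```

Keeping such logs on a local disk and off the shared volume removes cluster locking from the logging path entirely, at the cost of having to collect the files from each node.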

Some more information on the lockup Erik reported: when the lockup occurs,
all processes trying to access the ocfs2 filesystem get stuck in D state,
with WCHAN showing "ocfs2_wait_for_mask". fs_locks then shows about 5 Busy
locks, all of the metadata type.

Attached are the results of "echo t >/proc/sysrq-trigger".

Kind regards,
Ivo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sysrq_trigger_t.txt.gz
Type: application/octet-stream
Size: 40443 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080411/1ee9f267/attachment-0001.obj
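
The D state and WCHAN values mentioned above can be collected with ps. The following is a minimal sketch, assuming the procps version of ps found on most Linux systems. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: function names are illustrative.

# Keep the header line plus every process whose STAT starts with "D"
# (uninterruptible sleep). Reads ps-style output on standard input.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List blocked processes along with the kernel function they sleep in.
# wchan:32 widens the WCHAN column so long symbol names are not cut off.
list_d_state() {
    ps -eo pid,stat,wchan:32,args | filter_d_state
}

# Typical use while the hang is in progress:
#   list_d_state
```

If most of the listed processes show the same WCHAN, as in the report above, that symbol is a good starting point when reading the sysrq stack dump.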