Hi,
We are having a problem with apache+perl hanging.
 1626 ? Ss   0:00 sendmail: rejecting connections on daemon MTA: load average: 152
 1634 ? Ss   0:00 sendmail: Queue runner at 01:00:00 for /var/spool/clientmqueue
 1741 ? Ss   0:00 /usr/sbin/httpd
 1744 ? S    0:00  \_ /usr/local/sbin/cronolog /site/logssite/access_log.%Y%m%d
21377 ? S    0:00  \_ /usr/sbin/httpd
23942 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21518 ? S    0:00  \_ /usr/sbin/httpd
23987 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21552 ? S    0:00  \_ /usr/sbin/httpd
23873 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21563 ? S    0:00  \_ /usr/sbin/httpd
23948 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21590 ? S    0:00  \_ /usr/sbin/httpd
23866 ? R   39:21  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
21596 ? S    0:00  \_ /usr/sbin/httpd
23929 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
Process 23866 keeps running and all the others freeze. strace also blocks.
I am attaching the locking_state data. Dmesg:
-----
OCFS2 Node Manager 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build 9e5f332181e8ebfad464946bcc4888af)
OCFS2 DLM 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build e2556a71429f31033b275dff4b5594aa)
OCFS2 DLMFS 1.2.5 Tue Apr 10 12:29:33 EDT 2007 (build e2556a71429f31033b275dff4b5594aa)
OCFS2 User DLM kernel interface loaded
o2net: accepted connection from node ws3 (num 19) at 172.16.42.3:7777
o2net: connected to node ws1 (num 0) at 172.16.42.1:7777
o2net: connected to node ws2 (num 1) at 172.16.42.2:7777
OCFS2 1.2.5 Tue Apr 10 12:29:28 EDT 2007 (build 0f745576f5282c9408787369d99ba880)
ocfs2_dlm: Nodes in domain ("C1B50B9082BC4B74A13FF6F34D35B68B"): 0 1 12 19
kjournald starting. Commit interval 5 seconds
ocfs2: Mounting device (3,3) on (node 12, slot 2)
-----
These lockups keep happening from time to time (about 3 times a week).
Today it has already happened twice.
Thanks for any info
Nuno Fernandes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: locking_state.bz2
Type: application/x-bzip2
Size: 162483 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080404/0e2f6483/attachment-0001.bz2

Don't send the raw locking_state info. Instead, send the human-readable
output produced by debugfs.ocfs2:

$ debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/out

On Friday 04 April 2008 18:40:42 Sunil Mushran wrote:
> debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/out

Hi,

It happened again. I am attaching the output.

Thanks for any info,
Best regards,
Nuno Fernandes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: out.bz2
Type: application/x-bzip2
Size: 85711 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080404/8337c6c8/attachment-0001.bz2

No busy locks. The hang is not because of o2dlm.

Do a kernel stack dump:

echo t >/proc/sysrq-trigger

That should tell us where the process is stuck in the kernel.

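Before or alongside the full sysrq dump, it can help to list only the tasks that are in D state together with their wait channel. The following is a minimal sketch, assuming a Linux /proc that exposes `stat` and `wchan` per process; the function names are hypothetical and not part of any OCFS2 tool.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ")",
    # so split on the last ")" rather than on whitespace.
    left, _, rest = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    return int(pid_str), comm, rest.split()[0]


def read_text(path):
    with open(path, errors="replace") as handle:
        return handle.read()


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm, wchan) for every process currently in D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            pid, comm, state = parse_stat(read_text(os.path.join(base, "stat")))
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file is unreadable
        if state != "D":
            continue
        try:
            wchan = read_text(os.path.join(base, "wchan")).strip()
        except OSError:
            wchan = "?"  # wchan may be missing or restricted on some kernels
        found.append((pid, comm, wchan))
    return sorted(found)
```

Calling `uninterruptible_processes()` as root on the hung node and printing the result gives a short list to compare against the stack dump. It does not replace the dump: the wait channel is only the innermost function name, not the full kernel stack.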
Hello,

The sysrq output is attached. The process list is:

...
17280 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 6373 ? S    0:00  \_ /usr/sbin/httpd
16561 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 6380 ? S    0:00  \_ /usr/sbin/httpd
17093 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7649 ? S    0:00  \_ /usr/sbin/httpd
16566 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7772 ? S    0:00  \_ /usr/sbin/httpd
17829 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7905 ? S    0:00  \_ /usr/sbin/httpd
17233 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7924 ? S    0:00  \_ /usr/sbin/httpd
16477 ? R  343:20  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 7986 ? S    0:00  \_ /usr/sbin/httpd
16565 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 8007 ? S    0:00  \_ /usr/sbin/httpd
30015 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 8066 ? S    0:00  \_ /usr/sbin/httpd
17236 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
 8070 ? S    0:00  \_ /usr/sbin/httpd
16480 ? D    0:00  |   \_ /usr/bin/perl -w /storage/webhosting/site/public_html/alojados/site/MT/come2x.cgi
...

Thanks for any info,
Best regards,
Nuno Fernandes

On Friday 04 April 2008 19:21:05 Sunil Mushran wrote:
> No busy locks. The hang is not because of o2dlm.
>
> Do a kernel stack dump. echo t >/proc/sysrq-trigger
> That should tell us where the process is stuck in the kernel.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.bz2
Type: application/x-bzip2
Size: 13891 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080406/358cdc4c/attachment.bz2

We have a similar situation: our system hangs several times a day. I still
can't figure out exactly what's going wrong. But on one node of our system
(where Apache runs plus a web service written in Ruby (Mongrel/Camping)), the
system load keeps rising until the node stops responding to anything. At that
time the process list also shows a lot of processes in D state.

The weird thing is that we just discovered that rebooting *another* node (we
have 4 in total) fixes this situation. Suddenly the system load on the node
that initially had the problem returns to a normal level, and the processes
that were in D state return to their normal states.

Any idea why rebooting another node fixes this situation? And what might be
the cause of this?

We are running:

Linux test01 2.6.22-14-server #1 SMP Thu Jan 31 23:57:25 UTC 2008 x86_64 GNU/Linux

[ 77.688875] OCFS2 Node Manager 1.3.3
[ 77.703166] OCFS2 DLM 1.3.3
[ 77.710731] OCFS2 DLMFS 1.3.3
[ 77.710816] OCFS2 User DLM kernel interface loaded
[ 85.870956] OCFS2 1.3.3

Kind regards,
Erik.

> Hello,
>
> yes.. when this situation happens there is always a process spinning (running
> at 100%cpu). We can't kill it even with kill -9

In terms of the locks, we were having the same problems with apache reading
off of the ocfs2 volume. It hasn't happened in a while. One thing we found is
that (in our case) the php processes were all trying to write to the same log
file on the ocfs2 volume, which of course raised some rather serious locking
concerns. We also turned off atimes (but we did that from the beginning), and
we're using ocfs2-tools 1.3.9.

Michael

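Michael does not say how the shared log file was dealt with. One possible way to keep writers from contending for the same file on a cluster filesystem is to give each node (and optionally each process) its own log file. The sketch below only illustrates that idea; the function name, directory, and naming scheme are hypothetical assumptions, not anything taken from this thread.

```python
import os
import socket


def per_node_log_path(log_dir, base_name, hostname=None, pid=None):
    """Build a log path that is unique per node and per process.

    Assumption: writers that never share a file do not need to take turns
    on that file's cluster lock. hostname and pid default to the current
    machine and process; they are parameters only to make testing easy.
    """
    if hostname is None:
        hostname = socket.gethostname()
    if pid is None:
        pid = os.getpid()
    return os.path.join(log_dir, f"{base_name}.{hostname}.{pid}.log")
```

For example, `per_node_log_path("/var/log/app", "php_errors")` yields a different file on every node and for every process. Writing logs to a local (non-ocfs2) disk is another option that avoids the cluster locks entirely.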
Some more information on the lockup Erik reported: when the lockup occurs, all
processes trying to access the ocfs2 filesystem get stuck in D state, with
WCHAN showing "ocfs2_wait_for_mask". fs_locks then shows about 5 Busy locks,
all of the metadata type.

Attached are the results of "echo t >/proc/sysrq-trigger".

Kind regards,
Ivo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sysrq_trigger_t.txt.gz
Type: application/octet-stream
Size: 40443 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20080411/1ee9f267/attachment-0001.obj
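The WCHAN observation above can be summarised per node with a small script. This is a rough sketch, assuming the input is the text produced by `ps -eo pid,stat,wchan:32,args` (procps); the function name is hypothetical.

```python
from collections import Counter


def tally_blocked_by_wchan(ps_output):
    """Count D-state processes per wait channel.

    Expects the text output of: ps -eo pid,stat,wchan:32,args
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # header line, blank line, or anything unparsable
        stat, wchan = fields[1], fields[2]
        if stat.startswith("D"):  # D, D+, Ds, ... all mean uninterruptible sleep
            counts[wchan] += 1
    return counts
```

If nearly every blocked process reports the same wait channel, as with "ocfs2_wait_for_mask" here, that points at a single contended resource rather than at unrelated I/O stalls.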