Henning Holtschneider
2008-Jan-22 18:37 UTC
[asterisk-users] chan_sip deadlocks after some time
Hello everybody,

I'm running Asterisk 1.2.24 on three servers that are configured almost identically. The servers use IAX to communicate with each other and SIP to communicate with the outside world through a Patton Smartnode 4960 gateway. One server has about 30 SIP phones registered; the other two servers have about 100 phones registered each.

The "small" server runs fine without any problem whatsoever. On the two larger servers, however, chan_sip stops processing calls and CLI commands after some time. "Some time" is two hours on one day and four hours on another. On some days, everything works flawlessly ...

Whenever chan_sip stops responding, I fire up gdb and see something like this:

(gdb) info thread
  25 Thread -1211487312 (LWP 12519)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  24 Thread -1211749456 (LWP 12520)  0xb7fb58ae in accept () from /lib/tls/libpthread.so.0
  23 Thread -1212011600 (LWP 12521)  0xb7fb3295 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  22 Thread -1215665232 (LWP 12522)  0xb7ea4a27 in select () from /lib/tls/libc.so.6
  21 Thread -1217684560 (LWP 12523)  0xb7e7c99c in nanosleep () from /lib/tls/libc.so.6
  20 Thread -1218057296 (LWP 12524)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  19 Thread -1218319440 (LWP 12525)  0xb7e7c99c in nanosleep () from /lib/tls/libc.so.6
  18 Thread -1218937936 (LWP 12526)  0xb7fb5436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
  17 Thread -1220633680 (LWP 12527)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  16 Thread -1221022800 (LWP 12528)  0xb7ea4a27 in select () from /lib/tls/libc.so.6
  15 Thread -1221940304 (LWP 12529)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  14 Thread -1227236432 (LWP 12540)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  13 Thread -1250468944 (LWP 23446)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  12 Thread -1250731088 (LWP 23712)  0xb7e7c99c in nanosleep () from /lib/tls/libc.so.6
  11 Thread -1246721104 (LWP 23717)  0xb7e7c99c in nanosleep () from /lib/tls/libc.so.6
  10 Thread -1245934672 (LWP 23734)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  9 Thread -1247245392 (LWP 23741)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  8 Thread -1246983248 (LWP 23772)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  7 Thread -1249682512 (LWP 23788)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  6 Thread -1246458960 (LWP 23817)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  5 Thread -1247507536 (LWP 23824)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  4 Thread -1245672528 (LWP 23827)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  3 Thread -1246196816 (LWP 23838)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  2 Thread -1250206800 (LWP 23848)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
  1 Thread -1211293568 (LWP 12517)  0xb7ea2523 in poll () from /lib/tls/libc.so.6

I think the interesting thread is #18, the only one waiting on a mutex, so this is the excerpt for thread #18 from "thread apply all bt":

Thread 18 (Thread -1218937936 (LWP 12526)):
#0  0xb7fb5436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0xb7fb289f in _L_mutex_lock_73 () from /lib/tls/libpthread.so.0
#2  0x00000000 in ?? ()
#3  0xffffffff in ?? ()
#4  0xb7e4ce84 in strncasecmp () from /lib/tls/libc.so.6
#5  0x0806162d in ast_deactivate_generator (chan=0x0) at lock.h:601
#6  0xb7bdf259 in local_ast_moh_stop (chan=0x8177838) at res_musiconhold.c:939
#7  0x080676fc in ast_moh_stop (chan=0xfffffffc) at channel.c:3935
#8  0xb7595499 in process_sdp (p=0xb5963000, req=0xb75867b0) at chan_sip.c:3805
#9  0xb75af211 in handle_request_invite (p=0xb5963000, req=0xb75867b0, debug=0, ignore=0, seqno=4, sin=0xfffffffc, recount=0xfffffffc, e=0xfffffffc <Address 0xfffffffc out of bounds>) at chan_sip.c:10671
#10 0xb75b0df1 in handle_request (p=0xb5963000, req=0xb75867b0, sin=0xb75867a0, recount=0xfffffffc, nounlock=0xfffffffc) at chan_sip.c:11457
#11 0xb75b1806 in sipsock_read (id=0x818c4d0, fd=13, events=1, ignore=0x0) at chan_sip.c:11603
#12 0x08055f87 in ast_io_wait (ioc=0x818c7f8, howlong=-4) at io.c:284
#13 0xb75b1f20 in do_monitor (data=0x0) at chan_sip.c:11774
#14 0xb7fb0b63 in start_thread () from /lib/tls/libpthread.so.0
#15 0xb7eab18a in clone () from /lib/tls/libc.so.6

The unknown symbols in frames #2 and #3 come from app_queue, which is stripped on these machines.

I've tried every configuration change I can think of (ranging from turning canreinvite on and off in sip.conf to removing all SIP hints in extensions.conf) without any visible success :-(

I would appreciate it if someone with profound knowledge of chan_sip could have a look at the issue. I can provide a full backtrace (except information from app_queue.so, see above) if necessary. I would like to file a bug in Digium's bugtracker, but I think it will be rejected because I'm using Asterisk 1.2 (I cannot upgrade due to dialplan incompatibilities). Since this is a commercial project, I'm also willing to pay for support if there is a prospect of success.
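
For anyone triaging a similar hang, picking the lock-waiting threads out of the "info thread" listing above can be scripted. The following is only a rough sketch: the function name lock_waiters, the regular expression, and the "lock_wait" substring check are my own assumptions for illustration, not part of gdb or Asterisk, and the row format is assumed to match the listing shown in this post.

```python
import re

# One row of the "info thread" listing as it appears in this post, e.g.
#   18 Thread -1218937936 (LWP 12526) 0xb7fb5436 in __lll_mutex_lock_wait () from ...
# The pattern is an assumption based on that layout only.
THREAD_ROW = re.compile(
    r"(?P<num>\d+)\s+Thread\s+(?P<tid>-?\d+)\s+\(LWP\s+(?P<lwp>\d+)\)\s+"
    r"0x[0-9a-f]+\s+in\s+(?P<func>\S+)\s+\(\)"
)


def lock_waiters(listing):
    """Return (thread number, LWP, function) for threads blocked in a lock wait."""
    return [
        (int(m.group("num")), int(m.group("lwp")), m.group("func"))
        for m in THREAD_ROW.finditer(listing)
        if "lock_wait" in m.group("func")  # hypothetical heuristic
    ]


# Three rows copied from the listing in this post.
sample = """
  19 Thread -1218319440 (LWP 12525)  0xb7e7c99c in nanosleep () from /lib/tls/libc.so.6
  18 Thread -1218937936 (LWP 12526)  0xb7fb5436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
  17 Thread -1220633680 (LWP 12527)  0xb7ea2523 in poll () from /lib/tls/libc.so.6
"""

print(lock_waiters(sample))
```

Run against the full listing, this should single out thread #18 as the only one stuck in __lll_mutex_lock_wait, which is the thread whose backtrace is worth reading first. It does not say which thread holds the mutex; that still has to come from the backtraces.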

Thanks for your help,
Henning Holtschneider

--
LocaNet oHG - http://www.loca.net
Lindemannstrasse 81, D-44137 Dortmund
tel +49 231 91596-25, fax +49 231 91596-55
sip 25 at voip.loca.net
Registergericht Amtsgericht Dortmund HRA 14208
Geschäftsführer Sven Haufe, Henning Holtschneider