Dovid Bender
2020-Jul-03 06:31 UTC
[asterisk-users] Exceptionally long queue length queuing
On Mon, Jun 29, 2020 at 6:46 AM Joshua C. Colp <jcolp at sangoma.com> wrote:> On Sun, Jun 28, 2020 at 2:26 PM Dovid Bender <dovid at telecurve.com> wrote: > >> Hi, >> >> We have a box up and we are starting to see a lot of "Exceptionally long >> queue length queuing" in the logs. From all the research so far it seems >> like this leads to their systems crashing and being unreachable. In our >> case the box remains up and takes calls. We are running Asterisk 16.6.1. We >> are using MusicOnHold to play online music streams via ffmpeg. Any idea on >> how to troubleshoot this further to see why this is happening? >> > > The message occurs when either a channel is deadlocked or the processing > time to handle what a channel is doing is substantial, and things build up. > Getting a backtrace[1] at the time it occurs will show what the system is > actually doing and can provide insight into what is causing it. > > [1] > https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-Runningast_coredumperfordeadlocks,taskprocessorbackups,etc > . > > -- >I ended up re-writing my code to use less Local channels and most of the errors went away. I also noticed that my load average and CPU usage is way down. I will open a ticket since it seems that it may be a bug that others are experiencing.>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.digium.com/pipermail/asterisk-users/attachments/20200703/c4d838a6/attachment.html>
Joshua C. Colp
2020-Jul-03 09:28 UTC
[asterisk-users] Exceptionally long queue length queuing
On Fri, Jul 3, 2020 at 3:32 AM Dovid Bender <dovid at telecurve.com> wrote:> > > On Mon, Jun 29, 2020 at 6:46 AM Joshua C. Colp <jcolp at sangoma.com> wrote: > >> On Sun, Jun 28, 2020 at 2:26 PM Dovid Bender <dovid at telecurve.com> wrote: >> >>> Hi, >>> >>> We have a box up and we are starting to see a lot of "Exceptionally long >>> queue length queuing" in the logs. From all the research so far it seems >>> like this leads to their systems crashing and being unreachable. In our >>> case the box remains up and takes calls. We are running Asterisk 16.6.1. We >>> are using MusicOnHold to play online music streams via ffmpeg. Any idea on >>> how to troubleshoot this further to see why this is happening? >>> >> >> The message occurs when either a channel is deadlocked or the processing >> time to handle what a channel is doing is substantial, and things build up. >> Getting a backtrace[1] at the time it occurs will show what the system is >> actually doing and can provide insight into what is causing it. >> >> [1] >> https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-Runningast_coredumperfordeadlocks,taskprocessorbackups,etc >> . >> >> -- >> > I ended up re-writing my code to use less Local channels and most of the > errors went away. I also noticed that my load average and CPU usage is way > down. I will open a ticket since it seems that it may be a bug that others > are experiencing. >A backtrace would still be needed for the issue, otherwise there is no way to know what exactly is going on our where things are getting held up. -- Joshua C. Colp Asterisk Technical Lead Sangoma Technologies Check us out at www.sangoma.com and www.asterisk.org -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.digium.com/pipermail/asterisk-users/attachments/20200703/0e986829/attachment.html>
Patrick Wakano
2020-Jul-23 00:05 UTC
[asterisk-users] Exceptionally long queue length queuing
Hello list, Hope you are all doing well! Sorry for the long email but I tried to explain all I've seen regarding this issue.... I am going to open a ticket for this issue but I found it useful to already explain it here. I've also recently faced this "Exceptionally long queue length queuing" error in some servers running 16.8.0 and after A LOT of investigation, I've discovered what is causing it in my case (not sure if is the same case as initially reported by Dovid) So, what I discovered is that in case the diaplan has a Wait() and during this wait period many "deferrable frames" are received, the final piece of the Wait function (ast_safe_sleep_conditional) can throw the "Exceptionally long queue length queuing" error message. It is not really common to get a lot of "deferrable frames" (see ast_is_deferrable_frame) but we can have that by simply putting on hold and off hold while the channel is on "wait". My production case is a little more complex and happens while on Wait and with the AST_CONTROL_SRCCHANGE event. This happens when an old Asterisk 13.13.0 receives a hold REINVITE and starts a new RTP stream (with different SSRC) towards some other victim Asterisk with the Music on Hold while still sending the original caller RTP (so at this point 2 RTP streams with different SSRCs are sent to the destination that happens to be a channel currently on Wait). I could not find any ticket for such bug, but version 16.8.0 does not have this problem and instead seems to inject the Music on Hold in the existing RTP stream so there aren't 2 streams at the same time to the same destination. The problem with 2 streams is that each RTP package of the new stream generates a source change frame (AST_CONTROL_SRCCHANGE), and at 50pps this builds up a lot of events fast and so if the Wait is high the issue pops up (I agree having a channel on Wait for too long is not a good practice but noone is perfect...). Also if multiple channels are facing this situation then the problem just escalates very bad. When the "Exceptionally long queue length queuing" is happening all AMI commands seem to fail, BYE messages seem to be missed/delayed, internal timeouts expire and all sort of weird things happen to all channels in the server, apparently the whole Asterisk process gets locked in this loop affecting everything else... and this is what makes this issue really bad because Asterisk can become completely unresponsive while the error message is happening... if it only could throw the error and keep working fine I guess it would be reasonable leaving it up to the dialplan logic to avoid a long Wait().... Anyway, to replicate the issue is enough a simple dialplan like this: exten => 8888,1,Answer() same => n,Wait(30) same => n,NoOp(After wait) same => n,Playback(goodbye) same => n,Hangup() After calling in and while in the Wait(30), repeatedly press Hold/Unhold in the telephone (with Linphone just press the hold once and then press and hold the spacebar which will repeatedly do the hold and unhold several times). After Wait is finished the "Exceptionally long queue length queuing" will show up but gets resolved very fast just because I think there weren't enough frames queued to really freeze Asterisk for too long. Thank you, Kind regards, Patrick Wakano On Fri, 3 Jul 2020 at 19:29, Joshua C. Colp <jcolp at sangoma.com> wrote:> On Fri, Jul 3, 2020 at 3:32 AM Dovid Bender <dovid at telecurve.com> wrote: > >> >> >> On Mon, Jun 29, 2020 at 6:46 AM Joshua C. Colp <jcolp at sangoma.com> wrote: >> >>> On Sun, Jun 28, 2020 at 2:26 PM Dovid Bender <dovid at telecurve.com> >>> wrote: >>> >>>> Hi, >>>> >>>> We have a box up and we are starting to see a lot of "Exceptionally >>>> long queue length queuing" in the logs. From all the research so far it >>>> seems like this leads to their systems crashing and being unreachable. In >>>> our case the box remains up and takes calls. We are running Asterisk >>>> 16.6.1. We are using MusicOnHold to play online music streams via ffmpeg. >>>> Any idea on how to troubleshoot this further to see why this is happening? >>>> >>> >>> The message occurs when either a channel is deadlocked or the processing >>> time to handle what a channel is doing is substantial, and things build up. >>> Getting a backtrace[1] at the time it occurs will show what the system is >>> actually doing and can provide insight into what is causing it. >>> >>> [1] >>> https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-Runningast_coredumperfordeadlocks,taskprocessorbackups,etc >>> . >>> >>> -- >>> >> I ended up re-writing my code to use less Local channels and most of the >> errors went away. I also noticed that my load average and CPU usage is way >> down. I will open a ticket since it seems that it may be a bug that others >> are experiencing. >> > > A backtrace would still be needed for the issue, otherwise there is no way > to know what exactly is going on our where things are getting held up. > > -- > Joshua C. Colp > Asterisk Technical Lead > Sangoma Technologies > Check us out at www.sangoma.com and www.asterisk.org > -- > _____________________________________________________________________ > -- Bandwidth and Colocation Provided by http://www.api-digital.com -- > > Check out the new Asterisk community forum at: > https://community.asterisk.org/ > > New to Asterisk? Start here: > https://wiki.asterisk.org/wiki/display/AST/Getting+Started > > asterisk-users mailing list > To UNSUBSCRIBE or update options visit: > http://lists.digium.com/mailman/listinfo/asterisk-users-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.digium.com/pipermail/asterisk-users/attachments/20200723/df45cc05/attachment.html>