George Joseph
2021-Sep-14 17:55 UTC
[asterisk-users] Large system seeing single CPU core spiking
On Tue, Sep 14, 2021 at 9:19 AM Dan Cropp <dan at amtelco.com> wrote:

> Thank you George.
>
> It is using local file based configuration files.

Well, that's good at least. It eliminates the database layer which can be
troublesome in virtualized environments, especially if a SAN and/or a remote
database server is used.

> Other factors.
>
> We run Asterisk in realtime mode to allow it to run as fast as possible.

Running at "realtime" level is usually NOT a good thing for Asterisk and
rarely needed when there are adequate resources. Let's say you have a local
DNS resolver running. If the system is stressed, Asterisk could actually
starve the resolver of resources, which then causes Asterisk to back up
waiting for DNS resolution to complete. We've seen this happen when using a
database backend for configuration. Someone thinks "I'll just give Asterisk
more resources" forgetting that Asterisk needs the database engine to run.

> I just learned customer upgraded to 24 CPU cores. Although, I’m not sure
> they actually assigned 24 physical cores to this machine or just increasing
> Hyper-V values.

How is this VM's priority versus other VMs on the same cluster? Just because
it has 24 threads doesn't mean it's got 24 threads dedicated. Does using a
realtime priority in the VM trickle down to Hyper-V's hypervisor's resource
management algorithms?

> I will monitor for additional information and see if the customer will
> allow me to capture a coredump when problems are happening.
>
> Waiting for them to report an incident.
>
> Here is a small sample of the system right now (24 cores), to the best of
> my knowledge it’s running fine.
>
> top -p 1509 -n 1 -H -b
> top - 15:06:32 up 9:06, 2 users, load average: 6.02, 5.59, 5.26
> Threads: 1709 total, 8 running, 1701 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 3.1 us, 2.5 sy, 0.0 ni, 94.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
> KiB Mem : 32143072 total, 29750072 free, 1016132 used, 1376868 buff/cache
> KiB Swap: 8388604 total, 8388604 free, 0 used. 30697060 avail Mem
>
>   PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
>  1830 root -11  0 13.741g 493680 28828 R 99.9  1.5 174:13.39 asterisk
>  1541 root -11  0 13.741g 493680 28828 R 14.3  1.5  20:03.30 asterisk
> 33601 root -11  0 13.741g 493680 28828 S  9.5  1.5   0:16.30 asterisk
> 46605 root -11  0 13.741g 493680 28828 S  9.5  1.5   0:30.06 asterisk
>  2295 root -11  0 13.741g 493680 28828 S  4.8  1.5  12:25.50 asterisk
>  2297 root -11  0 13.741g 493680 28828 S  4.8  1.5   1:10.59 asterisk

There's definitely one thread that's pegging a CPU. If that thread is one of
the few "singleton" threads, that can be an issue. What does "core show
taskprocessors" indicate? Are there any that are hitting their limits?

> From: asterisk-users <asterisk-users-bounces at lists.digium.com> On
> Behalf Of George Joseph
> Sent: Tuesday, September 14, 2021 9:39 AM
> To: Asterisk Users Mailing List - Non-Commercial Discussion
> <asterisk-users at lists.digium.com>
> Subject: Re: [asterisk-users] Large system seeing single CPU core spiking
>
> On Tue, Sep 14, 2021 at 8:07 AM Dan Cropp <dan at amtelco.com> wrote:
>
> > I am working with a very large customer running Asterisk with PJSIP.
> > Systems total channels have been over 2500 (which includes hundreds of
> > local channels and ConfBridges) when the issues occur.
> >
> > It’s running on a Hyper-V VM with 12 CPU cores.
> >
> > Things work fine most of the time.
> >
> > They periodically see 10-30 minute periods where audio starts sounding
> > like jitter buffer type issues. Can literally have someone spelling their
> > name and a ConfBridge recording of it shows the audio is missing a letter
> > or two.
> >
> > The odd part is another system (not running Asterisk) was handling these
> > calls previously. The overall network has plenty of bandwidth (as
> > evidenced by another system able to handle the call volume)
> >
> > One area that has perplexed us is when using htop, we see a single CPU
> > core will spike to 100%. Which core does keep changing.
> >
> > Asterisk is definitely the process using the vast majority of the CPU
> > cycles.
> >
> > We recently found a setting on Hyper-V networking SR-IOV which improved
> > things. Prior to changing this setting, we were seeing SIP OPTIONS
> > packets/responses would occasionally take more than 3 seconds causing
> > devices to drop and come back online.
> >
> > We have configured a similar system running at Amazon handling far more
> > traffic and can’t get the single CPU core to spike. Only small static
> > pops during the calls.
> >
> > The sheer scale of the system is making it hard to diagnose the problem.
> >
> > Any thoughts on how to diagnose what is causing the single CPU core to
> > spike?
> >
> > Any thoughts on how to diagnose the problem?
> >
> > Any other thoughts/comments?
>
> The first thing I'd do is see where the CPU is spending time: userspace,
> system, nice, wait, etc.
>
> Is it actually the asterisk process consuming the CPU?
>
> Is Asterisk running with local file-based configs, local database, remote
> database, etc?
>
> If call quality is really bad already and your customer agrees, you could
> try the following the next time it happens...
>
> 1. Run "top -p `pidof asterisk` -n 1 -H -b" to get a list of all of
>    Asterisk's threads and their CPU utilization.
>
> 2. Run ast_coredumper with the --RUNNING option. This will pause
>    Asterisk while the dump is being generated!
>
> 3. See if you can correlate the high cpu thread IDs from the top output
>    to the threads listed in the coredumper's -brief.txt file.
>
> That _may_ give you an idea of where to look.
>
> Dan
>
> This email is intended only for the use of the party to which it is
> addressed and may contain information that is privileged, confidential, or
> protected by law. If you are not the intended recipient you are hereby
> notified that any dissemination, copying or distribution of this email or
> its contents is strictly prohibited. If you have received this message in
> error, please notify us immediately by replying to the message and deleting
> it from your computer.
>
> --
> _____________________________________________________________________
> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
>
> Check out the new Asterisk community forum at:
> https://community.asterisk.org/
>
> New to Asterisk? Start here:
> https://wiki.asterisk.org/wiki/display/AST/Getting+Started
>
> asterisk-users mailing list
> To UNSUBSCRIBE or update options visit:
> http://lists.digium.com/mailman/listinfo/asterisk-users
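Steps 1 and 3 of the procedure quoted above (list the busy threads, then match them to the dump) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the thread IDs that `top -H` prints in its PID column are the same kernel IDs that gdb reports as LWP numbers, but the layout of the -brief.txt file is assumed here to use gdb-style `Thread N (Thread 0x... (LWP nnnn))` headers, and the backtrace frames in the sample are hypothetical placeholders rather than real Asterisk output.

```python
import re

# Rows copied from the top output quoted in this thread.
TOP_SAMPLE = """\
  PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
 1830 root -11  0 13.741g 493680 28828 R 99.9  1.5 174:13.39 asterisk
 1541 root -11  0 13.741g 493680 28828 R 14.3  1.5  20:03.30 asterisk
33601 root -11  0 13.741g 493680 28828 S  9.5  1.5   0:16.30 asterisk
46605 root -11  0 13.741g 493680 28828 S  9.5  1.5   0:30.06 asterisk
 2295 root -11  0 13.741g 493680 28828 S  4.8  1.5  12:25.50 asterisk
 2297 root -11  0 13.741g 493680 28828 S  4.8  1.5   1:10.59 asterisk
"""

# ASSUMPTION: gdb-style thread headers; the frames are made-up placeholders.
BRIEF_SAMPLE = """\
Thread 12 (Thread 0x7f0000000700 (LWP 1830)):
#0  0x0000000000000000 in hypothetical_frame_a ()
Thread 11 (Thread 0x7f0000000800 (LWP 1541)):
#0  0x0000000000000000 in hypothetical_frame_b ()
"""

# Matches the LWP number in a gdb thread header, with or without a thread name.
LWP_HEADER = re.compile(r"^Thread \d+ \(.*\(LWP (\d+)\)")


def hot_threads(top_text, min_cpu=10.0):
    """Return {thread_id: cpu_percent} for top rows at or above min_cpu."""
    hot = {}
    for line in top_text.splitlines():
        fields = line.split()
        # Data rows have 12 columns and start with a numeric thread ID;
        # %CPU is the ninth column.
        if len(fields) >= 12 and fields[0].isdigit():
            cpu = float(fields[8])
            if cpu >= min_cpu:
                hot[int(fields[0])] = cpu
    return hot


def backtraces_by_lwp(brief_text):
    """Group the lines that follow each thread header under its LWP number."""
    traces = {}
    current = None
    for line in brief_text.splitlines():
        match = LWP_HEADER.match(line)
        if match:
            current = int(match.group(1))
            traces[current] = []
        elif current is not None:
            traces[current].append(line)
    return traces


def correlate(top_text, brief_text, min_cpu=10.0):
    """Map each hot thread ID to (cpu_percent, backtrace_lines)."""
    traces = backtraces_by_lwp(brief_text)
    return {
        tid: (cpu, traces.get(tid, []))
        for tid, cpu in hot_threads(top_text, min_cpu).items()
    }


if __name__ == "__main__":
    for tid, (cpu, frames) in correlate(TOP_SAMPLE, BRIEF_SAMPLE).items():
        print(f"thread {tid}: {cpu}% CPU")
        for frame in frames:
            print("   ", frame)
```

The top snapshot and the dump need to come from the same Asterisk run, since thread IDs change across restarts.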
Dan Cropp
2021-Sep-14 18:59 UTC
[asterisk-users] Large system seeing single CPU core spiking
Thank you George.
That’s good advice on the realtime mode.
I can have them change the realtime mode setting later this week.
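One way to confirm that such a change took effect is to ask the kernel which scheduling policy the process is using. The PR value of -11 in the top sample from this thread is consistent with a realtime policy at priority 10, since top reports PR as -1 minus the realtime priority. A minimal Linux-only sketch follows; the helper names are made up for illustration, and it assumes "realtime mode" here means a SCHED_FIFO or SCHED_RR policy.

```python
import os

# Human-readable names for the Linux scheduling policies exposed by Python.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}

# The kernel may OR this flag into the reported policy; mask it off if present.
RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0)


def describe_scheduling(pid=0):
    """Return (policy_name, priority) for a PID or thread ID (0 = caller)."""
    policy = os.sched_getscheduler(pid) & ~RESET_ON_FORK
    priority = os.sched_getparam(pid).sched_priority
    return POLICY_NAMES.get(policy, f"unknown({policy})"), priority


def is_realtime(pid=0):
    """True when the task runs under a realtime policy (FIFO or RR)."""
    policy = os.sched_getscheduler(pid) & ~RESET_ON_FORK
    return policy in (os.SCHED_FIFO, os.SCHED_RR)


if __name__ == "__main__":
    # Replace 0 with the Asterisk PID, or with a thread ID taken from
    # "top -H", to inspect that task instead of this script.
    name, priority = describe_scheduling(0)
    print(f"policy={name} priority={priority} realtime={is_realtime(0)}")
```

Passing the main PID only inspects the main thread; individual thread IDs can be checked the same way.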
The customer’s taskprocessor list is very large: "core show taskprocessors"
returns a large number of entries. Only a few show a Max Depth of 10 or more,
so I am including only those plus stasis/pool.
Processor                                       Processed  In Queue  Max Depth  Low water  High water
pjsip/pool-control                                 501599         0         89        450         500
stasis/m:cache_pattern:0/endpoint:all-000015f0     383224         0         21        450         500
stasis/m:devicestate:all-00000002                  233836         0         28        450         500
stasis/m:manager:core-00000006                    4649316         0         69       2700        3000
stasis/pool                                         11670         0          2        450         500
stasis/pool-control                                 23505         0         75        450         500

5922 taskprocessors
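With 5922 taskprocessors, checking by eye which ones are close to their limits is impractical. A rough Python sketch of automating that check follows, assuming each data row is a name followed by the five numeric columns shown above. The 0.8 threshold and the function names are illustrative choices, not anything Asterisk defines.

```python
from typing import NamedTuple

# Rows copied from the "core show taskprocessors" excerpt in this message.
SAMPLE = """\
Processor                                       Processed  In Queue  Max Depth  Low water  High water
pjsip/pool-control                                 501599         0         89        450         500
stasis/m:cache_pattern:0/endpoint:all-000015f0     383224         0         21        450         500
stasis/m:devicestate:all-00000002                  233836         0         28        450         500
stasis/m:manager:core-00000006                    4649316         0         69       2700        3000
stasis/pool                                         11670         0          2        450         500
stasis/pool-control                                 23505         0         75        450         500

5922 taskprocessors
"""


class TaskProcessor(NamedTuple):
    name: str
    processed: int
    in_queue: int
    max_depth: int
    low_water: int
    high_water: int


def parse_taskprocessors(text):
    """Parse rows made of a name followed by five integer columns."""
    rows = []
    for line in text.splitlines():
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # blank line or the trailing "N taskprocessors" summary
        name, *numbers = parts
        if not all(n.isdigit() for n in numbers):
            continue  # header row
        rows.append(TaskProcessor(name.strip(), *map(int, numbers)))
    return rows


def near_limit(rows, fraction=0.8):
    """Rows whose max depth reached at least `fraction` of the high water mark."""
    return [r for r in rows if r.max_depth >= fraction * r.high_water]


if __name__ == "__main__":
    rows = parse_taskprocessors(SAMPLE)
    flagged = near_limit(rows)
    print(f"{len(flagged)} of {len(rows)} listed taskprocessors near their limit")
    # Rank by how close each queue came to its high water mark.
    for r in sorted(rows, key=lambda r: r.max_depth / r.high_water, reverse=True):
        print(f"{r.name}: max depth {r.max_depth} of {r.high_water}")
```

On the rows listed here, none comes close to its high water mark; the deepest relative to its limit is pjsip/pool-control at 89 of 500.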
We do use AMI for a significant amount of communication (actions/events).
Might this be a singleton that could explain the high usage on a single
Asterisk thread ID?
NOTE: We are working on migrating to ARI, which I know will help with call
control.
Dan