thr3ads.net - CentOS virt - [CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86

If this information is useful, please help other people find it:
Share via:

Sarah Newman

2016-Mar-10 04:15 UTC

[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one
host for the last couple of weeks and have gotten several soft lockups
within the last 24 hours. I am posting here first in case anyone else has
experienced the same issue.

Here is the first instance:

sched: RT throttling activated
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
Modules linked in: ebt_arp xen_pciback xen_gntalloc ebt_ip ebtable_filter
ebtables ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4
iptable_filter ip_tables xt_physdev br_netfilter bridge stp llc ip6t_REJECT
nf_reject_ipv6 nf_c
onntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables
ipv6 xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev
xen_evtchn xenfs xen_privcmd joydev sg 8250_fintek serio_raw gpio_ich iTCO_wdt
iTCO_vendor_su
pport coretemp intel_powerclamp crct10dif_pclmul crc32_pclmul crc32c_intel
pcspkr i2c_i801 lpc_ich igb ptp pps_core hwmon ioatdma dca i7core_edac
edac_core shpchp ext3 jbd mbcache raid10 raid1 sd_mod mptsas mptscsih mptbase
scsi_transpor
t_sas aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 ahci
libahci mgag200 ttm drm_kms_helper dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.25-18.el6.x86_64 #1
Hardware name: Supermicro X8DTN+-F/X8DTN+-F, BIOS 080016  08/03/2011
task: ffffffff81c1b4c0 ti: ffffffff81c00000 task.ti: ffffffff81c00000
RIP: e030:[<ffffffffa02811d5>]  [<ffffffffa02811d5>]
xenvif_tx_build_gops+0xa5/0x890 [xen_netback]
RSP: e02b:ffff88013f403ca8  EFLAGS: 00000206
RAX: 000000000000003c RBX: ffffc90012c28000 RCX: ffffc90012c280d0
RDX: 0000000000071ea1 RSI: 0000000000000040 RDI: ffffc90012c28000
RBP: ffff88013f403e38 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000071e65
R13: ffff88013f403e50 R14: 000000000000003c R15: 0000000000000032
FS:  00007fe942ac7980(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8000007f6800 CR3: 00000000bcd4c000 CR4: 0000000000002660
Stack:
 ffff88013f403d30 ffff880006de6800 ffff8800adfee1c0 ffff88013f403e54
 000000015786e75a 00000040a0351878 ffffc90012c2def0 000000015786e73c
 ffffc90012c280d0 ffffc90012c2def0 ffffffffa03516c0 ffff8800bf29e000
Call Trace:
 <IRQ>
 [<ffffffffa03516c0>] ? br_handle_frame_finish+0x3f0/0x3f0 [bridge]
 [<ffffffff815a2b9e>] ? __netif_receive_skb_core+0x1ee/0x640
 [<ffffffff815a3017>] ? __netif_receive_skb+0x27/0x70
 [<ffffffff815a326d>] ? netif_receive_skb_internal+0x2d/0x90
 [<ffffffffa01d5053>] ? igb_alloc_rx_buffers+0x63/0xe0 [igb]
 [<ffffffffa0281a0d>] xenvif_tx_action+0x4d/0xa0 [xen_netback]
 [<ffffffffa02843b5>] xenvif_poll+0x35/0x68 [xen_netback]
 [<ffffffff815a39c2>] net_rx_action+0x112/0x2a0
 [<ffffffff81076b7c>] __do_softirq+0xfc/0x2b0
 [<ffffffff81076e3d>] irq_exit+0xbd/0xd0
 [<ffffffff813b2cbc>] xen_evtchn_do_upcall+0x3c/0x50
 [<ffffffff8167659e>] xen_do_hypervisor_callback+0x1e/0x40
 <EOI>
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff8100a830>] ? xen_safe_halt+0x10/0x20
 [<ffffffff8101ec84>] ? default_idle+0x24/0xc0
 [<ffffffff8101e28f>] ? arch_cpu_idle+0xf/0x20
 [<ffffffff810b2276>] ? cpuidle_idle_call+0xd6/0x1d0
 [<ffffffff81091312>] ? __atomic_notifier_call_chain+0x12/0x20
 [<ffffffff810b24a5>] ? cpu_idle_loop+0x135/0x1e0
 [<ffffffff810b256b>] ? cpu_startup_entry+0x1b/0x70
 [<ffffffff810b25b0>] ? cpu_startup_entry+0x60/0x70
 [<ffffffff81667c57>] ? rest_init+0x77/0x80
 [<ffffffff81d774c9>] ? start_kernel+0x441/0x448
 [<ffffffff81d76ea6>] ? set_init_arg+0x5d/0x5d
 [<ffffffff81d76603>] ? x86_64_start_reservations+0x2a/0x2c
 [<ffffffff81d7aae1>] ? xen_start_kernel+0x5ef/0x5f1
Code: 00 0f 87 06 07 00 00 44 8b b3 b8 00 00 00 44 03 b3 c0 00 00 00 45 29 e6 41
39 c6 44 0f 47 f0 45 85 f6 0f 84 8f 00 00 00 0f ae e8 <8b> 83 c0 00
00 00 83 e8 01 44 21 e0 48 8d 04 40 48 c1 e0 02 48

Of the remaining lockups, here is the common backtrace with the exception that
there have been two instances of RIP being in net_rx_action:

 [<ffffffff815a39c2>] net_rx_action+0x112/0x2a0
 [<ffffffff81076b7c>] __do_softirq+0xfc/0x2b0
 [<ffffffff81076e3d>] irq_exit+0xbd/0xd0
 [<ffffffff813b2cbc>] xen_evtchn_do_upcall+0x3c/0x50
 [<ffffffff8167659e>] xen_do_hypervisor_callback+0x1e/0x40
 <EOI>
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff8100a830>] ? xen_safe_halt+0x10/0x20
 [<ffffffff8101ec84>] ? default_idle+0x24/0xc0
 [<ffffffff8101e28f>] ? arch_cpu_idle+0xf/0x20
 [<ffffffff810b2276>] ? cpuidle_idle_call+0xd6/0x1d0
 [<ffffffff81091312>] ? __atomic_notifier_call_chain+0x12/0x20
 [<ffffffff810b24a5>] ? cpu_idle_loop+0x135/0x1e0
 [<ffffffff810b256b>] ? cpu_startup_entry+0x1b/0x70
 [<ffffffff810b25b0>] ? cpu_startup_entry+0x60/0x70
 [<ffffffff81667c57>] ? rest_init+0x77/0x80
 [<ffffffff81d774c9>] ? start_kernel+0x441/0x448
 [<ffffffff81d76ea6>] ? set_init_arg+0x5d/0x5d
 [<ffffffff81d76603>] ? x86_64_start_reservations+0x2a/0x2c
 [<ffffffff81d7aae1>] ? xen_start_kernel+0x5ef/0x5f1

I can post more complete backtraces if that information would be useful to
someone.

Sarah Newman

2016-Mar-10 08:05 UTC

head link

[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

On 03/09/2016 08:15 PM, Sarah Newman wrote:> I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on
one host for the last couple of weeks and have gotten several soft lockups
> within the last 24 hours. I am posting here first in case anyone else has
experienced the same issue.
> 
Here is mpstat from around the time of the issue:

0:08:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest 
%idle
10:09:10 PM  all    0.00    0.00   66.67    0.00    0.00   33.33    0.00    0.00
0.00
10:09:11 PM  all    2.17    0.00    5.43   32.61    0.00   58.70    1.09    0.00
0.00
10:09:12 PM  all    0.00    0.00    1.15    0.00    0.00   85.06    0.00    0.00
13.79
10:09:13 PM  all    0.00    0.00    1.08    0.00    0.00   83.87    0.00    0.00
15.05
10:09:14 PM  all    0.00    0.00    1.10    0.00    0.00   83.52    0.00    0.00
15.38
10:09:15 PM  all    1.09    0.00    1.09    0.00    0.00   85.87    0.00    0.00
11.96
10:09:51 PM  all    0.00    0.00    1.09    0.00    0.00   84.78    1.09    0.00
13.04
Message from syslogd at Mar  9 22:09:51 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
10:10:02 PM  all    0.00    0.00   33.33   50.00    0.00   16.67    0.00    0.00
0.00
10:10:03 PM  all    3.16    0.00   10.53    8.42    0.00    2.11    1.05    0.00
74.74
10:10:04 PM  all    0.00    0.00    3.23   38.71    0.00    1.08    1.08    0.00
55.91
10:10:05 PM  all    0.00    0.00    4.30   11.83    0.00    3.23    1.08    0.00
79.57

Typical load:

10:22:15 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest
%idle
10:22:16 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00
97.96
10:22:17 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.04    0.00
98.96
10:22:18 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01    0.00
97.98
10:22:19 PM  all    0.00    0.00    1.01    0.00    0.00    1.01    0.00    0.00
97.98
10:22:20 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.02    0.00
98.98
10:22:21 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00    0.00
97.96
10:22:22 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01    0.00
97.98


I reverted to an older kernel since the older kernel had run for a couple of
months without issues.

Sarah Newman

2016-Mar-12 23:47 UTC

head link

[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

On 03/10/2016 12:05 AM, Sarah Newman wrote:> On 03/09/2016 08:15 PM, Sarah Newman wrote:
>> I've been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9
on one host for the last couple of weeks and have gotten several soft lockups
>> within the last 24 hours. I am posting here first in case anyone else
has experienced the same issue.
>>
> 
> Here is mpstat from around the time of the issue:
> 
> 0:08:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal 
%guest   %idle
> 10:09:10 PM  all    0.00    0.00   66.67    0.00    0.00   33.33    0.00   
0.00    0.00
> 10:09:11 PM  all    2.17    0.00    5.43   32.61    0.00   58.70    1.09   
0.00    0.00
> 10:09:12 PM  all    0.00    0.00    1.15    0.00    0.00   85.06    0.00   
0.00   13.79
> 10:09:13 PM  all    0.00    0.00    1.08    0.00    0.00   83.87    0.00   
0.00   15.05
> 10:09:14 PM  all    0.00    0.00    1.10    0.00    0.00   83.52    0.00   
0.00   15.38
> 10:09:15 PM  all    1.09    0.00    1.09    0.00    0.00   85.87    0.00   
0.00   11.96
> 10:09:51 PM  all    0.00    0.00    1.09    0.00    0.00   84.78    1.09   
0.00   13.04
> Message from syslogd at Mar  9 22:09:51 ...
>  kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> 10:10:02 PM  all    0.00    0.00   33.33   50.00    0.00   16.67    0.00   
0.00    0.00
> 10:10:03 PM  all    3.16    0.00   10.53    8.42    0.00    2.11    1.05   
0.00   74.74
> 10:10:04 PM  all    0.00    0.00    3.23   38.71    0.00    1.08    1.08   
0.00   55.91
> 10:10:05 PM  all    0.00    0.00    4.30   11.83    0.00    3.23    1.08   
0.00   79.57
> 
> Typical load:
> 
> 10:22:15 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal 
%guest   %idle
> 10:22:16 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00   
0.00   97.96
> 10:22:17 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.04   
0.00   98.96
> 10:22:18 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01   
0.00   97.98
> 10:22:19 PM  all    0.00    0.00    1.01    0.00    0.00    1.01    0.00   
0.00   97.98
> 10:22:20 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    1.02   
0.00   98.98
> 10:22:21 PM  all    0.00    0.00    1.02    0.00    0.00    1.02    0.00   
0.00   97.96
> 10:22:22 PM  all    0.00    0.00    0.00    0.00    0.00    1.01    1.01   
0.00   97.98
> 
> 
> I reverted to an older kernel since the older kernel had run for a couple
of months without issues.

This did not fix it. I isolated the issue to a vif rate limit of 100Mb/s being
applied to one of the guests and am now able to reproduce on a
different machine.

I will look into whether this has been fixed already; if so I will submit a pull
request for the Xen4CentOS kernel and if not I will take it up with
the xen-devel list.

Seemingly Similar Threads

Search for more maybe matching threads

CentOS virt - Mar 2016 - Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

[CentOS-virt] Soft lockups with Xen4CentOS 3.18.25-18.el6.x86_64

Seemingly Similar Threads