Nathan March
2017-Nov-07 23:12 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
Since moving from 4.4 to 4.6, I've been seeing an increasing number of stability issues on our hypervisors. I'm not clear if there's a singular root cause here, or if I'm dealing with multiple bugs.

One of the more common ones I've seen is that a VM on shutdown will remain in the null state and a kernel bug is thrown:

xen001 log # xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  6144    24     r-----    6639.7
(null)                                       3     0     1     --pscd      36.3

[89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
[89920.839546] IP: [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.839933] PGD 2008067
[89920.840022] PUD 17f43f067
[89920.840390] PMD 1e0976067
[89920.840469] PTE 0
[89920.840833]
[89920.841123] Oops: 0000 [#1] SMP
[89920.841417] Modules linked in: ebt_ip ebtable_filter ebtables arptable_filter arp_tables bridge xen_pciback xen_gntalloc nfsd auth_rpcgss nfsv3 nfs_acl nfs fscache lockd sunrpc grace 8021q mrp garp stp llc bonding xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd dcdbas fjes pcspkr ipmi_devintf ipmi_si ipmi_msghandler joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei ioatdma ixgbe mdio igb dca ptp pps_core uas usb_storage wmi ttm
[89920.847080] CPU: 4 PID: 1471 Comm: loop6 Not tainted 4.9.58-29.el6.x86_64 #1
[89920.847381] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.7.1 03/04/2015
[89920.847893] task: ffff8801b75e0700 task.stack: ffffc900460e0000
[89920.848192] RIP: e030:[<ffffffff81430922>] [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.848783] RSP: e02b:ffffc900460e3b20 EFLAGS: 00010246
[89920.849081] RAX: ffff88018916d000 RBX: ffff8801b75e0700 RCX: 0000000000000200
[89920.849384] RDX: 0000000000000000 RSI: ffff88020ee9a000 RDI: ffff88018916d000
[89920.849686] RBP: ffffc900460e3b38 R08: ffff88011da9fcf8 R09: 0000000000000002
[89920.849989] R10: ffff88019535bddc R11: ffffea0006245b5c R12: 0000000000001000
[89920.850294] R13: ffff88018916e000 R14: 0000000000001000 R15: ffffc900460e3b68
[89920.850605] FS:  00007fb865c30700(0000) GS:ffff880204b00000(0000) knlGS:0000000000000000
[89920.851118] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[89920.851418] CR2: ffff88020ee9a000 CR3: 00000001ef03b000 CR4: 0000000000042660
[89920.851720] Stack:
[89920.852009]  ffffffff814375ca ffffc900460e3b38 ffffc900460e3d08 ffffc900460e3bb8
[89920.852821]  ffffffff814381c5 ffffc900460e3b68 ffffc900460e3d08 0000000000001000
[89920.853633]  ffffc900460e3d88 0000000000000000 0000000000001000 ffffea0000000000
[89920.854445] Call Trace:
[89920.854741]  [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
[89920.855043]  [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
[89920.855354]  [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
[89920.855673]  [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
[89920.855992]  [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
[89920.856297]  [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
[89920.856599]  [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
[89920.856899]  [<ffffffff815fc375>] do_req_filebacked+0x195/0x300
[89920.857202]  [<ffffffff815fc53b>] loop_queue_work+0x5b/0x80
[89920.857505]  [<ffffffff810c6898>] kthread_worker_fn+0x98/0x1b0
[89920.857808]  [<ffffffff818d9dca>] ? schedule+0x3a/0xa0
[89920.858108]  [<ffffffff818ddbb6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[89920.858411]  [<ffffffff810c6800>] ? kthread_probe_data+0x40/0x40
[89920.858713]  [<ffffffff810c63f5>] kthread+0xe5/0x100
[89920.859014]  [<ffffffff810c6310>] ? __kthread_init_worker+0x40/0x40
[89920.859317]  [<ffffffff818de2d5>] ret_from_fork+0x25/0x30
[89920.859615] Code: 81 f3 00 00 00 00 e9 1e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
[89920.864410] RIP  [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.864749] RSP <ffffc900460e3b20>
[89920.865021] CR2: ffff88020ee9a000
[89920.865294] ---[ end trace b77d2ce5646284d1 ]---

Wondering if anyone has advice on how to troubleshoot the above, or might have some insight into what the issue could be? This hypervisor was only up for a day and had almost no VMs running on it since boot; I booted a single Windows test VM, which BSOD'ed, and then this happened.

This is on xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these issues across a wide number of systems from both Dell and Supermicro, although we run the same Intel X540 10Gb NICs in each system with the same NetApp NFS backend storage.

Cheers,
Nathan
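A generic first step for an oops like the one above is to resolve the faulting RIP to a source line. Below is a minimal sketch; the vmlinux path and the `faddr2line` location are assumptions that depend on how kernel-debuginfo is installed on your system, and the symbol-extraction `sed` is illustrative:

```shell
#!/bin/sh
# Sketch: resolve the faulting RIP from an oops to a source line.
# The sample line is taken from the trace in this post; on a live box
# you would pull it out of dmesg instead.
LINE='[89920.839546] IP: [<ffffffff81430922>] __memcpy+0x12/0x20'

# Extract the symbol+offset (e.g. "__memcpy+0x12/0x20") from the oops line.
SYM=$(echo "$LINE" | sed -n 's/.*\] \([A-Za-z_][A-Za-z0-9_.]*+0x[0-9a-f]*\/0x[0-9a-f]*\).*/\1/p')
echo "faulting symbol: $SYM"

# Assumed path -- requires the matching kernel-debuginfo package.
VMLINUX=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
if [ -f "$VMLINUX" ]; then
    # faddr2line ships in the kernel source tree under scripts/.
    ./scripts/faddr2line "$VMLINUX" "$SYM"
fi
```

With the debuginfo in place, `faddr2line` prints the file and line the offset falls on, which helps confirm whether multiple oopses share one root cause.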
Sarah Newman
2017-Nov-08 00:57 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
On 11/07/2017 03:12 PM, Nathan March wrote:
> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
> stability issues on our hypervisors. I'm not clear if there's a singular
> root cause here, or if I'm dealing with multiple bugs.
>
> One of the more common ones I've seen, is a VM on shutdown will remain in
> the null state and a kernel bug is thrown:
>
> xen001 log # xl list
>
> Name                                        ID   Mem VCPUs      State   Time(s)
> Domain-0                                     0  6144    24     r-----    6639.7
> (null)                                       3     0     1     --pscd      36.3
>
> [89920.839074] BUG: unable to handle kernel paging request at
> ffff88020ee9a000
>
<snip>
> This is on xen 4.6.6-4.el6 with 4.9.58-29.el6.x86_64. I see these issues
> across a wide number of systems with from both Dell and Supermicro, although
> we run the same Intel x540 10gb nic's in each system with the same netapp
> nfs backend storage.

We don't use NFS and have not seen the exact same issue.

--Sarah
Sarah Newman
2017-Nov-08 01:00 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
On 11/07/2017 04:57 PM, Sarah Newman wrote:
> On 11/07/2017 03:12 PM, Nathan March wrote:
>> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
>> stability issues on our hypervisors. I'm not clear if there's a singular
>> root cause here, or if I'm dealing with multiple bugs.
>>
>> One of the more common ones I've seen, is a VM on shutdown will remain in
>> the null state and a kernel bug is thrown:
>>
>> xen001 log # xl list
>>
>> Name                                        ID   Mem VCPUs      State   Time(s)
>> Domain-0                                     0  6144    24     r-----    6639.7
>> (null)                                       3     0     1     --pscd      36.3
>>
>> [89920.839074] BUG: unable to handle kernel paging request at
>> ffff88020ee9a000
>>
> <snip>
>
>> This is on xen 4.6.6-4.el6 with 4.9.58-29.el6.x86_64. I see these issues
>> across a wide number of systems with from both Dell and Supermicro, although
>> we run the same Intel x540 10gb nic's in each system with the same netapp
>> nfs backend storage.
>
> We don't use NFS and have not seen the exact same issue.

Additionally, we aren't using xen 4.6 any more, we're using 4.8, but we didn't see issues like that when we were using xen 4.6. We're also still on 4.9.39. You might try an older kernel or a newer version of xen in addition to looking for nfs-specific issues.

--Sarah
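Before bisecting along the lines Sarah suggests, it helps to record exactly which components are in play and compare against a known-good combination. A minimal sketch (the version numbers come from this thread; the `yum` line is illustrative only, since exact package names depend on the repo in use):

```shell
#!/bin/sh
# Sketch: record the hypervisor stack versions, then compare the running
# kernel against a last-known-good version using sort -V.
echo "xen:    $(rpm -q xen 2>/dev/null || echo 'rpm query unavailable')"
echo "kernel: $(uname -r)"

GOOD=4.9.39                       # known-good per Sarah's reply
CUR=$(uname -r | cut -d- -f1)     # strip the release suffix

# sort -V orders version strings numerically component by component.
NEWEST=$(printf '%s\n%s\n' "$GOOD" "$CUR" | sort -V | tail -n1)
if [ "$NEWEST" = "$CUR" ] && [ "$CUR" != "$GOOD" ]; then
    echo "running kernel is newer than $GOOD; candidate for rollback"
fi

# Illustrative only -- package naming varies by repository:
# yum downgrade kernel-4.9.39    # or install the older kernel side by side
```

Keeping the old kernel installed alongside the new one makes it easy to flip back via the bootloader if the downgrade does not help.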
George Dunlap
2017-Nov-15 15:09 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
Nathan,

Thanks for the report. Would you mind re-posting this to the xen-users mailing list? You're much more likely to get someone there who's seen such a bug before.

 -George

On Tue, Nov 7, 2017 at 11:12 PM, Nathan March <nathan at gt.net> wrote:
> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
> stability issues on our hypervisors. I'm not clear if there's a singular
> root cause here, or if I'm dealing with multiple bugs.
>
> One of the more common ones I've seen, is a VM on shutdown will remain in
> the null state and a kernel bug is thrown:
>
> xen001 log # xl list
>
> Name                                        ID   Mem VCPUs      State   Time(s)
> Domain-0                                     0  6144    24     r-----    6639.7
> (null)                                       3     0     1     --pscd      36.3
>
> [89920.839074] BUG: unable to handle kernel paging request at
> ffff88020ee9a000
>
<snip>
> This is on xen 4.6.6-4.el6 with 4.9.58-29.el6.x86_64. I see these issues
> across a wide number of systems with from both Dell and Supermicro, although
> we run the same Intel x540 10gb nic's in each system with the same netapp
> nfs backend storage.
>
> Cheers,
> Nathan
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
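Returning to the "(null)" domain from the original report: that is a zombie domain whose resources were never fully released. A minimal inspection sketch (assumes standard `xl`/Xen tooling on dom0; the sample line is the `xl list` output captured in the report, and the ID-extraction `awk` is illustrative):

```shell
#!/bin/sh
# Sketch: find and try to reap a domain stuck in the "(null)" state.
# On a live dom0 you would pipe `xl list` instead of echoing a captured line.
SAMPLE='(null)                                       3     0     1     --pscd      36.3'

# Extract the ID of any "(null)" domain (column 2 of xl list output).
DOMID=$(echo "$SAMPLE" | awk '$1 == "(null)" { print $2 }')
echo "zombie domain id: $DOMID"

if command -v xl >/dev/null 2>&1 && [ -n "$DOMID" ]; then
    xl destroy "$DOMID"      # often fails if something still holds the domain's pages
    # A zombie typically persists because a backend driver (blkback/netback)
    # still references its grant pages; the hypervisor 'q' debug key dumps
    # per-domain info to the Xen console, readable back via xl dmesg.
    xl debug-keys q && xl dmesg | tail -n 40
fi
```

The `debug-keys q` dump shows outstanding page references per domain, which points at whichever backend is pinning the zombie.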