Nathan March
2017-Nov-07 23:12 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
Since moving from Xen 4.4 to 4.6, I've been seeing an increasing number of stability issues on our hypervisors. I'm not clear whether there's a single root cause here, or whether I'm dealing with multiple bugs.

One of the more common ones I've seen is that a VM will remain in the (null) state on shutdown and a kernel bug is thrown:

xen001 log # xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  6144    24     r-----    6639.7
(null)                                       3     0     1     --pscd      36.3

[89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
[89920.839546] IP: [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.839933] PGD 2008067
[89920.840022] PUD 17f43f067
[89920.840390] PMD 1e0976067
[89920.840469] PTE 0
[89920.840833]
[89920.841123] Oops: 0000 [#1] SMP
[89920.841417] Modules linked in: ebt_ip ebtable_filter ebtables arptable_filter arp_tables bridge xen_pciback xen_gntalloc nfsd auth_rpcgss nfsv3 nfs_acl nfs fscache lockd sunrpc grace 8021q mrp garp stp llc bonding xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd dcdbas fjes pcspkr ipmi_devintf ipmi_si ipmi_msghandler joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei ioatdma ixgbe mdio igb dca ptp pps_core uas usb_storage wmi ttm
[89920.847080] CPU: 4 PID: 1471 Comm: loop6 Not tainted 4.9.58-29.el6.x86_64 #1
[89920.847381] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.7.1 03/04/2015
[89920.847893] task: ffff8801b75e0700 task.stack: ffffc900460e0000
[89920.848192] RIP: e030:[<ffffffff81430922>] [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.848783] RSP: e02b:ffffc900460e3b20 EFLAGS: 00010246
[89920.849081] RAX: ffff88018916d000 RBX: ffff8801b75e0700 RCX: 0000000000000200
[89920.849384] RDX: 0000000000000000 RSI: ffff88020ee9a000 RDI: ffff88018916d000
[89920.849686] RBP: ffffc900460e3b38 R08: ffff88011da9fcf8 R09: 0000000000000002
[89920.849989] R10: ffff88019535bddc R11: ffffea0006245b5c R12: 0000000000001000
[89920.850294] R13: ffff88018916e000 R14: 0000000000001000 R15: ffffc900460e3b68
[89920.850605] FS: 00007fb865c30700(0000) GS:ffff880204b00000(0000) knlGS:0000000000000000
[89920.851118] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[89920.851418] CR2: ffff88020ee9a000 CR3: 00000001ef03b000 CR4: 0000000000042660
[89920.851720] Stack:
[89920.852009] ffffffff814375ca ffffc900460e3b38 ffffc900460e3d08 ffffc900460e3bb8
[89920.852821] ffffffff814381c5 ffffc900460e3b68 ffffc900460e3d08 0000000000001000
[89920.853633] ffffc900460e3d88 0000000000000000 0000000000001000 ffffea0000000000
[89920.854445] Call Trace:
[89920.854741] [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
[89920.855043] [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
[89920.855354] [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
[89920.855673] [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
[89920.855992] [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
[89920.856297] [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
[89920.856599] [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
[89920.856899] [<ffffffff815fc375>] do_req_filebacked+0x195/0x300
[89920.857202] [<ffffffff815fc53b>] loop_queue_work+0x5b/0x80
[89920.857505] [<ffffffff810c6898>] kthread_worker_fn+0x98/0x1b0
[89920.857808] [<ffffffff818d9dca>] ? schedule+0x3a/0xa0
[89920.858108] [<ffffffff818ddbb6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[89920.858411] [<ffffffff810c6800>] ? kthread_probe_data+0x40/0x40
[89920.858713] [<ffffffff810c63f5>] kthread+0xe5/0x100
[89920.859014] [<ffffffff810c6310>] ? __kthread_init_worker+0x40/0x40
[89920.859317] [<ffffffff818de2d5>] ret_from_fork+0x25/0x30
[89920.859615] Code: 81 f3 00 00 00 00 e9 1e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
[89920.864410] RIP [<ffffffff81430922>] __memcpy+0x12/0x20
[89920.864749] RSP <ffffc900460e3b20>
[89920.865021] CR2: ffff88020ee9a000
[89920.865294] ---[ end trace b77d2ce5646284d1 ]---

Does anyone have advice on how to troubleshoot the above, or some insight into what the issue could be? This hypervisor was only up for a day and had almost no VMs running on it since boot. I booted a single Windows test VM, which BSOD'ed, and then this happened.

This is on xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these issues across a wide range of systems from both Dell and Supermicro, although we run the same Intel X540 10Gb NICs in each system with the same NetApp NFS backend storage.

Cheers,
Nathan
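As a starting point for reading the oops, the register dump can be checked against the faulting instruction. The sketch below is only an illustration: the register values are copied from the log above, the function name `decode_memcpy_fault` is made up for this example, and the disassembly in the comments is a hand decode of the `Code:` bytes around the marked `<f3>`, not tool output.

```python
# Values copied from the oops in this message.
REGS = {
    "RAX": 0xffff88018916d000,  # __memcpy saves the original destination here
    "RCX": 0x0000000000000200,  # quadwords still to copy
    "RDX": 0x0000000000000000,  # trailing bytes (length & 7)
    "RSI": 0xffff88020ee9a000,  # current source pointer
    "RDI": 0xffff88018916d000,  # current destination pointer
}
CR2 = 0xffff88020ee9a000        # faulting address reported by the CPU

# Hand decode of the bytes leading up to the fault:
#   48 89 f8      mov  %rdi,%rax
#   48 89 d1      mov  %rdx,%rcx
#   48 c1 e9 03   shr  $3,%rcx
#   83 e2 07      and  $7,%edx
#   f3 48 a5      rep movsq        <- marked <f3>, the faulting instruction


def decode_memcpy_fault(regs, cr2):
    """Summarise how far the copy got before the page fault."""
    return {
        # rep movsq moves 8 bytes per count; RDX holds the byte remainder.
        "bytes_left": regs["RCX"] * 8 + regs["RDX"],
        # RAX keeps the starting destination, RDI advances as data is copied.
        "bytes_copied": regs["RDI"] - regs["RAX"],
        # If CR2 equals RSI, the fault was taken reading the source buffer.
        "fault_on_source_read": cr2 == regs["RSI"],
    }


result = decode_memcpy_fault(REGS, CR2)
print(result)
```

Read this way, the copy was a full 4096-byte page, nothing had been copied yet, and the fault was taken on the very first read from the source address, which matches the `PTE 0` line (no mapping for that page). R12 and R14 both holding 0x1000 is consistent with a one-page copy, though treating them as the length is an assumption.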
Sarah Newman
2017-Nov-08 00:57 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
On 11/07/2017 03:12 PM, Nathan March wrote:
> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
> stability issues on our hypervisors. I'm not clear if there's a singular
> root cause here, or if I'm dealing with multiple bugs.
>
> One of the more common ones I've seen, is a VM on shutdown will remain in
> the null state and a kernel bug is thrown:
>
> xen001 log # xl list
> Name                                        ID   Mem VCPUs      State   Time(s)
> Domain-0                                     0  6144    24     r-----    6639.7
> (null)                                       3     0     1     --pscd      36.3
>
> [89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
>
<snip>
> This is on xen 4.6.6-4.el6 with 4.9.58-29.el6.x86_64. I see these issues
> across a wide number of systems with from both Dell and Supermicro, although
> we run the same Intel x540 10gb nic's in each system with the same netapp
> nfs backend storage.

We don't use NFS and have not seen the exact same issue.

--Sarah
Sarah Newman
2017-Nov-08 01:00 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
On 11/07/2017 04:57 PM, Sarah Newman wrote:
> On 11/07/2017 03:12 PM, Nathan March wrote:
>> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
>> stability issues on our hypervisors. I'm not clear if there's a singular
>> root cause here, or if I'm dealing with multiple bugs.
>>
>> One of the more common ones I've seen, is a VM on shutdown will remain in
>> the null state and a kernel bug is thrown:
>>
>> xen001 log # xl list
>> Name                                        ID   Mem VCPUs      State   Time(s)
>> Domain-0                                     0  6144    24     r-----    6639.7
>> (null)                                       3     0     1     --pscd      36.3
>>
>> [89920.839074] BUG: unable to handle kernel paging request at ffff88020ee9a000
>>
> <snip>
>
>> This is on xen 4.6.6-4.el6 with 4.9.58-29.el6.x86_64. I see these issues
>> across a wide number of systems with from both Dell and Supermicro, although
>> we run the same Intel x540 10gb nic's in each system with the same netapp
>> nfs backend storage.
>
> We don't use NFS and have not seen the exact same issue.

Additionally we aren't using xen 4.6 any more, we're using 4.8, but we didn't see issues like that when we were using xen 4.6. We're also still on 4.9.39. You might try an older kernel or a newer version of xen in addition to looking for nfs specific issues.

--Sarah
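One way to act on the suggestion to look for NFS-specific issues is to check which frames of the reported call trace belong to a loadable module and which are only unwinder guesses (the ones prefixed with `?`). The following is a rough sketch, not an established tool: the trace lines are copied from the original report, while the regular expression and the helper name `parse_frames` are assumptions made for this example.

```python
import re

# Part of the call trace from the original report.
TRACE = """\
[89920.854741] [<ffffffff814375ca>] ? memcpy_from_page+0x3a/0x70
[89920.855043] [<ffffffff814381c5>] iov_iter_copy_from_user_atomic+0x265/0x290
[89920.855354] [<ffffffff811cf633>] generic_perform_write+0xf3/0x1d0
[89920.855673] [<ffffffff8101e39a>] ? xen_load_tls+0xaa/0x160
[89920.855992] [<ffffffffc025cf2b>] nfs_file_write+0xdb/0x200 [nfs]
[89920.856297] [<ffffffff81269062>] vfs_iter_write+0xa2/0xf0
[89920.856599] [<ffffffff815fa365>] lo_write_bvec+0x65/0x100
[89920.856899] [<ffffffff815fc375>] do_req_filebacked+0x195/0x300
[89920.857202] [<ffffffff815fc53b>] loop_queue_work+0x5b/0x80
"""

# "[<addr>] [? ]func+0xoff/0xsize [module]" as printed by this 4.9 kernel.
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return one dict per stack frame found in the trace text."""
    frames = []
    for line in text.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append({
                "func": match.group("func"),
                "module": match.group("module"),      # None for built-in code
                "reliable": match.group("guess") is None,
            })
    return frames


frames = parse_frames(TRACE)
for frame in frames:
    if frame["reliable"]:
        print(frame["func"], frame["module"] or "(built-in)")
```

On the trace above, the reliable frames run from `loop_queue_work` through `lo_write_bvec` and `vfs_iter_write` into `nfs_file_write`, so the faulting write comes from a loop device whose backing file lives on NFS. That supports treating NFS as a suspect, but it does not by itself show that NFS is at fault.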
George Dunlap
2017-Nov-15 15:09 UTC
[CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state
Nathan,

Thanks for the report. Would you mind re-posting this to the xen-users mailing list? You're much more likely to get someone there who's seen such a bug before.

-George