I've run into a problem on 3.1.2 with an HVM guest using PV disks. In
dom0, the physical disk is accessed using iSCSI. The symptom is that
applications in dom0 which are monitoring the iSCSI network interface
(e.g. tcpdump) die with EFAULT errors.

When the block I/O completes, it looks like blkback is doing a
GNTTABOP_unmap_grant_ref on a guest page, even though the dom0 kernel
has done get_page() on it and still holds references.

The page had been passed through iSCSI into the network stack, so it
ends up referenced by one or more skb's. Because there was an AF_PACKET
socket open, a clone of the skb ends up queued for an indeterminate
amount of time on that socket queue. When the application finally gets
around to reading the data, the page is no longer mapped, and the read
fails trying to copy the data out of the kernel.

Has anyone else seen anything similar? I mentioned tcpdump, but the
problem also shows up with dhcpcd, which needs to process packets at the
ethernet layer.

I'm thinking blkback will have to make a dom0 copy of the page before
doing the unmap if there are still extra references?

Gary

-- 
Gary Grebus
Virtual Iron Software, Inc.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Gary Grebus wrote:
> I've run into a problem on 3.1.2 with an HVM guest using PV disks. In
> dom0, the physical disk is accessed using iSCSI. The symptom is that
> applications in dom0 which are monitoring the iSCSI network interface
> (e.g. tcpdump) die with EFAULT errors.
>
> When the block I/O completes, it looks like blkback is doing a
> GNTTABOP_unmap_grant_ref on a guest page, even though the dom0 kernel
> has done get_page() on it and still holds references.
>
> The page had been passed through iSCSI into the network stack, so it
> ends up referenced by one or more skb's. Because there was an AF_PACKET
> socket open, a clone of the skb ends up queued for an indeterminate
> amount of time on that socket queue. When the application finally gets
> around to reading the data, the page is no longer mapped, and the read
> fails trying to copy the data out of the kernel.
>
> Has anyone else seen anything similar? I mentioned tcpdump, but the
> problem also shows up with dhcpcd, which needs to process packets at the
> ethernet layer.
>
> I'm thinking blkback will have to make a dom0 copy of the page before
> doing the unmap if there are still extra references?

I'm running the same setup. Are you using iSCSI over the same interface
as your Xen bridge?

Stefan

Hi Gary,

On Fri, Feb 08, 2008 at 02:54:14PM -0500, Gary Grebus wrote:
> I've run into a problem on 3.1.2 with an HVM guest using PV disks. In
> dom0, the physical disk is accessed using iSCSI. The symptom is that
> applications in dom0 which are monitoring the iSCSI network interface
> (e.g. tcpdump) die with EFAULT errors.
>
> When the block I/O completes, it looks like blkback is doing a
> GNTTABOP_unmap_grant_ref on a guest page, even though the dom0 kernel
> has done get_page() on it and still holds references.
>
> The page had been passed through iSCSI into the network stack, so it
> ends up referenced by one or more skb's. Because there was an AF_PACKET
> socket open, a clone of the skb ends up queued for an indeterminate
> amount of time on that socket queue. When the application finally gets
> around to reading the data, the page is no longer mapped, and the read
> fails trying to copy the data out of the kernel.
>
> Has anyone else seen anything similar? I mentioned tcpdump, but the
> problem also shows up with dhcpcd, which needs to process packets at the
> ethernet layer.
>

We're seeing the same thing with 3.1.3. When running iSCSI in dom0
(over a Xen bridge) and presenting these via blkfront to the guest, we
see the same crash (below) while performing failover tests on the
storage controller.

Just as you said, the error occurs in skb_remove_foreign_references from
loopback_start_xmit. It's running through all the foreign pages,
attempting to copy each locally, when it dies on the source address (esi)
of the following memcpy:

    vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]);
    off = skb_shinfo(skb)->frags[i].page_offset;
    memcpy(page_address(page) + off,
           vaddr + off,
           skb_shinfo(skb)->frags[i].size);

c053f2f7:  0f b7 74 c8 18   movzwl 0x18(%eax,%ecx,8),%esi
c053f2fc:  0f b7 5c c8 1a   movzwl 0x1a(%eax,%ecx,8),%ebx
c053f301:  8b 44 24 0c      mov    0xc(%esp),%eax
c053f305:  e8 ba 09 f1 ff   call   0xc044fcc4 page_address
c053f30a:  89 d9            mov    %ebx,%ecx
c053f30c:  c1 e9 02         shr    $0x2,%ecx
c053f30f:  8d 3c 30         lea    (%eax,%esi,1),%edi
c053f312:  03 74 24 04      add    0x4(%esp),%esi
c053f316:  f3 a5            rep movsl %ds:(%esi),%es:(%edi)   <<<<< memcpy

ds: 007b   esi: c0df7000   es: 007b   edi: ebffb000

It seems one of the skb->frags has been unmapped.

> I'm thinking blkback will have to make a dom0 copy of the page before
> doing the unmap if there are still extra references?
>

Can the unmap be deferred, handled by the last reference holder? Or
does this open up a potential security hole?

Thanks
kurt

Kurt Hackel
Oracle Corp.

==========================================
BUG: unable to handle kernel paging request at virtual address c0df7000
 printing eip:
c053f316
36d4c000 -> *pde = 00000000:c4237027
36c37000 -> *pme = 00000001:1bd14067
00d14000 -> *pte = 00000000:00000000
Oops: 0000 [#1]
SMP
Modules linked in: xt_physdev bridge autofs4 sunrpc dm_round_robin
 ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state ip_conntrack
 nfnetlink iptable_filter ip_tables x_tables ib_iser rdma_cm ib_addr
 ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi
 ocfs2(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs dm_mirror
 dm_multipath dm_mod video sbs i2c_ec button battery asus_acpi ac
 parport_pc lp parport joydev sg i2c_piix4 i2c_core pcspkr k8_edac
 edac_mc tg3 ide_cd serio_raw serial_core cdrom qla2xxx
 scsi_transport_fc sata_svw libata mptspi mptscsih mptbase
 scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    3
EIP:    0061:[<c053f316>]    Not tainted VLI
EFLAGS: 00010286   (2.6.18-8.1.6.0.18.el5xen #1)
EIP is at loopback_start_xmit+0x107/0x2bd
eax: ebffb000   ebx: 00000578   ecx: 0000015e   edx: c065c800
esi: c0df7000   edi: ebffb000   ebp: f1134ea8   esp: c0701e6c
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, ti=c0701000 task=f77c05a0 task.ti=c0d2f000)
Stack: c9a13c00 c0df7000 00000001 c157ff60 c9a13800 f1134ea8 c9a13980 c9a13800
       c059fc02 c9a13800 f1134ea8 c9a13980 0000000e c05a1768 c0dcf824 c0dcf800
       f1134ea8 c05a5cfc c9a13800 ed20e040 00001fc2 00000000 f48703d4 f48703e8
Call Trace:
 [<c059fc02>] dev_hard_start_xmit+0x198/0x1ee
 [<c05a1768>] dev_queue_xmit+0x24c/0x2e8
 [<c05a5cfc>] neigh_resolve_output+0x1b7/0x1e1
 [<c05bea8b>] ip_output+0x1c0/0x1f9
 [<c05be309>] ip_queue_xmit+0x390/0x3cf
 [<c059fc02>] dev_hard_start_xmit+0x198/0x1ee
 [<c05adbe6>] __qdisc_run+0x30/0x19a
 [<c05a17e6>] dev_queue_xmit+0x2ca/0x2e8
 [<f8640d48>] br_dev_queue_push_xmit+0x15b/0x17e [bridge]
 [<c05cbc6f>] tcp_transmit_skb+0x5e4/0x612
 [<f8641945>] br_handle_frame+0x146/0x15d [bridge]
 [<c05cc9ad>] tcp_retransmit_skb+0x4b7/0x595
 [<c05c5baf>] tcp_enter_loss+0x1a2/0x1ff
 [<c05cee58>] tcp_write_timer+0x3ff/0x5d3
 [<c05cea59>] tcp_write_timer+0x0/0x5d3
 [<c0427146>] run_timer_softirq+0x120/0x19b
 [<c0423162>] __do_softirq+0x73/0xe8
 [<c0406dda>] do_softirq+0x6e/0x102
 ======================
 [<c0406d63>] do_IRQ+0xa5/0xae
 [<c052f040>] evtchn_do_upcall+0x85/0xde
 [<c04056a1>] hypervisor_callback+0x3d/0x45
 [<c040800e>] raw_safe_halt+0xc2/0xe8
 [<c040442a>] xen_idle+0x43/0x4f
 [<c04033b0>] cpu_idle+0xa1/0xbb
Code: 24 08 89 44 24 04 8b 85 a4 00 00 00 0f b7 74 c8 18 0f b7 5c c8 1a
 8b 44 24 0c e8 ba 09 f1 ff 89 d9 c1 e9 02 8d 3c 30 03 74 24 04 <f3> a5
 89 d9 83 e1 03 74 02 f3 a4 8b 44 24 04 ba 05 00 00 00 e8
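
The register dump in the oops is consistent with the copy faulting on its
very first dword: ebx (the fragment size) is 0x578, ecx is still
0x578 >> 2 = 0x15e, and edi still equals the page_address() result left
in eax. A minimal sketch of that arithmetic in user-space C follows; the
helper names are illustrative only and do not come from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Dwords that "rep movsl" is asked to copy: the "shr $0x2,%ecx" step. */
static uint32_t movsl_dwords(uint32_t frag_size)
{
    return frag_size >> 2;
}

/* Bytes already copied when the fault hit, given ecx from the oops. */
static uint32_t bytes_copied_at_fault(uint32_t frag_size, uint32_t ecx_at_fault)
{
    uint32_t total = movsl_dwords(frag_size);

    assert(ecx_at_fault <= total);  /* rep only ever counts ecx down */
    return (total - ecx_at_fault) * 4u;
}
```

With the values from the oops, bytes_copied_at_fault(0x578, 0x15e) is 0,
so the first read from esi = c0df7000 is the one that faulted, which
matches the all-zero PTE reported for that address.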

On 9/2/08 06:15, "Kurt Hackel" <kurt.hackel@oracle.com> wrote:

>> I'm thinking blkback will have to make a dom0 copy of the page before
>> doing the unmap if there are still extra references?
>
> Can the unmap be deferred, handled by the last reference holder? Or
> does this open up a potential security hole?

netback already does this kind of reference counting. It oughtn't to be hard
to check the page reference count in the blkback I/O completion handler and,
if non-zero, set up a callback for when the count does fall to zero. And
defer responding to the frontend until that time. Netback is even more
sophisticated in that it also sets a timeout and, if the page languishes for
too long with non-zero count, it's able to forcibly copy-and-release the
page. I don't think we need to go that far for blkback, however.

 -- Keir
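
The deferral described above can be sketched as a small user-space C
model. This is only an illustration under stated assumptions: struct
sim_page and the sim_* functions are hypothetical stand-ins, not blkback
or netback code, and a real handler would hook into the kernel's page
release path rather than being called by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state; none of these names exist in blkback. */
struct sim_page {
    int  extra_refs;  /* dom0 references other than blkback's own */
    bool io_done;     /* the block I/O on this page has completed */
    bool mapped;      /* the grant mapping is still in place */
    bool responded;   /* the frontend has been told the I/O is done */
};

/* Unmap and respond only once the I/O is done and nobody else holds the page. */
static bool sim_try_finish(struct sim_page *pg)
{
    assert(pg != NULL);
    if (!pg->io_done || pg->extra_refs > 0 || pg->responded)
        return false;
    pg->mapped = false;     /* stands in for GNTTABOP_unmap_grant_ref */
    pg->responded = true;   /* stands in for the response to blkfront */
    return true;
}

/* I/O completion handler: finish now, or leave it to the last holder. */
static bool sim_io_complete(struct sim_page *pg)
{
    assert(pg != NULL);
    pg->io_done = true;
    return sim_try_finish(pg);
}

/* Callback for when some other holder drops its reference. */
static bool sim_put_ref(struct sim_page *pg)
{
    assert(pg != NULL && pg->extra_refs > 0);
    pg->extra_refs--;
    return sim_try_finish(pg);
}
```

With one extra reference held, sim_io_complete() returns false and the
page stays mapped; the later sim_put_ref() performs the unmap and the
response. The cost is visible in the model: the frontend hears nothing
until the last reference goes away.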

On Fri, 2008-02-08 at 22:15 -0800, Kurt Hackel wrote:
> Hi Gary,
>
> On Fri, Feb 08, 2008 at 02:54:14PM -0500, Gary Grebus wrote:
> > I've run into a problem on 3.1.2 with an HVM guest using PV disks. In
> > dom0, the physical disk is accessed using iSCSI. The symptom is that
> > applications in dom0 which are monitoring the iSCSI network interface
> > (e.g. tcpdump) die with EFAULT errors.
> > ...
> >
>
> We're seeing the same thing with 3.1.3. When running iSCSI in dom0
> (over a Xen bridge) and presenting these via blkfront to the guest, we
> see the same crash (below) while performing failover tests on the
> storage controller.
>
> Just as you said, the error occurs in skb_remove_foreign_references from
> loopback_start_xmit. It's running through all the foreign pages,
> attempting to copy each locally, when it dies on the source address (esi)
> of the following memcpy:

That's a different failure from the one I see, but it looks like the same
underlying cause. Our test used a dedicated iSCSI NIC, so netback wasn't
involved. I haven't looked at how netback handles the mapped pages.

>
>     vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]);
>     off = skb_shinfo(skb)->frags[i].page_offset;
>     memcpy(page_address(page) + off,
>            vaddr + off,
>            skb_shinfo(skb)->frags[i].size);
>
> c053f2f7:  0f b7 74 c8 18   movzwl 0x18(%eax,%ecx,8),%esi
> c053f2fc:  0f b7 5c c8 1a   movzwl 0x1a(%eax,%ecx,8),%ebx
> c053f301:  8b 44 24 0c      mov    0xc(%esp),%eax
> c053f305:  e8 ba 09 f1 ff   call   0xc044fcc4 page_address
> c053f30a:  89 d9            mov    %ebx,%ecx
> c053f30c:  c1 e9 02         shr    $0x2,%ecx
> c053f30f:  8d 3c 30         lea    (%eax,%esi,1),%edi
> c053f312:  03 74 24 04      add    0x4(%esp),%esi
> c053f316:  f3 a5            rep movsl %ds:(%esi),%es:(%edi)   <<<<< memcpy
>
> ds: 007b   esi: c0df7000   es: 007b   edi: ebffb000
>
> It seems one of the skb->frags has been unmapped.
>
> > I'm thinking blkback will have to make a dom0 copy of the page before
> > doing the unmap if there are still extra references?
> >
>
> Can the unmap be deferred, handled by the last reference holder? Or
> does this open up a potential security hole?
>

When the initial block I/O completes, blkfront is going to remove the
grant, so I think you would have to defer notifying blkfront as well.
That doesn't seem workable, since the guest could see the I/O take an
extremely long time, and trigger some timeout. I think there has to be
a copy made at some point.

/gary

On Sat, 2008-02-09 at 08:07 +0000, Keir Fraser wrote:
> On 9/2/08 06:15, "Kurt Hackel" <kurt.hackel@oracle.com> wrote:
>
> >> I'm thinking blkback will have to make a dom0 copy of the page before
> >> doing the unmap if there are still extra references?
> >
> > Can the unmap be deferred, handled by the last reference holder? Or
> > does this open up a potential security hole?
>
> netback already does this kind of reference counting. It oughtn't to be hard
> to check the page reference count in the blkback I/O completion handler and,
> if non-zero, set up a callback for when the count does fall to zero. And
> defer responding to the frontend until that time. Netback is even more
> sophisticated in that it also sets a timeout and, if the page languishes for
> too long with non-zero count, it's able to forcibly copy-and-release the
> page. I don't think we need to go that far for blkback, however.

In the failure I'm seeing, the skb could sit on a socket queue
indefinitely. The application reading the socket could be blocked for
some other reason. blkback can't defer responding to blkfront
(completing the guest I/O).

I think blkback needs to assume that a completion with a non-zero page
reference count means it needs to make a copy, or implement a timeout
like netback.

Gary

-- 
Gary Grebus
Virtual Iron Software, Inc.
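
A minimal sketch of the copy-on-completion alternative, as a user-space C
model rather than kernel code. struct sim_page, SIM_PAGE_SIZE and the
sim_* functions are hypothetical, and redirecting a data pointer here
merely stands in for whatever mechanism would hand the remaining holders
a dom0-owned page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096

/* Hypothetical model of a granted page; not a real blkback structure. */
struct sim_page {
    unsigned char *data;   /* what the remaining dom0 holders dereference */
    int  extra_refs;       /* dom0 references other than blkback's own */
    bool grant_mapped;     /* the guest page is still mapped into dom0 */
    bool owns_copy;        /* data points at a dom0 copy we must free */
};

/*
 * Completion handler: if anyone still holds the page, give them a dom0
 * copy first, then drop the grant mapping so the guest I/O can complete
 * immediately. Returns 0 on success, -1 if the copy could not be made
 * (the mapping is then left in place).
 */
static int sim_complete_with_copy(struct sim_page *pg)
{
    assert(pg != NULL && pg->grant_mapped);
    if (pg->extra_refs > 0) {
        unsigned char *copy = malloc(SIM_PAGE_SIZE);

        if (copy == NULL)
            return -1;
        memcpy(copy, pg->data, SIM_PAGE_SIZE);
        pg->data = copy;
        pg->owns_copy = true;
    } else {
        pg->data = NULL;        /* nobody is left to read it */
    }
    pg->grant_mapped = false;   /* stands in for the grant unmap */
    return 0;
}

/* Last holder is done: free the dom0 copy, if one was made. */
static void sim_release(struct sim_page *pg)
{
    assert(pg != NULL);
    if (pg->owns_copy)
        free(pg->data);
    pg->data = NULL;
    pg->owns_copy = false;
}
```

In this model the copy costs one page of memory and one page-sized memcpy
per affected request, but only when extra references exist at completion
time; a request with no extra holders takes the else branch and is
unmapped without copying.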

On 11/2/08 15:26, "Gary Grebus" <ggrebus@virtualiron.com> wrote:

>> netback already does this kind of reference counting. It oughtn't to be hard
>> to check the page reference count in the blkback I/O completion handler and,
>> if non-zero, set up a callback for when the count does fall to zero. And
>> defer responding to the frontend until that time. Netback is even more
>> sophisticated in that it also sets a timeout and, if the page languishes for
>> too long with non-zero count, it's able to forcibly copy-and-release the
>> page. I don't think we need to go that far for blkback, however.
>
> In the failure I'm seeing, the skb could sit on a socket queue
> indefinitely. The application reading the socket could be blocked for
> some other reason. blkback can't defer responding to blkfront
> (completing the guest I/O).
>
> I think blkback needs to assume that a completion with a non-zero page
> reference count means it needs to make a copy, or implement a timeout
> like netback.

Either way, most of the infrastructure you need should be there, and you can
crib from netback to work out how to use it.

 -- Keir