Dante Cinco
2010-Nov-11 01:16 UTC
[Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
We have Fibre Channel HBA devices that we PCI passthrough to our pvops domU kernel. Without swiotlb=force in the domU's kernel command line, both domU and dom0 lock up after loading the kernel module drivers for the HBA devices. With swiotlb=force, the domU and dom0 are stable after loading the kernel module drivers, but the I/O performance is at least an order of magnitude worse than what we were seeing with the HVM kernel. I see the following in /var/log/kern.log in the pvops domU:

PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
software IO TLB at phys 0x5800000 - 0x9800000

Is swiotlb=force responsible for the I/O performance degradation? I don't understand what swiotlb=force does, so I would appreciate an explanation or a pointer. Thanks.

- Dante
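For context, swiotlb=force here is passed on the PV domU kernel command line; with xm that is typically the extra= line of the guest config. An illustrative fragment (paths and PCI BDFs are placeholders):

    # /etc/xen/pv-domU.cfg - illustrative PV guest config; paths and BDFs are placeholders
    kernel  = "/boot/vmlinuz-xen-pcifront-0.8.2"
    ramdisk = "/boot/initrd-xen-pcifront-0.8.2.img"
    extra   = "console=hvc0 swiotlb=force"      # appended to the domU kernel command line
    pci     = [ '08:00.0', '08:00.1' ]          # passed-through HBA functions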
Konrad Rzeszutek Wilk
2010-Nov-11 16:04 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Wed, Nov 10, 2010 at 05:16:14PM -0800, Dante Cinco wrote:
> We have Fibre Channel HBA devices that we PCI passthrough to our pvops
> domU kernel. Without swiotlb=force in the domU's kernel command line,
> both domU and dom0 lock up after loading the kernel module drivers for
> the HBA devices. With swiotlb=force, the domU and dom0 are stable

Whoa. That is not good - what happens if you just pass in iommu=soft?
Does the "PCI-DMA: Using..." message show up if you don't pass in any of those parameters?
(I don't think it does, but just doing 'iommu=soft' should enable it.)

> after loading the kernel module drivers but the I/O performance is at
> least an order of magnitude worse than what we were seeing with the
> HVM kernel. I see the following in /var/log/kern.log in the pvops
> domU:
>
> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
> software IO TLB at phys 0x5800000 - 0x9800000
>
> Is swiotlb=force responsible for the I/O performance degradation? I
> don't understand what swiotlb=force does so I would appreciate an
> explanation or a pointer.

So, you should only need to use 'iommu=soft'. It will enable the Linux kernel IOMMU code
to translate the pseudo-PFNs to the real machine frame numbers (bus addresses).

If your card is 64-bit, then that is all it would do. If however your card is 32-bit
and you are DMA-ing data from above the 32-bit limit, it would copy the user-space page
to memory below 4GB, DMA that, and when done, copy it back to where the user-space
page is. This is called bounce-buffering, and this is why you would use a mix of
pci_map_page and pci_sync_single_for_[cpu|device] calls around your driver.

However, I think your cards are 64-bit, so you don't need this bounce-buffering. But
if you say 'swiotlb=force' it will force _all_ DMAs to go through the bounce-buffer.

So, try just 'iommu=soft' and see what happens.
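As a reference for the pattern described above, a streaming DMA mapping with the bounce-buffer-aware pci_* API looks roughly like this. A minimal, illustrative sketch against the 2.6.3x-era API; the function name and the step that programs 'bus' into the device are placeholders:

    #include <linux/pci.h>

    /* Illustrative only: map one page for device-to-memory DMA, let the
     * device fill it, then hand the data to the CPU.  With swiotlb=force
     * every mapping below goes through a bounce buffer, which is where
     * the extra copying (and the slowdown) comes from. */
    static int example_dma_one_page(struct pci_dev *pdev, struct page *page,
                                    unsigned long offset, size_t len)
    {
            dma_addr_t bus = pci_map_page(pdev, page, offset, len,
                                          PCI_DMA_FROMDEVICE);
            if (pci_dma_mapping_error(pdev, bus))
                    return -EIO;

            /* program 'bus' (not virt_to_phys()!) into the device's DMA
             * engine and wait for completion here */

            /* sync the (possibly bounced) buffer before the CPU reads it */
            pci_dma_sync_single_for_cpu(pdev, bus, len, PCI_DMA_FROMDEVICE);
            /* ... examine the data ... */
            pci_dma_sync_single_for_device(pdev, bus, len, PCI_DMA_FROMDEVICE);

            pci_unmap_page(pdev, bus, len, PCI_DMA_FROMDEVICE);
            return 0;
    }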
Dante Cinco
2010-Nov-11 18:31 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
Konrad,

Without swiotlb=force, I don't see "PCI-DMA: Using software bounce buffering for IO" in /var/log/kern.log.

With iommu=soft and without swiotlb=force, I see the "software bounce buffering" in /var/log/kern.log and an NMI (see below) when I load the kernel module drivers. I made sure the NMI is reproducible and not a one-time event.

/var/log/kern.log (iommu=soft):
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
software IO TLB at phys 0x5800000 - 0x9800000

(XEN)
(XEN)
(XEN) NMI - I/O ERROR
(XEN) ----[ Xen-4.1-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU:   0
(XEN) RIP:   e008:[<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
(XEN) RFLAGS: 0000000000000012  CONTEXT: hypervisor
(XEN) rax: 0000000000000080  rbx: ffff82c480287c48  rcx: 0000000000000000
(XEN) rdx: 0000000000000080  rsi: 0000000000000080  rdi: ffff82c480287c48
(XEN) rbp: ffff82c480287c78  rsp: ffff82c480287c38  r8:  0000000000000000
(XEN) r9:  0000000000000037  r10: 0000ffff0000ffff  r11: 00ff00ff00ff00ff
(XEN) r12: ffff82c48029f080  r13: 0000000000000001  r14: 0000000000000008
(XEN) r15: ffff82c4802b0c20  cr0: 000000008005003b  cr4: 00000000000026f0
(XEN) cr3: 00000001250a9000  cr2: 00007f6165ae9428
(XEN) ds: 0000  es: 0000  fs: 0000  gs: 0000  ss: e010  cs: e008
(XEN) Xen stack trace from rsp=ffff82c480287c38:
(XEN)    ffff82c480287c78 ffff82c48012001f 0000000000000100 0000000000000000
(XEN)    ffff82c480287ca8 ffff83011dadd8b0 ffff83019fffa9d0 ffff82c4802c2300
(XEN)    ffff82c480287cc8 ffff82c480117d0d ffff82c48029f080 0000000000000001
(XEN)    0000000000000100 0000000000000000 0000000000000002 ffff8300df606000
(XEN)    000000411de66867 ffff82c4802c2300 ffff82c480287d28 ffff82c48011f299
(XEN)    0000000000000100 0000000000000086 ffff83019e3fa000 ffff83011dadd8b0
(XEN)    ffff83019fffa9d0 ffff8300df606000 0000000000000000 0000000000000000
(XEN)    000000000000007f ffff83019fe02200 ffff82c480287d38 ffff82c48011f6ea
(XEN)    ffff82c480287d58 ffff82c48014e4c1 ffff83011dae2000 0000000000000066
(XEN)    ffff82c480287d68 ffff82c48014e54d ffff82c480287d98 ffff82c480105d59
(XEN)    ffff82c480287da8 ffff8301616a6990 ffff83011dae2000 0000000000000000
(XEN)    ffff82c480287da8 ffff82c480105f81 ffff82c480287e28 ffff82c48015c043
(XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
(XEN)    000000000000010c 0000000000000000 0000000000000000 0000000000000002
(XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
(XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
(XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
(XEN)    ffff82c480287ee0 ffff82c480287f18 00ff00ff00ff00ff 0000ffff0000ffff
(XEN)    0000000000000000 0000000000000000 ffff82c4802c23a0 0000000000000000
(XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
(XEN) Xen call trace:
(XEN)    [<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
(XEN)    [<ffff82c480117d0d>] csched_vcpu_wake+0x2e1/0x302
(XEN)    [<ffff82c48011f299>] vcpu_wake+0x243/0x43e
(XEN)    [<ffff82c48011f6ea>] vcpu_unblock+0x4a/0x4c
(XEN)    [<ffff82c48014e4c1>] vcpu_kick+0x21/0x7f
(XEN)    [<ffff82c48014e54d>] vcpu_mark_events_pending+0x2e/0x32
(XEN)    [<ffff82c480105d59>] evtchn_set_pending+0xbf/0x190
(XEN)    [<ffff82c480105f81>] send_guest_pirq+0x54/0x56
(XEN)    [<ffff82c48015c043>] do_IRQ+0x3b2/0x59c
(XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
(XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
(XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL TRAP: vector = 2 (nmi)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

Dante

On Thu, Nov 11, 2010 at 8:04 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> [...]
Konrad Rzeszutek Wilk
2010-Nov-11 19:03 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
> Konrad,
>
> Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
> buffering for IO" in /var/log/kern.log.
>
> With iommu=soft and without swiotlb=force, I see the "software bounce
> buffering" in /var/log/kern.log and an NMI (see below) when I load the
> kernel module drivers. I made sure the NMI is reproducible and not a

What is the kernel module doing to cause this? DMA?

> one-time event.

So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d
enabled or disabled? (iommu=off,verbose) If you turn it off, does this work?

> [...]
Lin, Ray
2010-Nov-11 19:42 UTC
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
Konrad,

See my response inline below (it was marked in red in the original).

-Ray

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Konrad Rzeszutek Wilk
Sent: Thursday, November 11, 2010 11:04 AM
To: Dante Cinco
Cc: Xen-devel
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
> Konrad,
>
> Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
> buffering for IO" in /var/log/kern.log.
>
> With iommu=soft and without swiotlb=force, I see the "software bounce
> buffering" in /var/log/kern.log and an NMI (see below) when I load the
> kernel module drivers. I made sure the NMI is reproducible and not a

What is the kernel module doing to cause this? DMA?

> one-time event.

So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d enabled or disabled? (iommu=off,verbose) If you turn it off does this work?

We have IOMMU VT-d enabled. If we turn it off (iommu=off,verbose), the DMA doesn't work properly and the driver code is unable to detect the source of the interrupt. The interrupts of our device are eventually disabled by the kernel because nobody serviced them more than 100000 times:

124:  86538  0      0  0      0  0  13462  0  0  0  0  0  0  0  xen-pirq-pcifront-msi  HW_TACHYON
125:  88348  0      0  0  11652  0      0  0  0  0  0  0  0  0  xen-pirq-pcifront-msi  HW_TACHYON
126:  89335  0  10665  0      0  0      0  0  0  0  0  0  0  0  xen-pirq-pcifront-msi  HW_TACHYON
127: 100000  0      0  0      0  0      0  0  0  0  0  0  0  0  xen-pirq-pcifront-msi  HW_TACHYON

> [...]
Dante Cinco
2010-Nov-11 22:32 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
With iommu=off,verbose on the Xen command line, the pvops domU works only with swiotlb=force, and with the same performance degradation. Without swiotlb=force, there's no NMI but DMA does not work (see Ray Lin's reply on Thu 11/11/2010 11:42 AM).

The XenPCIpassthrough wiki (http://wiki.xensource.com/xenwiki/XenPCIpassthrough) talks about setting iommu=pv in order to use the hardware IOMMU (VT-d) passthrough for PV guests, but I didn't see any difference compared to my original setting (iommu=1,passthrough,no-intremap). Is iommu=pv still required for this particular pvops domU kernel (xen-pcifront-0.8.2), and if it is, what should I be looking for in the Xen log (xm dmesg) to verify its efficacy?

With my original setting (iommu=1,passthrough,no-intremap), here's what I see:

(XEN) [VT-D]dmar.c:702: Host address width 39
(XEN) [VT-D]dmar.c:717: found ACPI_DMAR_DRHD:
(XEN) [VT-D]dmar.c:413:   dmaru->address = e7ffe000
(XEN) [VT-D]iommu.c:1136: drhd->address = e7ffe000 iommu->reg = ffff82c3fff57000
(XEN) [VT-D]iommu.c:1138: cap = c90780106f0462 ecap = f0207e
(XEN) [VT-D]dmar.c:356:   IOAPIC: 0:1e.1
(XEN) [VT-D]dmar.c:356:   IOAPIC: 0:13.0
(XEN) [VT-D]dmar.c:427:   flags: INCLUDE_ALL
(XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.7
(XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df7fc000 end_address df7fdfff
(XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.0
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.1
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.2
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.3
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.0
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.2
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.4
(XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df7f5000 end_address df7fafff
(XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:341:   endpoint: 5:0.0
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.0
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.2
(XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df63e000 end_address df63ffff
(XEN) [VT-D]dmar.c:727: found ACPI_DMAR_ATSR:
(XEN) [VT-D]dmar.c:622:   atsru->all_ports: 0
(XEN) [VT-D]dmar.c:327:   bridge: 0:a.0  start = 0 sec = 7  sub = 7
(XEN) [VT-D]dmar.c:327:   bridge: 0:9.0  start = 0 sec = 8  sub = a
(XEN) [VT-D]dmar.c:327:   bridge: 0:8.0  start = 0 sec = b  sub = d
(XEN) [VT-D]dmar.c:327:   bridge: 0:7.0  start = 0 sec = e  sub = 10
(XEN) [VT-D]dmar.c:327:   bridge: 0:6.0  start = 0 sec = 18 sub = 1a
(XEN) [VT-D]dmar.c:327:   bridge: 0:5.0  start = 0 sec = 15 sub = 17
(XEN) [VT-D]dmar.c:327:   bridge: 0:4.0  start = 0 sec = 14 sub = 14
(XEN) [VT-D]dmar.c:327:   bridge: 0:3.0  start = 0 sec = 11 sub = 13
(XEN) [VT-D]dmar.c:327:   bridge: 0:2.0  start = 0 sec = 6  sub = 6
(XEN) [VT-D]dmar.c:327:   bridge: 0:1.0  start = 0 sec = 5  sub = 5
(XEN) Intel VT-d Snoop Control not enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping not enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) [VT-D]iommu.c:743: iommu_enable_translation: iommu->reg = ffff82c3fff57000

domU bringup:

(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.3
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.3
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.2
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.2
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.3
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.3
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.2
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.2
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 15:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 15:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 15:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 15:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 18:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 18:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 18:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 18:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = b:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = b:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = b:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = b:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = e:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = e:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = e:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = e:0.1
mapping kernel into physical memory
about to get started...

- Dante

On Thu, Nov 11, 2010 at 11:03 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> [...]
Dante Cinco
2010-Nov-12 01:02 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
Here's another data point: iommu=1,passthrough,no-intremap,verbose on the Xen command line together with iommu=soft on the pvops domU command line also results in an NMI (see below). Replacing iommu=soft with swiotlb=force in the pvops domU works reliably, but with the I/O performance degradation. It seems that regardless of whether the IOMMU is enabled or disabled in the hypervisor, swiotlb=force is necessary in the pvops domU.

(XEN)
(XEN) NMI - I/O ERROR
(XEN) ----[ Xen-4.1-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU:   0
(XEN) RIP:   e008:[<ffff82c48015c006>] do_IRQ+0x375/0x59c
(XEN) RFLAGS: 0000000000000002  CONTEXT: hypervisor
(XEN) rax: ffff83011dae4460  rbx: ffff8301616a6990  rcx: 000000000000010c
(XEN) rdx: 000000000000010c  rsi: 0000000000000086  rdi: 0000000000000001
(XEN) rbp: ffff82c480287e28  rsp: ffff82c480287db8  r8:  000000000000007a
(XEN) r9:  ffff8300df4d4060  r10: ffff83019fffac88  r11: 000001958595f304
(XEN) r12: ffff83011dae2000  r13: 0000000000000000  r14: 000000000000007f
(XEN) r15: ffff83019fe02200  cr0: 000000008005003b  cr4: 00000000000026f0
(XEN) cr3: 00000001261ff000  cr2: 0000000000783000
(XEN) ds: 0000  es: 0000  fs: 0000  gs: 0000  ss: e010  cs: e008
(XEN) Xen stack trace from rsp=ffff82c480287db8:
(XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
(XEN)    000000000000010c ffff830000000000 ffff82c4802c2400 0000000000000002
(XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
(XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
(XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
(XEN)    ffff82c480287ee0 ffff82c480287f18 000001958595f304 ffff83019fffac88
(XEN)    ffff8300df4d4060 ffff83019fffa9f0 ffff82c4802c23a0 0000000000000000
(XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
(XEN)    ffff82c48014e3c3 000000000000e008 0000000000000246 ffff82c480287ee0
(XEN)    000000000000e010 ffff82c480287f10 ffff82c480150664 0000000000000000
(XEN)    ffff8300df2fc000 ffff8300df4d4000 00000000ffffffff ffff82c480287db8
(XEN)    0000000000000000 ffffffffffffffff ffffffff81787160 ffffffff81669fd8
(XEN)    ffffffff81669ed0 ffffffff81668000 0000000000000246 ffff8800067c0200
(XEN)    0000019575abe291 0000000000000000 0000000000000000 ffffffff810093aa
(XEN)    0000000400000000 00000000deadbeef 00000000deadbeef 0000010000000000
(XEN)    ffffffff810093aa 000000000000e033 0000000000000246 ffffffff81669eb8
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff8300df2fc000 0000000000000000
(XEN)    0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48015c006>] do_IRQ+0x375/0x59c
(XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
(XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
(XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL TRAP: vector = 2 (nmi)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

- Dante

On Thu, Nov 11, 2010 at 2:32 PM, Dante Cinco <dantecinco@gmail.com> wrote:
> [...]
Konrad Rzeszutek Wilk
2010-Nov-12 15:56 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Thu, Nov 11, 2010 at 12:42:03PM -0700, Lin, Ray wrote:
> Konrad,
>
> See my response in red.

Please don't top post.

> [...]
> > With iommu=soft and without swiotlb=force, I see the "software bounce
> > buffering" in /var/log/kern.log and an NMI (see below) when I load the
> > kernel module drivers. I made sure the NMI is reproducible and not a
>
> What is the kernel module doing to cause this? DMA?

??? What did it do?

> > one-time event.
>
> So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d enabled or disabled? (iommu=off,verbose) If you turn it off does this work?
>
> We have IOMMU VT-d enabled. If we turn it off (iommu=off,verbose), the DMA doesn't work properly and the driver code is unable to detect the source of the interrupt. The interrupts of our device are eventually disabled by the kernel because nobody serviced them more than 100000 times.

That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it
is an interrupt issue, it sounds like you are using x2APIC, and that is enabled without the IOMMU.
Have you tried disabling both the IOMMU and x2APIC? (This is all on the hypervisor line.)
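For reference, both options go on the hypervisor (xen.gz) line of the boot entry, not the dom0 kernel line. An illustrative GRUB stanza (kernel and initrd names are placeholders):

    title Xen (IOMMU and x2APIC disabled)
        root (hd0,0)
        kernel /boot/xen.gz iommu=off x2apic=off
        module /boot/vmlinuz-2.6.32-dom0 console=hvc0
        module /boot/initrd-2.6.32-dom0.img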
Lin, Ray
2010-Nov-12 16:20 UTC
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
Sent: Friday, November 12, 2010 7:57 AM
To: Lin, Ray
Cc: Dante Cinco; Xen-devel
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

[...]

That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it is an interrupt issue, it sounds like you are using x2APIC, and that is enabled without the IOMMU. Have you tried disabling both the IOMMU and x2APIC? (This is all on the hypervisor line.)

Konrad,

It's unlikely to be an interrupt issue; it looks like a DMA issue. Here is the sequence in which the tachyon device generates DMA and interrupts:

- The tachyon device does a DMA write to update the memory location that indicates the source of the interrupt.
- After the DMA is done, the tachyon device triggers an interrupt.
- The interrupt service routine of the software driver is invoked by the interrupt.
- The interrupt service routine checks the source of the interrupt by examining the memory that the preceding DMA was supposed to update.
- Even though the interrupt happens, the driver code can't find the source of the interrupt because the DMA didn't work properly.
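To make the failure mode concrete, an ISR following the sequence above would look roughly like this. A hypothetical sketch, not the actual tachyon driver: the struct, field names, and status-word layout are invented.

    #include <linux/interrupt.h>
    #include <linux/pci.h>

    /* Hypothetical driver state: the device DMAs a 32-bit status word
     * into a coherent buffer allocated with pci_alloc_coherent(). */
    struct example_dev {
            __le32 *irq_status;     /* CPU address of the DMA'd status word */
    };

    /* Illustrative ISR: if the device's DMA landed at the wrong machine
     * address (e.g. a pseudo-PFN was programmed instead of an MFN/bus
     * address), *irq_status never changes, the routine returns IRQ_NONE
     * every time, and the kernel eventually disables the IRQ as
     * unhandled -- the behavior described above. */
    static irqreturn_t example_isr(int irq, void *dev_id)
    {
            struct example_dev *dev = dev_id;
            u32 status = le32_to_cpu(*dev->irq_status);

            if (!status)
                    return IRQ_NONE;        /* no interrupt source found */

            /* ... dispatch on the bits set in 'status' ... */

            *dev->irq_status = 0;           /* re-arm for the next event */
            return IRQ_HANDLED;
    }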
Konrad Rzeszutek Wilk
2010-Nov-12 16:55 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
> That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it
> is an interrupt issue, it sounds like you are using x2APIC, and that is enabled without the IOMMU.
> Have you tried disabling both the IOMMU and x2APIC? (This is all on the hypervisor line.)
>
> Konrad,
>
> It's unlikely to be an interrupt issue; it looks like a DMA issue. Here is the sequence in which the tachyon device generates DMA and interrupts:
> - The tachyon device does a DMA write to update the memory location that indicates the source of the interrupt.
> - After the DMA is done, the tachyon device triggers an interrupt.
> - The interrupt service routine of the software driver is invoked by the interrupt.
> - The interrupt service routine checks the source of the interrupt by examining the memory that the preceding DMA was supposed to update.
> - Even though the interrupt happens, the driver code can't find the source of the interrupt because the DMA didn't work properly.

That sounds like the tachyon device is updating the wrong memory location. How are you
programming the memory location that the tachyon device is supposed to touch? Are you
using the value from pci_map_page, or are you using virt_to_phys? The virt_to_phys value
should be different from the pci_map_page one... unless you allocated a coherent DMA pool
using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be
the right MFNs.

One way you can check that you got the right MFN is to do something like this. Add these
two includes:

 #include <xen/page.h>
 #include <asm/xen/page.h>

 	phys_addr_t phys = page_to_phys(mem->pages[i]);
+	if (xen_pv_domain()) {
+		phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
+					page_to_pfn(mem->pages[i])));
+		if (phys != xen_phys) {
+			printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
+				" CODE UNTESTED!\n",
+				(unsigned long)phys,
+				(unsigned long)xen_phys);
+			WARN_ON_ONCE(phys != xen_phys);
+			phys = xen_phys;
+		}
+	}

and use the 'phys' value from then on. If this sounds like black magic, here is a short
writeup: http://wiki.xensource.com/xenwiki/XenPVOPSDRM - look at the "Why those patches"
section.

Also, are you using unsigned long or the phys_addr_t typedefs? The more I think about
your problem, the more it sounds like a truncation issue. You said that it works just
right (albeit slowly) if you use 'swiotlb=force'. The slowness could be due to not using
the pci_sync_* APIs to sync the DMA buffers... but regardless, using bounce buffers will
slow the DMA operations down. Using the bounce buffers limits the DMA operations to under
32 bits. So could it be that you are using some casting macro that casts a PFN to
unsigned long or vice versa, and we end up truncating it to 32 bits? (I've actually seen
this issue with InfiniBand drivers back in the RHEL5 days.)

Lastly, do you set your DMA mask on the device to 32-bit?
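On that last question, the driver's DMA mask is what tells the kernel whether bounce-buffering is needed at all. A minimal sketch of the usual 64-bit-with-32-bit-fallback setup (the function name is a placeholder):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    /* Illustrative probe-time fragment: prefer 64-bit DMA and fall back
     * to 32-bit.  A driver that only ever sets a 32-bit mask -- or casts
     * dma_addr_t down to a 32-bit type somewhere -- forces every buffer
     * above 4GB to be bounced, or worse, hands the device a truncated
     * bus address. */
    static int example_set_dma_mask(struct pci_dev *pdev)
    {
            if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
                    pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
            } else if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) {
                    pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
            } else {
                    dev_err(&pdev->dev, "no usable DMA configuration\n");
                    return -EIO;
            }
            return 0;
    }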
Konrad Rzeszutek Wilk
2010-Nov-12 16:58 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Thu, Nov 11, 2010 at 05:02:55PM -0800, Dante Cinco wrote:
> Here''s another datapoint: using iommu=1,passthrough,no-intremap,verbose
> in the Xen command line and iommu=soft in the pvops domU command line
> also results in an NMI (see below). Replacing iommu=soft with

Ok, so that enables VT-d and lets you do 64-bit DMA.

> swiotlb=force in the pvops domU works reliably but with the I/O
> performance degradation. It seems that regardless of whether the iommu is
> enabled or disabled in the hypervisor, swiotlb=force is necessary in
> the pvops domU.

That is bizarre. I am pretty sure it should work just fine with ''iommu=soft''. My test scripts confirm this, but let me run them once more just to make sure.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dante Cinco
2010-Nov-12 18:29 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Fri, Nov 12, 2010 at 7:56 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Thu, Nov 11, 2010 at 12:42:03PM -0700, Lin, Ray wrote:
>>
>> Konrad,
>>
>> See my response in red.
>
> Please don''t top post.
>
>>
>> -Ray
>>
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Konrad Rzeszutek Wilk
>> Sent: Thursday, November 11, 2010 11:04 AM
>> To: Dante Cinco
>> Cc: Xen-devel
>> Subject: Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
>>
>> On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
>> > Konrad,
>> >
>> > Without swiotlb=force, I don''t see "PCI-DMA: Using software bounce
>> > buffering for IO" in /var/log/kern.log.
>> >
>> > With iommu=soft and without swiotlb=force, I see the "software bounce
>> > buffering" in /var/log/kern.log and an NMI (see below) when I load the
>> > kernel module drivers. I made sure the NMI is reproducible and not a
>>
>> What is the kernel module doing to cause this? DMA?
>
> ??? What did it do?
>
>> > one-time event.
>>
>> So doing 64-bit DMA causes an NMI. Do you have the Hypervisor''s IOMMU VT-d enabled or disabled? (iommu=off,verbose) If you turn it off, does this work?
>>
>> We have IOMMU VT-d enabled. If we turn it off (iommu=off,verbose), the DMA doesn''t work properly and the driver code is unable to detect the source of the interrupt. The interrupts of our device eventually get disabled by the kernel because nobody services them for more than 100000 iterations.
>
> That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it is an interrupt issue, it sounds like you are using x2APIC and that it is enabled without the IOMMU.
> Have you tried disabling the IOMMU and x2apic? (this is all on the hypervisor line?)
>

I set the hypervisor boot options to iommu=0 x2apic=0. I booted the pvops domU with swiotlb=force initially since that''s the option that worked in the past. Not long after loading the kernel module drivers, domU hung/froze but dom0 stayed up. I checked the Xen interrupt bindings and I see that the PCI-passthrough devices have either (PS--) or (-S--).
(XEN) IRQ: 66 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:72 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:127(-S--),
(XEN) IRQ: 67 affinity:00000000,00000000,00000000,00000200 vec:3b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:126(-S--),
(XEN) IRQ: 68 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:8a type=PCI-MSI status=00000010 in-flight=0 domain-list=1:125(-S--),
(XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000800 vec:43 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:124(PS--),
(XEN) IRQ: 70 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:9a type=PCI-MSI status=00000010 in-flight=0 domain-list=1:123(-S--),
(XEN) IRQ: 71 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:a2 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:122(PS--),
(XEN) IRQ: 72 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:aa type=PCI-MSI status=00000010 in-flight=0 domain-list=1:121(-S--),
(XEN) IRQ: 73 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:b2 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:120(PS--),
(XEN) IRQ: 74 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:ba type=PCI-MSI status=00000010 in-flight=0 domain-list=1:119(-S--),
(XEN) IRQ: 75 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:c2 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:118(PS--),
(XEN) IRQ: 76 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:ca type=PCI-MSI status=00000010 in-flight=0 domain-list=1:117(-S--),
(XEN) IRQ: 77 affinity:00000000,00000000,00000000,00080000 vec:4b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:116(PS--),
(XEN) IRQ: 78 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:da type=PCI-MSI status=00000010 in-flight=0 domain-list=1:115(-S--),
(XEN) IRQ: 79 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:23 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:114(PS--),
(XEN) IRQ: 80 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:2b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:113(-S--),
(XEN) IRQ: 81 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:33 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:112(PS--),

I will reboot the pvops domU with iommu=soft and without swiotlb=force next.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Lin, Ray
2010-Nov-12 19:38 UTC
RE: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
Sent: Friday, November 12, 2010 8:56 AM
To: Lin, Ray
Cc: Dante Cinco; Xen-devel
Subject: Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

> That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it is an interrupt issue, it sounds like you are using x2APIC and that it is enabled without the IOMMU.
> Have you tried disabling the IOMMU and x2apic? (this is all on the
> hypervisor line?)
>
> Konrad,
> It''s unlikely to be an interrupt issue; it looks like a DMA issue. Here is the sequence
> in which the tachyon device generates the DMA/interrupts:
> - The tachyon device does a DMA to update the memory which indicates the source of the interrupt.
> - After the DMA is done, the tachyon device triggers an interrupt.
> - The interrupt service routine of the software driver is invoked by
> the interrupt.
> - The interrupt service routine checks the source of the interrupt by examining the memory which is supposed to have been updated by the previous DMA.
> - Even though the interrupt happens, the driver code can''t find the source of the interrupt since the DMA doesn''t work properly.

That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should be different from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.

Our driver uses pci_map_single to get the physical addr to program the chip.

One way you can figure this out is by doing something like this to make sure you got the right MFN:

add these two:
#include <xen/page.h>
#include <asm/xen/page.h>

	phys_addr_t phys = page_to_phys(mem->pages[i]);
+	if (xen_pv_domain()) {
+		phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
+				page_to_pfn(mem->pages[i])));
+		if (phys != xen_phys) {
+			printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
+				" CODE UNTESTED!\n",
+				(unsigned long)phys,
+				(unsigned long)xen_phys);
+			WARN_ON_ONCE(phys != xen_phys);
+			phys = xen_phys;
+		}
+	}

and use the ''phys'' value from then on.

If this sounds like black magic, here is a short writeup:
http://wiki.xensource.com/xenwiki/XenPVOPSDRM - look at the "Why those patches" section.

Also, are you using unsigned long or the phys_addr_t typedef?

The driver uses dma_addr_t for physical addresses.

The more I think about your problem, the more it sounds like a truncation issue. You said that it works just right (albeit slowly) if you use ''swiotlb=force''. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. but regardless, using bounce buffers will slow the DMA operations down.

The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.

Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice versa, and we end up truncating it to 32-bit? (I''ve seen this issue with InfiniBand drivers back in the RHEL5 days..) Lastly, do you set your DMA mask on the device to 32BIT?

The tachyon chip supports both 32-bit & 45-bit DMA. Some features need a 32-bit physical addr set in the chip; others need a 45-bit physical addr.
The driver doesn''t set the DMA mask on the device to 32-bit. I''m looking at the driver code to see whether anything is wrong. We appreciate your help, Konrad.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
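For reference, the map/sync pattern being discussed looks roughly like this - a sketch of the general streaming-DMA idiom, not the driver''s actual code:

    /* Map once; ownership then alternates between device and CPU. */
    dma_addr_t handle = pci_map_single(pdev, buf, len, PCI_DMA_FROMDEVICE);
    if (pci_dma_mapping_error(pdev, handle))
            goto fail;

    /* ... device DMAs into the buffer, interrupt arrives ... */

    /* CPU takes ownership; with swiotlb this is where the data is
     * copied out of the bounce buffer. */
    pci_dma_sync_single_for_cpu(pdev, handle, len, PCI_DMA_FROMDEVICE);
    /* ... CPU reads the buffer ... */

    /* Device takes ownership again for the next transfer. */
    pci_dma_sync_single_for_device(pdev, handle, len, PCI_DMA_FROMDEVICE);

    /* When done for good: */
    pci_unmap_single(pdev, handle, len, PCI_DMA_FROMDEVICE);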
Konrad Rzeszutek Wilk
2010-Nov-12 22:33 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
>> That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should be different from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.
>
> Our driver uses pci_map_single to get the physical addr to program the chip.

OK. Good.

>
> One way you can figure this out is by doing something like this to make sure you got the right MFN:
>
> add these two:
> #include <xen/page.h>
> #include <asm/xen/page.h>
>
> 	phys_addr_t phys = page_to_phys(mem->pages[i]);
> +	if (xen_pv_domain()) {
> +		phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
> +				page_to_pfn(mem->pages[i])));
> +		if (phys != xen_phys) {
> +			printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
> +				" CODE UNTESTED!\n",
> +				(unsigned long)phys,
> +				(unsigned long)xen_phys);
> +			WARN_ON_ONCE(phys != xen_phys);
> +			phys = xen_phys;
> +		}
> +	}
>
> and use the ''phys'' value from then on.
>
> If this sounds like black magic, here is a short writeup: http://wiki.xensource.com/xenwiki/XenPVOPSDRM - look at the "Why those patches" section.
>
> Also, are you using unsigned long or the phys_addr_t typedef?
>
> The driver uses dma_addr_t for physical addresses.

Excellent.

>
> The more I think about your problem, the more it sounds like a truncation issue. You said that it works just right (albeit slowly) if you use ''swiotlb=force''. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. but regardless, using bounce buffers will slow the DMA operations down.
>
> The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.

<nods> That makes sense.

>
> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice versa, and we end up truncating it to 32-bit? (I''ve seen this issue with InfiniBand drivers back in the RHEL5 days..) Lastly, do you set your DMA mask on the device to 32BIT?
>
> The tachyon chip supports both 32-bit & 45-bit DMA. Some features need a 32-bit physical addr set in the chip; others need a 45-bit physical addr.

Oh boy. That complicates it.

> The driver doesn''t set the DMA mask on the device to 32-bit.

Is it set to 45-bit then?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Lin, Ray
2010-Nov-12 22:57 UTC
RE: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
Sent: Friday, November 12, 2010 2:34 PM
To: Lin, Ray
Cc: Xen-devel; Dante Cinco
Subject: Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

>> That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should be different from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.
>
> Our driver uses pci_map_single to get the physical addr to program the chip.

OK. Good.

>
> One way you can figure this out is by doing something like this to make sure you got the right MFN:
>
> add these two:
> #include <xen/page.h>
> #include <asm/xen/page.h>
>
> 	phys_addr_t phys = page_to_phys(mem->pages[i]);
> +	if (xen_pv_domain()) {
> +		phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
> +				page_to_pfn(mem->pages[i])));
> +		if (phys != xen_phys) {
> +			printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
> +				" CODE UNTESTED!\n",
> +				(unsigned long)phys,
> +				(unsigned long)xen_phys);
> +			WARN_ON_ONCE(phys != xen_phys);
> +			phys = xen_phys;
> +		}
> +	}
>
> and use the ''phys'' value from then on.
>
> If this sounds like black magic, here is a short writeup:
> http://wiki.xensource.com/xenwiki/XenPVOPSDRM - look at the "Why those patches" section.
>
> Also, are you using unsigned long or the phys_addr_t typedef?
>
> The driver uses dma_addr_t for physical addresses.

Excellent.

>
> The more I think about your problem, the more it sounds like a truncation issue. You said that it works just right (albeit slowly) if you use ''swiotlb=force''. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. but regardless, using bounce buffers will slow the DMA operations down.
>
> The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.

<nods> That makes sense.

>
> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice versa, and we end up truncating it to 32-bit? (I''ve seen this issue with InfiniBand drivers back in the RHEL5 days..) Lastly, do you set your DMA mask on the device to 32BIT?
>
> The tachyon chip supports both 32-bit & 45-bit DMA. Some features need a 32-bit physical addr set in the chip; others need a 45-bit physical addr.

Oh boy. That complicates it.

> The driver doesn''t set the DMA mask on the device to 32-bit.

Is it set to 45-bit then?

The driver doesn''t use pci_set_dma_mask to set the HW_DMA_MASK. The tachyon chip should support 64-bit DMA transfers even though the programmable DMA address is limited to a 32-bit/45-bit address; the chip should fill the upper address bits with 0. I''m confirming this with their FAE now. In the meantime, I''ll try manipulating pci_set_dma_mask to see whether it makes a difference.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
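A sketch of the pci_set_dma_mask experiment Ray mentions, assuming the 2.6.36-era PCI DMA API:

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    /* Sketch: constrain this device to 32-bit bus addresses at probe
     * time, so the DMA API bounce-buffers anything above 4GB instead
     * of letting a silently-truncated address reach the hardware. */
    static int tachyon_limit_dma(struct pci_dev *pdev)
    {
            if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)))
                    return -EIO;
            return pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
    }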
Dante Cinco
2010-Nov-16 17:07 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Fri, Nov 12, 2010 at 2:33 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>>> That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should be different from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.
>>
>> Our driver uses pci_map_single to get the physical addr to program the chip.
>
> OK. Good.
>>
>> One way you can figure this out is by doing something like this to make sure you got the right MFN:
>>
>> add these two:
>> #include <xen/page.h>
>> #include <asm/xen/page.h>
>>
>> 	phys_addr_t phys = page_to_phys(mem->pages[i]);
>> +	if (xen_pv_domain()) {
>> +		phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
>> +				page_to_pfn(mem->pages[i])));
>> +		if (phys != xen_phys) {
>> +			printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
>> +				" CODE UNTESTED!\n",
>> +				(unsigned long)phys,
>> +				(unsigned long)xen_phys);
>> +			WARN_ON_ONCE(phys != xen_phys);
>> +			phys = xen_phys;
>> +		}
>> +	}
>>
>> and use the ''phys'' value from then on.
>>
>> If this sounds like black magic, here is a short writeup: http://wiki.xensource.com/xenwiki/XenPVOPSDRM - look at the "Why those patches" section.
>>
>> Also, are you using unsigned long or the phys_addr_t typedef?
>>
>> The driver uses dma_addr_t for physical addresses.
>
> Excellent.
>>
>> The more I think about your problem, the more it sounds like a truncation issue. You said that it works just right (albeit slowly) if you use ''swiotlb=force''. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. but regardless, using bounce buffers will slow the DMA operations down.
>>
>> The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.
>
> <nods> That makes sense.
>>
>> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice versa, and we end up truncating it to 32-bit? (I''ve seen this issue with InfiniBand drivers back in the RHEL5 days..) Lastly, do you set your DMA mask on the device to 32BIT?
>>
>> The tachyon chip supports both 32-bit & 45-bit DMA. Some features need a 32-bit physical addr set in the chip; others need a 45-bit physical addr.
>
> Oh boy. That complicates it.
>
>> The driver doesn''t set the DMA mask on the device to 32-bit.
>
> Is it set to 45-bit then?
>

We were not explicitly setting the DMA mask. pci_alloc_coherent was always returning 32 bits, but pci_map_single was returning a 34-bit address, which we truncated by casting it to a uint32_t since the Tachyon''s HBA register is only 32 bits wide. With swiotlb=force, both returned 32 bits without explicitly setting the DMA mask. Once we set the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However, with iommu=soft (and no more swiotlb=force), we''re still stuck with the abysmal I/O performance (the same as when we had swiotlb=force).

In pvops domU (xen-pcifront-0.8.2), what does iommu=soft do? What''s the default if we don''t specify it? Without it, we get no I/Os (it seems the interrupts and/or DMA don''t work).

Are there any profiling tools you can suggest for domU? I was able to apply Dulloor''s xenoprofile patch to our dom0 kernel (2.6.32.25-pvops) but not to xen-pcifront-0.8.2.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2010-Nov-16 18:57 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
> >> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice versa, and we end up truncating it to 32-bit? (I''ve seen this issue with InfiniBand drivers back in the RHEL5 days..) Lastly, do you set your DMA mask on the device to 32BIT?
> >>
> >> The tachyon chip supports both 32-bit & 45-bit DMA. Some features need a 32-bit physical addr set in the chip; others need a 45-bit physical addr.
> >
> > Oh boy. That complicates it.
> >
> >> The driver doesn''t set the DMA mask on the device to 32-bit.
> >
> > Is it set to 45-bit then?
> >
>
> We were not explicitly setting the DMA mask. pci_alloc_coherent was

You should. But only once (during startup).

> always returning 32 bits, but pci_map_single was returning a 34-bit
> address, which we truncated by casting it to a uint32_t since the

Truncating any bus (DMA) address is a big no-no.

> Tachyon''s HBA register is only 32 bits wide. With swiotlb=force, both

Not knowing the driver I can''t comment here much, but:

1). When you say ''HBA registers'' I think PCI MMIO BARs. Those are usually found beneath the 4GB limit, and you get the virtual address when doing ioremap (or the pci equivalent). And the bus address is definitely under 4GB.

2). After you have done that, set your pci_dma_mask to 34-bit, and then

3). For all other operations where you can do 34-bit, use pci_map_single. The swiotlb code looks at the dma_mask (and if none is set it assumes 32-bit), and if it finds the physical address to be within the DMA mask it will gladly translate the physical address to a bus address and do nothing else. If, however, the physical address is beyond the DMA mask, it will give you the bounce buffer, which you will later have to copy from (using pci_sync_*). I''ve written a little blurb at the bottom of this email explaining this in more detail.

Or is the issue that when you write the DMA address to your HBA register, the HBA register can _only_ deal with 32-bit values (4 bytes)? In which case the PCI device seems to be limited to addressing only up to 4GB, right?

> returned 32 bits without explicitly setting the DMA mask. Once we set
> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However,
> with iommu=soft (and no more swiotlb=force), we''re still stuck with
> the abysmal I/O performance (the same as when we had swiotlb=force).

Right, that is expected.

> In pvops domU (xen-pcifront-0.8.2), what does iommu=soft do? What''s
> the default if we don''t specify it? Without it, we get no I/Os (it

If you don''t specify it, you can''t do PCI passthrough in PV guests. It is automatically enabled when you boot Linux as Dom0.

> seems the interrupts and/or DMA don''t work).

It has two purposes:

1). The predominant one, which is used for both DomU and Dom0, is to translate physical addresses to machine frame numbers (PFNs->MFNs). Xen PV guests have a P2M array that is consulted when setting virtual addresses (PTEs). For PCI BARs, they are equivalent (PFN == MFN), but memory regions can be discontiguous, and in decreasing order. If you were to traverse the P2M list you could see: p2m(0x1000)==0x5121, p2m(0x1001)==0x5120, p2m(0x1002)==0x5119. So obviously we need a lookup mechanism to find, for, say, virtual address 0xfffff8000010000, the DMA address (bus address). Naively, on bare-metal x86 you could use virt_to_phys, which would get you PFN 0x10000. On Xen, however, we need to consult the P2M array.

For example, for p2m[0x10000], the real machine frame number might be 0x102323. So when you do ''pci_map_*'', Xen-SWIOTLB looks up the P2M to find you the machine frame number and returns that (the DMA address, aka bus address). That is the value you tell the HBA to transfer from/to. If you don''t enable Xen-SWIOTLB and use the native one (or none at all), you end up programming the PCI device with bogus data, since the bus address you are giving the card does not correspond to the real bus address.

2). Using our example before, p2m[0x10000] returned MFN 0x102323. That MFN is above 4GB (0x100000), and if your device can _only_ do PCI Memory Write and PCI Memory Read b/c it only has 32 address bits, we need some way of still getting the contents of 0x102323 to the PCI card. This is where bounce buffers come into play. During bootup, Xen-SWIOTLB initializes a 64MB chunk of space that is underneath the 4GB boundary - it is also contiguous. When you do ''pci_map_*'', Xen-SWIOTLB looks at your DMA mask and at the MFN, and if the MFN lies beyond the DMA mask, it copies the data from 0x102323 to one of its buffers, gives you the MFN of its buffer (say 0x20000), and you program that into the PCI card. When you get an interrupt from the PCI card, you call pci_sync_*, which copies from MFN 0x20000 back to 0x102323 and sticks MFN 0x20000 back on the list of buffers to be used. And now you have the result in MFN 0x102323.

> Are there any profiling tools you can suggest for domU? I was able to
> apply Dulloor''s xenoprofile patch to our dom0 kernel (2.6.32.25-pvops)
> but not to xen-pcifront-0.8.2.

Oh boy. I don''t, sorry.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
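A simplified illustration of the P2M lookup Konrad describes above; the real logic lives in the kernel''s Xen-SWIOTLB code, so this is only a sketch of the translation step, not that implementation:

    #include <linux/mm.h>
    #include <asm/xen/page.h>   /* pfn_to_mfn() */

    /* Translate a guest page to the machine (bus) address that a
     * passed-through PCI device must be programmed with: consult the
     * P2M array, then shift. The lookup itself is O(1). */
    static dma_addr_t sketch_xen_bus_addr(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            unsigned long mfn = pfn_to_mfn(pfn);    /* P2M lookup */

            return (dma_addr_t)mfn << PAGE_SHIFT;
    }

On bare metal pfn and mfn would be identical; under Xen PV they usually are not, which is why programming a device with a raw virt_to_phys value corrupts memory.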
Dante Cinco
2010-Nov-16 19:43 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Tue, Nov 16, 2010 at 10:57 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> >> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice versa, and we end up truncating it to 32-bit? (I''ve seen this issue with InfiniBand drivers back in the RHEL5 days..) Lastly, do you set your DMA mask on the device to 32BIT?
>> >>
>> >> The tachyon chip supports both 32-bit & 45-bit DMA. Some features need a 32-bit physical addr set in the chip; others need a 45-bit physical addr.
>> >
>> > Oh boy. That complicates it.
>> >
>> >> The driver doesn''t set the DMA mask on the device to 32-bit.
>> >
>> > Is it set to 45-bit then?
>> >
>>
>> We were not explicitly setting the DMA mask. pci_alloc_coherent was
>
> You should. But only once (during startup).
>
>> always returning 32 bits, but pci_map_single was returning a 34-bit
>> address, which we truncated by casting it to a uint32_t since the
>
> Truncating any bus (DMA) address is a big no-no.
>
>> Tachyon''s HBA register is only 32 bits wide. With swiotlb=force, both
>
> Not knowing the driver I can''t comment here much, but:
> 1). When you say ''HBA registers'' I think PCI MMIO BARs. Those are
>     usually found beneath the 4GB limit, and you get the virtual
>     address when doing ioremap (or the pci equivalent). And the
>     bus address is definitely under 4GB.
> 2). After you have done that, set your pci_dma_mask to 34-bit, and then
> 3). For all other operations where you can do 34-bit, use pci_map_single.
>     The swiotlb code looks at the dma_mask (and if none is set it
>     assumes 32-bit), and if it finds the physical address to be within
>     the DMA mask it will gladly translate the physical address to a bus
>     address and do nothing else. If, however, the physical address is
>     beyond the DMA mask, it will give you the bounce buffer, which you
>     will later have to copy from (using pci_sync_*). I''ve written
>     a little blurb at the bottom of the email explaining this in more detail.
>
> Or is the issue that when you write the DMA address to your HBA register,
> the HBA register can _only_ deal with 32-bit values (4 bytes)?

The HBA register which takes the address returned by pci_map_single is limited to a 32-bit value.

> In which case the PCI device seems to be limited to addressing only up to 4GB, right?

The HBA has some 32-bit registers and some that are 45-bit.

>
>> returned 32 bits without explicitly setting the DMA mask. Once we set
>> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However,
>> with iommu=soft (and no more swiotlb=force), we''re still stuck with
>> the abysmal I/O performance (the same as when we had swiotlb=force).
>
> Right, that is expected.

So with iommu=soft, all I/Os have to go through Xen-SWIOTLB, which explains why we''re seeing the abysmal I/O performance, right?

Is it true then that with an HVM domU kernel and PCI passthrough, it does not use Xen-SWIOTLB and therefore gets better performance?

>
>> In pvops domU (xen-pcifront-0.8.2), what does iommu=soft do? What''s
>> the default if we don''t specify it? Without it, we get no I/Os (it
>
> If you don''t specify it, you can''t do PCI passthrough in PV guests.
> It is automatically enabled when you boot Linux as Dom0.
>
>> seems the interrupts and/or DMA don''t work).
>
> It has two purposes:
>
> 1). The predominant one, which is used for both DomU and Dom0, is to
>     translate physical addresses to machine frame numbers (PFNs->MFNs).
>     Xen PV guests have a P2M array that is consulted when setting
>     virtual addresses (PTEs). For PCI BARs, they are equivalent
>     (PFN == MFN), but memory regions can be discontiguous,
>     and in decreasing order. If you were to traverse the P2M list you
>     could see: p2m(0x1000)==0x5121, p2m(0x1001)==0x5120, p2m(0x1002)==0x5119.
>
>     So obviously we need a lookup mechanism to find, for, say,
>     virtual address 0xfffff8000010000, the DMA address (bus address).
>     Naively, on bare-metal x86 you could use virt_to_phys, which would
>     get you PFN 0x10000. On Xen, however, we need to consult the P2M array.
>     For example, for p2m[0x10000], the real machine frame number might be 0x102323.
>
>     So when you do ''pci_map_*'', Xen-SWIOTLB looks up the P2M to find you the
>     machine frame number and returns that (the DMA address, aka bus address). That
>     is the value you tell the HBA to transfer from/to.
>
>     If you don''t enable Xen-SWIOTLB and use the native one (or none at all),
>     you end up programming the PCI device with bogus data, since the bus address you
>     are giving the card does not correspond to the real bus address.
>
> 2). Using our example before, p2m[0x10000] returned MFN 0x102323. That
>     MFN is above 4GB (0x100000), and if your device can _only_ do PCI Memory Write
>     and PCI Memory Read b/c it only has 32 address bits, we need some way
>     of still getting the contents of 0x102323 to the PCI card. This is where
>     bounce buffers come into play. During bootup, Xen-SWIOTLB initializes a 64MB
>     chunk of space that is underneath the 4GB boundary - it is also contiguous.
>     When you do ''pci_map_*'', Xen-SWIOTLB looks at your DMA mask and at the MFN,
>     and if the MFN lies beyond the DMA mask, it copies the data from 0x102323 to one of its
>     buffers, gives you the MFN of its buffer (say 0x20000), and you program that
>     into the PCI card. When you get an interrupt from the PCI card, you call
>     pci_sync_*, which copies from MFN 0x20000 back to 0x102323 and sticks MFN 0x20000
>     back on the list of buffers to be used. And now you have the
>     result in MFN 0x102323.
>
>> Are there any profiling tools you can suggest for domU? I was able to
>> apply Dulloor''s xenoprofile patch to our dom0 kernel (2.6.32.25-pvops)
>> but not to xen-pcifront-0.8.2.
>
> Oh boy. I don''t, sorry.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2010-Nov-16 20:15 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
> > Or is the issue that when you write the DMA address to your HBA register,
> > the HBA register can _only_ deal with 32-bit values (4 bytes)?
>
> The HBA register which takes the address returned by pci_map_single
> is limited to a 32-bit value.
>
> > In which case the PCI device seems to be limited to addressing only up to 4GB, right?
>
> The HBA has some 32-bit registers and some that are 45-bit.

Ugh. So: can you set up PCI coherent DMA pools at startup for the 32-bit registers, then set the pci_dma_mask to 45-bit and use pci_map_single for all the others?

> >>
> >> returned 32 bits without explicitly setting the DMA mask. Once we set
> >> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However,
> >> with iommu=soft (and no more swiotlb=force), we''re still stuck with
> >> the abysmal I/O performance (the same as when we had swiotlb=force).
> >
> > Right, that is expected.
>
> So with iommu=soft, all I/Os have to go through Xen-SWIOTLB, which
> explains why we''re seeing the abysmal I/O performance, right?

You are simplifying it. You are seeing abysmal I/O performance b/c you are doing bounce buffering. You can fix this by making the driver allocate a 32-bit pool at startup and use that just for the HBA registers that can only do 32-bit, and then, for the rest, use pci_map_single with a 45-bit DMA mask.

> Is it true then that with an HVM domU kernel and PCI passthrough, it
> does not use Xen-SWIOTLB and therefore gets better performance?

Yes and no.

If you allocate more than 4GB to your HVM guests, you are going to hit the same issues with the bounce buffer.

If you give your guest less than 4GB, there is no SWIOTLB running in the guest, and QEMU along with the hypervisor end up using the hardware one (the Xen hypervisor currently supports AMD-Vi and Intel VT-d). In your case it is VT-d - at which point VT-d will remap your GMFNs to MFNs, and VT-d will be responsible for translating the DMA address that the PCI card tries to access to the real MFN.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
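A sketch of the mask split Konrad suggests, assuming the 2.6.36-era PCI DMA API: the streaming mask governs pci_map_single, while the consistent mask governs pci_alloc_coherent and pci_pool allocations, so the two can differ:

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    static int tachyon_set_dma_masks(struct pci_dev *pdev)
    {
            /* Streaming mappings (pci_map_single) may return anything
             * up to a 45-bit bus address, for the registers that can
             * take the full width... */
            if (pci_set_dma_mask(pdev, DMA_BIT_MASK(45)))
                    return -EIO;
            /* ...while coherent allocations (pci_alloc_coherent and
             * pci_pool) stay below 4GB for the 32-bit-only registers. */
            return pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
    }

With this split, only mappings that actually exceed the 45-bit mask would ever be bounce-buffered.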
Dante Cinco
2010-Nov-18 01:09 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Tue, Nov 16, 2010 at 12:15 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> > Or is the issue that when you write the DMA address to your HBA register,
>> > the HBA register can _only_ deal with 32-bit values (4 bytes)?
>>
>> The HBA register which takes the address returned by pci_map_single
>> is limited to a 32-bit value.
>>
>> > In which case the PCI device seems to be limited to addressing only up to 4GB, right?
>>
>> The HBA has some 32-bit registers and some that are 45-bit.
>
> Ugh. So: can you set up PCI coherent DMA pools at startup for the 32-bit
> registers, then set the pci_dma_mask to 45-bit and use pci_map_single for
> all the others?
>>
>> >>
>> >> returned 32 bits without explicitly setting the DMA mask. Once we set
>> >> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However,
>> >> with iommu=soft (and no more swiotlb=force), we''re still stuck with
>> >> the abysmal I/O performance (the same as when we had swiotlb=force).
>> >
>> > Right, that is expected.
>>
>> So with iommu=soft, all I/Os have to go through Xen-SWIOTLB, which
>> explains why we''re seeing the abysmal I/O performance, right?
>
> You are simplifying it. You are seeing abysmal I/O performance b/c you
> are doing bounce buffering. You can fix this by making the driver
> allocate a 32-bit pool at startup and use that just for the
> HBA registers that can only do 32-bit, and then, for the rest, use
> pci_map_single with a 45-bit DMA mask.

I wanted to confirm that bounce buffering was indeed occurring, so I modified swiotlb.c in the kernel and added printks in the following functions:
swiotlb_bounce
swiotlb_tbl_map_single
swiotlb_tbl_unmap_single
Sure enough, we were calling all three five times per I/O. We took your suggestion and replaced pci_map_single with pci_pool_alloc. The swiotlb calls were gone, but the I/O performance only improved 6% (29k IOPS to 31k IOPS), which is still abysmal.

Any suggestions on where to look next? I have one question about the P2M array: Does the P2M lookup occur on every DMA or just during allocation? What I''m getting at is this: Is the Xen-SWIOTLB a central resource that could be a bottleneck?

>
>> Is it true then that with an HVM domU kernel and PCI passthrough, it
>> does not use Xen-SWIOTLB and therefore gets better performance?
>
> Yes and no.
>
> If you allocate more than 4GB to your HVM guests, you are going to
> hit the same issues with the bounce buffer.
>
> If you give your guest less than 4GB, there is no SWIOTLB running in the guest,
> and QEMU along with the hypervisor end up using the hardware one (the Xen
> hypervisor currently supports AMD-Vi and Intel VT-d). In your case it is VT-d
> - at which point VT-d will remap your GMFNs to MFNs, and VT-d will
> be responsible for translating the DMA address that the PCI card
> tries to access to the real MFN.
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
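The pci_map_single-to-pci_pool change Dante describes might look roughly like this (the pool name and descriptor size are hypothetical). Pool memory comes from the coherent DMA allocator, so it is placed below the mask once at startup and never bounced per I/O:

    #include <linux/pci.h>

    #define TACHYON_DESC_SIZE 64    /* hypothetical descriptor size */

    /* Once, at probe time: carve out bus-addressable descriptor memory. */
    struct pci_pool *pool = pci_pool_create("tachyon_desc", pdev,
                                            TACHYON_DESC_SIZE,
                                            64 /* alignment */, 0);

    /* Per I/O: the allocation is cheap and already bus-addressable,
     * so there is no per-I/O map/bounce/unmap cycle. */
    dma_addr_t bus;
    void *desc = pci_pool_alloc(pool, GFP_ATOMIC, &bus);
    /* ... program ''bus'' into the HBA; on completion: ... */
    pci_pool_free(pool, desc, bus);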
Konrad Rzeszutek Wilk
2010-Nov-18 17:19 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
Keir, Dan, Mathieu, Chris, Mukesh,

This fellow is passing a PCI device through to his Xen PV guest and trying to get high IOPS. The kernel he is using is a 2.6.36 with tglx''s sparse_irq rework.

> I wanted to confirm that bounce buffering was indeed occurring, so I
> modified swiotlb.c in the kernel and added printks in the following
> functions:
> swiotlb_bounce
> swiotlb_tbl_map_single
> swiotlb_tbl_unmap_single
> Sure enough, we were calling all three five times per I/O. We took your
> suggestion and replaced pci_map_single with pci_pool_alloc. The
> swiotlb calls were gone, but the I/O performance only improved 6% (29k
> IOPS to 31k IOPS), which is still abysmal.

Hey! 6% is nothing to sneeze at.

> Any suggestions on where to look next? I have one question about the

So since you are talking IOPS, I figure you must be using fio to run those numbers. And since you mentioned HVM at some point, you are not running this PV domain as a back-end for another PV guest. You are probably going to run some form of iSCSI target and stuff those I/Os down the PCI device.

A couple of things pop into my head.. but let''s first address your question.

> P2M array: Does the P2M lookup occur on every DMA or just during
> allocation? What I''m getting at is this: Is the Xen-SWIOTLB a central

It only occurs during allocation. Also, since you are bypassing the bounce buffer, those calls are done without any spinlock. The P2M lookup is bit-shifting and division - constant-time operations - so O(1).

> resource that could be a bottleneck?

Doubt it. Your best bet to figure this out is to play with ftrace or perf trace. But I don''t know how well they work with Xen nowadays - Jeremy and Mathieu Desnoyers poked at it a bit, and I think I overheard that Mathieu got it working?

So the next couple of possibilities are:

1). You are hitting the spinlock issues on ''struct request'' or any of the paths on the I/O. Oracle did a lot of work on those - and one way to find this out is to look at tracing and see where the contention is. I don''t know where or if those patches have been posted upstream.. but as said, if you are seeing high spinlock usage - that might be it.

1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled. Otherwise you are going to hit dreadful conditions.

2). You are hitting the 64-bit syscall wall. Basically your user-mode application (fio) is doing a write(), which used to be int 0x80 but is now a syscall. The syscall gets trapped in the hypervisor, which has to call into your PV kernel. You get hit with two context switches for each ''write()'' call. The solution is to use a 32-bit DomU, where the guest user application and guest kernel run in different rings.

3). Xen CPU pools. You didn''t say where the application that sends the I/Os is located, but if it is in a separate domain then you will want to use Xen CPU pools. Basically this way you can get gang-scheduling, where the guest that submits the I/O and the guest that picks up the I/O run right after each other. I don''t know many more details, but this is what I understand it does.

4). CPU/MSI-X affinity. I think you already did this, but make sure you pin your guest to specific CPUs and also pin the MSI-X (vectors) to the proper destination. You can use ''xm debug-keys i'' to see the MSI-X affinity - it is a mask - and check whether it overlays the CPUs you are running your guest on. Not sure how to actually set the MSI-X affinity ... now that I think about it, Keir or some of the Intel folks might know better.

5). Andrew, Mukesh, Keir, Dan, any other ideas?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Chris Mason
2010-Nov-18 17:28 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
Excerpts from Konrad Rzeszutek Wilk''s message of 2010-11-18 12:19:36 -0500:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing a PCI device through to his Xen PV guest and trying
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx''s
> sparse_irq rework.
>
> > I wanted to confirm that bounce buffering was indeed occurring, so I
> > modified swiotlb.c in the kernel and added printks in the following
> > functions:
> > swiotlb_bounce
> > swiotlb_tbl_map_single
> > swiotlb_tbl_unmap_single
> > Sure enough, we were calling all three five times per I/O. We took your
> > suggestion and replaced pci_map_single with pci_pool_alloc. The
> > swiotlb calls were gone, but the I/O performance only improved 6% (29k
> > IOPS to 31k IOPS), which is still abysmal.
>
> Hey! 6% is nothing to sneeze at.

How fast does it go on bare metal?

I usually do four things:

1) perf record -g -a -f ''sleep 15'' (use perf report to look at the biggest CPU hogs)

2) mpstat -P ALL 1 to find the CPU doing all the softirq processing

3) perf record -g -C N -f ''sleep 15'' where N is the CPU in mpstat -P ALL that was doing all the softirq processing

4) Turn off the intel iommu. This isn''t an option for virtualized though, but I''d try it on/off on bare metal.

-chris

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Mathieu Desnoyers
2010-Nov-18 17:54 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
* Konrad Rzeszutek Wilk (konrad.wilk@oracle.com) wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
[...]
> Doubt it. Your best bet to figure this out is to play with ftrace or
> perf trace. But I don''t know how well they work with Xen nowadays - Jeremy
> and Mathieu Desnoyers poked at it a bit, and I think I overheard that Mathieu got
> it working?

I did port LTTng to the Xen hypervisor in a past life, but I did not have time to maintain this port in parallel with the Linux kernel LTTng. So I doubt these bits would be very useful today, as a new port would be needed for compatibility with newer lttng tools. If you can afford to use older Xen hypervisors with older Linux kernels and old LTTng/LTTV versions, then you could gather a synchronized trace across the hypervisor/Dom0/DomUs, but it would require some work for recent Xen versions.

Currently, we''ve been focusing our efforts on tracing of KVM, which works very well. We support analysis of traces taken from different host/guest domains, as long as the TSCs are synchronized.

So an option here would be to deploy LTTng on both your dom0 and domU kernels, gather traces of both in parallel while you run your workload, and compare the resulting traces (load both dom0 and domU traces into one trace set within lttv). Comparing the I/O behavior with a bare-metal trace should give good insight into what''s different. At least you''ll be able to follow the path taken by each I/O request, except for what''s happening in Xen, which will be a black box.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dante Cinco
2010-Nov-18 18:43 UTC
Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing a PCI device through to his Xen PV guest and trying
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx''s
> sparse_irq rework.
>
>> I wanted to confirm that bounce buffering was indeed occurring, so I
>> modified swiotlb.c in the kernel and added printks in the following
>> functions:
>> swiotlb_bounce
>> swiotlb_tbl_map_single
>> swiotlb_tbl_unmap_single
>> Sure enough, we were calling all three five times per I/O. We took your
>> suggestion and replaced pci_map_single with pci_pool_alloc. The
>> swiotlb calls were gone, but the I/O performance only improved 6% (29k
>> IOPS to 31k IOPS), which is still abysmal.
>
> Hey! 6% is nothing to sneeze at.

When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at least 20x higher (~700k IOPS).

>
>> Any suggestions on where to look next? I have one question about the
>
> So since you are talking IOPS, I figure you must be using fio to run those
> numbers. And since you mentioned HVM at some point, you are not running
> this PV domain as a back-end for another PV guest. You are probably going
> to run some form of iSCSI target and stuff those I/Os down the PCI device.

Our setup is pure Fibre Channel. We''re using a physically separate system (also Linux-based) to initiate the SCSI I/Os.

>
> A couple of things pop into my head.. but let''s first address your question.
>
>> P2M array: Does the P2M lookup occur on every DMA or just during
>> allocation? What I''m getting at is this: Is the Xen-SWIOTLB a central
>
> It only occurs during allocation. Also, since you are bypassing the
> bounce buffer, those calls are done without any spinlock. The P2M lookup
> is bit-shifting and division - constant-time operations - so O(1).
>
>> resource that could be a bottleneck?
>
> Doubt it. Your best bet to figure this out is to play with ftrace or
> perf trace. But I don''t know how well they work with Xen nowadays - Jeremy
> and Mathieu Desnoyers poked at it a bit, and I think I overheard that Mathieu got
> it working?
>
> So the next couple of possibilities are:
> 1). You are hitting the spinlock issues on ''struct request'' or any of
> the paths on the I/O. Oracle did a lot of work on those - and one
> way to find this out is to look at tracing and see where the contention is.
> I don''t know where or if those patches have been posted upstream.. but as said,
> if you are seeing high spinlock usage - that might be it.
> 1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled. Otherwise

I checked the config file and it is enabled:
CONFIG_PARAVIRT_SPINLOCKS=y

> you are going to hit dreadful conditions.
> 2). You are hitting the 64-bit syscall wall. Basically your user-mode
> application (fio) is doing a write(), which used to be int 0x80 but is now
> a syscall. The syscall gets trapped in the hypervisor, which has to
> call into your PV kernel. You get hit with two context switches for each
> ''write()'' call. The solution is to use a 32-bit DomU, where the guest user
> application and guest kernel run in different rings.

There is no user-space application involved with the I/O. It''s all kernel driver code that handles the I/O.

> 3). Xen CPU pools. You didn''t say where the application that sends the I/Os
> is located, but if it is in a separate domain then you will want to use
> Xen CPU pools. Basically this way you can get gang-scheduling, where the
> guest that submits the I/O and the guest that picks up the I/O run
> right after each other. I don''t know many more details, but this is what
> I understand it does.
> 4). CPU/MSI-X affinity. I think you already did this, but make sure you pin
> your guest to specific CPUs and also pin the MSI-X (vectors) to the proper
> destination. You can use ''xm debug-keys i'' to see the MSI-X affinity - it
> is a mask - and check whether it overlays the CPUs you are running your guest
> on. Not sure how to actually set the MSI-X affinity ... now that I think about it,
> Keir or some of the Intel folks might know better.

There are 16 devices (multi-function) that are PCI-passed through to domU. There are 16 VCPUs in domU, and all are pinned to individual PCPUs (24-CPU platform). Each IRQ in domU is affinitized to a CPU. This strategy has worked well for us with the HVM kernel.

Here''s the output of ''xm debug-keys i'':
(XEN) IRQ: 67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a type=PCI-MSI status=00000010 in-flight=0 domain-list=1:127(----),
(XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000200 vec:43 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:126(----),
(XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000400 vec:83 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:125(----),
(XEN) IRQ: 70 affinity:00000000,00000000,00000000,00000800 vec:4b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:124(----),
(XEN) IRQ: 71 affinity:00000000,00000000,00000000,00001000 vec:8b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:123(----),
(XEN) IRQ: 72 affinity:00000000,00000000,00000000,00002000 vec:53 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:122(----),
(XEN) IRQ: 73 affinity:00000000,00000000,00000000,00004000 vec:93 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:121(----),
(XEN) IRQ: 74 affinity:00000000,00000000,00000000,00008000 vec:5b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:120(----),
(XEN) IRQ: 75 affinity:00000000,00000000,00000000,00010000 vec:9b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:119(----),
(XEN) IRQ: 76 affinity:00000000,00000000,00000000,00020000 vec:63 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:118(----),
(XEN) IRQ: 77 affinity:00000000,00000000,00000000,00040000 vec:a3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:117(----),
(XEN) IRQ: 78 affinity:00000000,00000000,00000000,00080000 vec:6b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:116(----),
(XEN) IRQ: 79 affinity:00000000,00000000,00000000,00100000 vec:ab type=PCI-MSI status=00000010 in-flight=0 domain-list=1:115(----),
(XEN) IRQ: 80 affinity:00000000,00000000,00000000,00200000 vec:73 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:114(----),
(XEN) IRQ: 81 affinity:00000000,00000000,00000000,00400000 vec:b3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:113(----),
(XEN) IRQ: 82 affinity:00000000,00000000,00000000,00800000 vec:7b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:112(----),

> 5). Andrew, Mukesh, Keir, Dan, any other ideas?
>

We''re also trying Chris''s four suggestions and will consider Mathieu''s LTTng suggestion.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Lin, Ray
2010-Nov-18 18:52 UTC
RE: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
-----Original Message----- From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dante Cinco Sent: Thursday, November 18, 2010 10:44 AM To: Konrad Rzeszutek Wilk Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@polymtl.ca; andrew.thomas@oracle.com; keir.fraser@eu.citrix.com; chris.mason@oracle.com Subject: Re: [Xen-devel] swiotlb=force in Konrad''s xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> Keir, Dan, Mathieu, Chris, Mukesh, > > This fellow is passing in a PCI device to his Xen PV guest and trying > to get high IOPS. The kernel he is using is a 2.6.36 with tglx''s > sparse_irq rework. > >> I wanted to confirm that bounce buffering was indeed occurring so I >> modified swiotlb.c in the kernel and added printks in the following >> functions: >> swiotlb_bounce >> swiotlb_tbl_map_single >> swiotlb_tbl_unmap_single >> Sure enough we were calling all 3 five times per I/O. We took your >> suggestion and replaced pci_map_single with pci_pool_alloc. The >> swiotlb calls were gone but the I/O performance only improved 6% (29k >> IOPS to 31k IOPS) which is still abysmal. > > Hey! 6% that is nothing to sneeze at.When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at least 20x (~700k IOPS).> >> >> Any suggestions on where to look next? I have one question about the > > So since you are talking IOPS I figured you must be using fio to run > those numbers. And since you mentioned HVM at some point, you are not > running this PV domain as a back-end for another PV guest. You are > probably going to run some form of iSCSI target and stuff those down the PCI device.Our setup is pure Fibre Channel. We''re using a physically separate system (Linux-based also) to initiate the SCSI I/Os.> > Couple of things that pop in my head.. but lets first address your question. > >> P2M array: Does the P2M lookup occur every DMA or just during the >> allocation? What I''m getting at is this: Is the Xen-SWIOTLB a central > > It only occurs during allocation. Also since you are bypassing the > bounce buffer those calls are done without any spinlock. The lookup of > P2M is bitshifting, division - and are constant - so O(1). > >> resource that could be a bottleneck? > > Doubt it. Your best bet to figure this out is to play with ftrace, or > perf trace. But I don''t know how well they work with Xen nowadays - > Jeremy and Mathieu Desnoyers poked it a bit and I think I overheard > that Mathieu got it working? > > So the next couple of possiblities are: > 1). you are hitting the spinlock issues on ''struct request'' or any of > the paths on the I/O. Oracle did a lot of work on those - and one > way to find this out is to look at tracing and see where the contention is. > I don''t know where or if those patches have been posted upstream.. > but as said, > if you are seeing the spinlock usage high - that might be it. > 1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled. > OtherwiseI checked the config file and it is enabled: CONFIG_PARAVIRT_SPINLOCKS=y The platform we''re running has Intel Xeon E5540 and X58 chipset. Here is the kernel configuration associated with processor. Is there anything we could tune to improve the performance ? 
# # Processor type and features # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_SMP=y CONFIG_SPARSE_IRQ=y CONFIG_NUMA_IRQ_DESC=y CONFIG_X86_MPPARSE=y # CONFIG_X86_EXTENDED_PLATFORM is not set CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_PARAVIRT_GUEST=y CONFIG_XEN=y CONFIG_XEN_PVHVM=y CONFIG_XEN_MAX_DOMAIN_MEMORY=8 CONFIG_XEN_SAVE_RESTORE=y CONFIG_XEN_DEBUG_FS=y CONFIG_KVM_CLOCK=y CONFIG_KVM_GUEST=y CONFIG_PARAVIRT=y CONFIG_PARAVIRT_SPINLOCKS=y CONFIG_PARAVIRT_CLOCK=y # CONFIG_PARAVIRT_DEBUG is not set CONFIG_NO_BOOTMEM=y # CONFIG_MEMTEST is not set # CONFIG_MK8 is not set # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set # CONFIG_MATOM is not set CONFIG_GENERIC_CPU=y CONFIG_X86_CPU=y CONFIG_X86_INTERNODE_CACHE_SHIFT=7 CONFIG_X86_CMPXCHG=y CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_XADD=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_TSC=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_CMOV=y CONFIG_X86_MINIMUM_CPU_FAMILY=64 CONFIG_X86_DEBUGCTLMSR=y CONFIG_CPU_SUP_INTEL=y CONFIG_CPU_SUP_AMD=y CONFIG_CPU_SUP_CENTAUR=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_DMI=y CONFIG_GART_IOMMU=y CONFIG_CALGARY_IOMMU=y CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y CONFIG_AMD_IOMMU=y CONFIG_AMD_IOMMU_STATS=y CONFIG_SWIOTLB=y CONFIG_IOMMU_HELPER=y CONFIG_IOMMU_API=y # CONFIG_MAXSMP is not set CONFIG_NR_CPUS=32 CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y CONFIG_X86_MCE=y CONFIG_X86_MCE_INTEL=y CONFIG_X86_MCE_AMD=y CONFIG_X86_MCE_THRESHOLD=y CONFIG_X86_MCE_INJECT=y CONFIG_X86_THERMAL_VECTOR=y # CONFIG_I8K is not set CONFIG_MICROCODE=y CONFIG_MICROCODE_INTEL=y CONFIG_MICROCODE_AMD=y CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y CONFIG_ARCH_PHYS_ADDR_T_64BIT=y CONFIG_DIRECT_GBPAGES=y CONFIG_NUMA=y CONFIG_K8_NUMA=y CONFIG_X86_64_ACPI_NUMA=y CONFIG_NODES_SPAN_OTHER_NODES=y # CONFIG_NUMA_EMU is not set CONFIG_NODES_SHIFT=6 CONFIG_ARCH_PROC_KCORE_TEXT=y CONFIG_ARCH_SPARSEMEM_DEFAULT=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000 CONFIG_SELECT_MEMORY_MODEL=y CONFIG_SPARSEMEM_MANUAL=y CONFIG_SPARSEMEM=y CONFIG_NEED_MULTIPLE_NODES=y CONFIG_HAVE_MEMORY_PRESENT=y CONFIG_SPARSEMEM_EXTREME=y CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y CONFIG_SPARSEMEM_VMEMMAP=y # CONFIG_MEMORY_HOTPLUG is not set CONFIG_PAGEFLAGS_EXTENDED=y CONFIG_SPLIT_PTLOCK_CPUS=4 # CONFIG_COMPACTION is not set CONFIG_MIGRATION=y CONFIG_PHYS_ADDR_T_64BIT=y CONFIG_ZONE_DMA_FLAG=1 CONFIG_BOUNCE=y CONFIG_VIRT_TO_BUS=y # CONFIG_KSM is not set CONFIG_DEFAULT_MMAP_MIN_ADDR=4096 CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y # CONFIG_MEMORY_FAILURE is not set CONFIG_X86_CHECK_BIOS_CORRUPTION=y CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y CONFIG_X86_RESERVE_LOW_64K=y CONFIG_MTRR=y # CONFIG_MTRR_SANITIZER is not set CONFIG_X86_PAT=y CONFIG_ARCH_USES_PG_UNCACHED=y CONFIG_EFI=y CONFIG_SECCOMP=y # CONFIG_CC_STACKPROTECTOR is not set CONFIG_HZ_100=y # CONFIG_HZ_250 is not set # CONFIG_HZ_300 is not set # CONFIG_HZ_1000 is not set CONFIG_HZ=100 CONFIG_SCHED_HRTICK=y CONFIG_KEXEC=y CONFIG_CRASH_DUMP=y CONFIG_KEXEC_JUMP=y CONFIG_PHYSICAL_START=0x1000000 CONFIG_RELOCATABLE=y CONFIG_PHYSICAL_ALIGN=0x1000000 CONFIG_HOTPLUG_CPU=y # CONFIG_COMPAT_VDSO is not set # CONFIG_CMDLINE_BOOL is not set 
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

> you are going to hit dreadful conditions.
> 2). You are hitting the 64-bit syscall wall. Basically your user-mode
>     application (fio) is doing a write(), which used to be int 0x80 but now
>     is a syscall. The syscall gets trapped in the hypervisor, which has to
>     call into your PV kernel. You get hit with two context switches for each
>     'write()' call. The solution is to use a 32-bit DomU, where the guest user
>     application and guest kernel run in different rings.

There is no user-space application involved with the I/O. It's all kernel driver code that handles the I/O.

> 3). Xen CPU pools. You didn't say where the application that sends the IOs
>     is located. But if it was in a separate domain then you will want to use
>     Xen CPU pools. Basically this way you can get gang-scheduling, where the
>     guest that submits the I/O and the guest that picks up the I/O are running
>     right after each other. I don't know much more detail, but this is what
>     I understand it does.
> 4). CPU/MSI-X affinity. I think you already did this, but make sure you pin
>     your guest to specific CPUs and also pin the MSI-X (vectors) to the proper
>     destination. You can use 'xm debug-keys i' to see the MSI-X affinity - it
>     is a mask - and basically see if it overlays the CPUs you are running your
>     guest at. Not sure how to actually set the MSI-X affinity, now that I think
>     of it. Keir or some of the Intel folks might know better.

There are 16 devices (multi-function) that are PCI-passed through to domU. There are 16 VCPUs in domU, all pinned to individual PCPUs (24-CPU platform). Each IRQ in domU is affinitized to a CPU. This strategy has worked well for us with the HVM kernel.
Here's the output of 'xm debug-keys i':

(XEN)    IRQ:  67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a type=PCI-MSI status=00000010 in-flight=0 domain-list=1:127(----),
(XEN)    IRQ:  68 affinity:00000000,00000000,00000000,00000200 vec:43 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:126(----),
(XEN)    IRQ:  69 affinity:00000000,00000000,00000000,00000400 vec:83 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:125(----),
(XEN)    IRQ:  70 affinity:00000000,00000000,00000000,00000800 vec:4b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:124(----),
(XEN)    IRQ:  71 affinity:00000000,00000000,00000000,00001000 vec:8b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:123(----),
(XEN)    IRQ:  72 affinity:00000000,00000000,00000000,00002000 vec:53 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:122(----),
(XEN)    IRQ:  73 affinity:00000000,00000000,00000000,00004000 vec:93 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:121(----),
(XEN)    IRQ:  74 affinity:00000000,00000000,00000000,00008000 vec:5b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:120(----),
(XEN)    IRQ:  75 affinity:00000000,00000000,00000000,00010000 vec:9b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:119(----),
(XEN)    IRQ:  76 affinity:00000000,00000000,00000000,00020000 vec:63 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:118(----),
(XEN)    IRQ:  77 affinity:00000000,00000000,00000000,00040000 vec:a3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:117(----),
(XEN)    IRQ:  78 affinity:00000000,00000000,00000000,00080000 vec:6b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:116(----),
(XEN)    IRQ:  79 affinity:00000000,00000000,00000000,00100000 vec:ab type=PCI-MSI status=00000010 in-flight=0 domain-list=1:115(----),
(XEN)    IRQ:  80 affinity:00000000,00000000,00000000,00200000 vec:73 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:114(----),
(XEN)    IRQ:  81 affinity:00000000,00000000,00000000,00400000 vec:b3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:113(----),
(XEN)    IRQ:  82 affinity:00000000,00000000,00000000,00800000 vec:7b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:112(----),

> 5). Andrew, Mukesh, Keir, Dan, any other ideas?

We're also trying Chris' 4 things and will consider Mathieu's LTT suggestion.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
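Since the pci_map_single-to-pci_pool_alloc conversion above is what eliminated the swiotlb calls, here is a minimal sketch of that pattern against the 2.6.36-era API. The names (fc_hba, FC_CMD_SZ) are illustrative assumptions, not taken from the actual HBA driver:

/* Minimal sketch of replacing per-I/O streaming mappings with a
 * preallocated PCI pool (2.6.36-era API); torn down at unload with
 * pci_pool_destroy(). fc_hba and FC_CMD_SZ are illustrative names. */
#include <linux/pci.h>
#include <linux/dmapool.h>

#define FC_CMD_SZ 128   /* size of one DMA-able command buffer (assumed) */

struct fc_hba {
	struct pci_dev *pdev;
	struct pci_pool *cmd_pool;
};

static int fc_hba_create_pool(struct fc_hba *hba)
{
	/* Coherent buffers allocated once up front: nothing left for the
	 * swiotlb to bounce on the per-I/O path. */
	hba->cmd_pool = pci_pool_create("fc_cmds", hba->pdev,
					FC_CMD_SZ, 8 /* align */, 0);
	return hba->cmd_pool ? 0 : -ENOMEM;
}

static void *fc_hba_get_cmd(struct fc_hba *hba, dma_addr_t *dma)
{
	/* Replaces a per-I/O pci_map_single() on a kmalloc'ed buffer;
	 * *dma is a bus address the HBA can be handed directly. */
	return pci_pool_alloc(hba->cmd_pool, GFP_ATOMIC, dma);
}

static void fc_hba_put_cmd(struct fc_hba *hba, void *vaddr, dma_addr_t dma)
{
	pci_pool_free(hba->cmd_pool, vaddr, dma);
}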
Dante Cinco
2010-Nov-18 19:35 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
I mentioned in a previous post to this thread that I'm able to apply Dulloor's xenoprofile patch to the dom0 kernel but not the domU kernel, so I can't do active-domain profiling. I can do passive-domain profiling, but I don't know how reliable the results are since it shows pvclock_clocksource_read as the top consumer of CPU cycles at 28%.

CPU: Intel Architectural Perfmon, speed 2665.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name                                                   app name         symbol name
918089   27.9310  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   pvclock_clocksource_read
217811    6.6265  domain1-modules                                              domain1-modules  /domain1-modules
188327    5.7295  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         (same)           mutex_spin_on_owner
186684    5.6795  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __xen_spin_lock
149514    4.5487  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __write_lock_failed
123278    3.7505  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __kernel_text_address
122906    3.7392  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   xen_spin_unlock
90903     2.7655  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __spin_time_accum
85880     2.6127  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __module_address
75223     2.2885  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   print_context_stack
66778     2.0316  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __module_text_address
57389     1.7459  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   is_module_text_address
47282     1.4385  xen-syms-4.1-unstable                                        domain1-xen      syscall_enter
47219     1.4365  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   prio_tree_insert
46495     1.4145  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         (same)           pvclock_clocksource_read
44501     1.3539  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   prio_tree_left
32482     0.9882  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   native_read_tsc

I ran oprofile (0.9.5 with the xenoprofile patch) for 20 seconds while the I/Os were running. Here's the command I used:

opcontrol --start --xen=/boot/xen-syms-4.1-unstable --vmlinux=/boot/vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug --passive-domains=1 --passive-images=/boot/vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug

I had to remove dom0_max_vcpus=1 (but kept dom0_vcpus_pin=true) from the Xen command line; otherwise, oprofile only gives samples from CPU0.

I'm going to try perf next.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Nov-18 21:20 UTC
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
In case it is related:
http://lists.xensource.com/archives/html/xen-devel/2010-07/msg01247.html

Although I never went further on this investigation, it appeared to me that pvclock_clocksource_read was getting called at least an order of magnitude more frequently than expected in some circumstances for some kernels. And IIRC it was scaled by the number of vcpus.

> -----Original Message-----
> From: Dante Cinco [mailto:dantecinco@gmail.com]
> Sent: Thursday, November 18, 2010 12:36 PM
> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2
> pvops domU kernel with PCI passthrough
>
> I mentioned in a previous post to this thread that I'm able to apply
> Dulloor's xenoprofile patch to the dom0 kernel but not the domU
> kernel, so I can't do active-domain profiling. I can do
> passive-domain profiling, but I don't know how reliable the results
> are since it shows pvclock_clocksource_read as the top consumer of
> CPU cycles at 28%.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
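For context on the symbol being discussed: Xen publishes one pvclock time-info record per vcpu, which fits the observation that the cost scales with vcpu count, and every pvclock_clocksource_read is a version-checked read of that record. A simplified kernel-style sketch of the algorithm (from the pvclock ABI as commonly described, not verbatim kernel source; the real code uses a 128-bit multiply where this shows a plain 64-bit one):

/* Simplified sketch of the pvclock read path -- not verbatim kernel code.
 * Xen bumps 'version' to an odd value while updating the record, so the
 * guest must retry whenever it observes an odd or changed version. */
struct pvclock_vcpu_time_info {
	u32 version;           /* odd while Xen is mid-update */
	u32 pad0;
	u64 tsc_timestamp;     /* host TSC at last update */
	u64 system_time;       /* ns of uptime at last update */
	u32 tsc_to_system_mul; /* TSC-to-ns scale factor */
	s8  tsc_shift;
	u8  flags;
	u8  pad[2];
};

static u64 pvclock_read_sketch(volatile struct pvclock_vcpu_time_info *src)
{
	u32 version;
	u64 tsc, delta, ns;

	do {
		version = src->version;
		rmb();                     /* read version before the payload */
		tsc = native_read_tsc();   /* rdtsc */
		delta = tsc - src->tsc_timestamp;
		if (src->tsc_shift >= 0)
			delta <<= src->tsc_shift;
		else
			delta >>= -src->tsc_shift;
		/* real code: 128-bit multiply; truncated here for clarity */
		ns = src->system_time + ((delta * src->tsc_to_system_mul) >> 32);
		rmb();                     /* re-check version after the payload */
	} while ((version & 1) || version != src->version);

	return ns;
}

So every call is at least an rdtsc plus several shared-memory reads, and any concurrent update forces the retry loop, which is also relevant to the "rereading" observation later in this thread.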
Lin, Ray
2010-Nov-18 21:39 UTC
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dan Magenheimer
Sent: Thursday, November 18, 2010 1:21 PM
To: Dante Cinco; Konrad Wilk
Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@polymtl.ca; Andrew Thomas; keir.fraser@eu.citrix.com; Chris Mason
Subject: RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

In case it is related:
http://lists.xensource.com/archives/html/xen-devel/2010-07/msg01247.html

Although I never went further on this investigation, it appeared to me that pvclock_clocksource_read was getting called at least an order of magnitude more frequently than expected in some circumstances for some kernels. And IIRC it was scaled by the number of vcpus.

We did suspect it, since our old setting was HZ=1000 and we assigned more than 10 VCPUs to domU. But we don't see a performance difference with HZ=100.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Nov-19 00:20 UTC
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
> We did suspect it, since our old setting was HZ=1000 and we assigned
> more than 10 VCPUs to domU. But we don't see a performance difference
> with HZ=100.

FWIW, it didn't appear that the problems were proportional to HZ. It seemed more that somehow the pvclock became incorrect and the kernel spent a lot of time rereading the pvclock value.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dante Cinco
2010-Nov-19 01:38 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Thu, Nov 18, 2010 at 4:20 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> We did suspect it, since our old setting was HZ=1000 and we assigned
>> more than 10 VCPUs to domU. But we don't see a performance difference
>> with HZ=100.
>
> FWIW, it didn't appear that the problems were proportional to HZ.
> Seemed more that somehow the pvclock became incorrect and spent
> a lot of time rereading the pvclock value.

We decided to enable lock stat in the kernel to track down all those lock activities in the profile report. The first thing I noticed was that kmemleak was at the top of the list (/proc/lock_stat), so we disabled kmemleak. This boosted our I/O performance from 31k to 119k IOPS. One of our developers (Bruce Edge) suggested killing ntpd, so I did. This resulted in another significant bump in I/O performance, to 209k IOPS. The question now is: why ntpd? Is it the source of all or most of those pvclock_clocksource_read samples in the profile report?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
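On why lock stat itself can be this expensive: with CONFIG_LOCK_STAT, every contended lock acquisition runs extra timestamping and per-lock-class statistics updates. The following is an illustrative sketch only, not kernel source; account_contention() is a hypothetical stand-in for the kernel's real lock_contended()/lock_acquired() bookkeeping:

/* Illustrative sketch, not kernel source: the shape of what
 * CONFIG_LOCK_STAT adds to a contended spinlock acquisition.
 * account_contention() is a hypothetical stand-in for the real
 * lock_contended()/lock_acquired() statistics updates. */
static inline void lockstat_spin_lock(spinlock_t *lock)
{
	if (!spin_trylock(lock)) {
		u64 t0 = sched_clock();   /* extra clock read on every miss */
		spin_lock(lock);          /* the actual contended spin */
		account_contention(lock, sched_clock() - t0);
	}
}

Two extra clock reads plus a statistics update on every contended acquisition, across every lock in the kernel, lines up with the jump seen once the instrumentation was turned off.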
Jeremy Fitzhardinge
2010-Nov-19 17:10 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On 11/18/2010 11:35 AM, Dante Cinco wrote:
> I mentioned in a previous post to this thread that I'm able to apply
> Dulloor's xenoprofile patch to the dom0 kernel but not the domU
> kernel, so I can't do active-domain profiling. I can do
> passive-domain profiling, but I don't know how reliable the results
> are since it shows pvclock_clocksource_read as the top consumer of
> CPU cycles at 28%.

Is rdtsc emulation on? (I forget what the incantation is for that now.)

    J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dante Cinco
2010-Nov-19 17:52 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Fri, Nov 19, 2010 at 9:10 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Is rdtsc emulation on? (I forget what the incantation is for that now.)

How do I check if rdtsc emulation is on? Does 'xm debug-keys s' do it?

(XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch input to DOM0)
(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0xca6f68770,khz=2666017,inc=1
(XEN) No domains have emulated TSC

I'm using xen-unstable-4.1 (22388:87f248de5230).

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
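As a cross-check on the hypervisor's report, rdtsc interception is also visible from inside the guest: a natively executed rdtsc costs on the order of tens of cycles, while an emulated one traps into Xen and costs thousands. A small illustrative user-space probe (the cycle thresholds are rough rules of thumb, not exact figures):

/* Times back-to-back rdtsc instructions from user space. Roughly:
 * tens of cycles each => native; thousands => trapped/emulated. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { N = 1000 };
	uint64_t start, end;
	int i;

	start = rdtsc();
	for (i = 0; i < N; i++)
		(void)rdtsc();
	end = rdtsc();

	printf("avg cycles per rdtsc: %llu\n",
	       (unsigned long long)((end - start) / N));
	return 0;
}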
Lin, Ray
2010-Nov-19 17:55 UTC
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Jeremy Fitzhardinge
Sent: Friday, November 19, 2010 9:10 AM
To: Dante Cinco
Cc: Xen-devel; mathieu.desnoyers@polymtl.ca; andrew.thomas@oracle.com; Konrad Rzeszutek Wilk; keir.fraser@eu.citrix.com; chris.mason@oracle.com
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

On 11/18/2010 11:35 AM, Dante Cinco wrote:
> I can do passive-domain profiling, but I don't know how reliable the
> results are since it shows pvclock_clocksource_read as the top
> consumer of CPU cycles at 28%.

Is rdtsc emulation on? (I forget what the incantation is for that now.)

    J

We don't specify it for the domU; the default should be tsc_mode==0. Does a PV domain always enable emulation if tsc_mode==0?

-Ray

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2010-Nov-19 17:58 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On 19/11/2010 17:52, "Dante Cinco" <dantecinco@gmail.com> wrote:

> How do I check if rdtsc emulation is on? Does 'xm debug-keys s' do it?
>
> (XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch
> input to DOM0)
> (XEN) TSC marked as reliable, warp = 0 (count=2)
> (XEN) dom1: mode=0,ofs=0xca6f68770,khz=2666017,inc=1
> (XEN) No domains have emulated TSC

TSC emulation is not enabled.

 -- Keir

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Nov-19 22:36 UTC
RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
> From: Keir Fraser [mailto:keir@xen.org]
> Sent: Friday, November 19, 2010 10:58 AM
> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2
> pvops domU kernel with PCI passthrough
>
> On 19/11/2010 17:52, "Dante Cinco" <dantecinco@gmail.com> wrote:
>
>> How do I check if rdtsc emulation is on? Does 'xm debug-keys s' do it?
>>
>> (XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch
>> input to DOM0)
>> (XEN) TSC marked as reliable, warp = 0 (count=2)
>> (XEN) dom1: mode=0,ofs=0xca6f68770,khz=2666017,inc=1
>> (XEN) No domains have emulated TSC
>
> TSC emulation is not enabled.

I *think* "No domains have emulated TSC" will be printed if there are no domains other than dom0 currently running, so this may not be definitive.

Also note that tsc_mode=0 means "do the right thing for this hardware platform" but, if the domain is saved/restored or live-migrated, TSC will start being emulated. See tscmode.txt in xen/Documentation for more detail.

Lastly, I haven't tested this code in quite some time, the code for PV and HVM is different, and I've never tested it with xl, only with xm. So bitrot is possible, though hopefully unlikely.

Thanks,
Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
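For reference, tsc_mode can also be pinned explicitly in the domain's cfg file rather than left to the default; per tscmode.txt the values are 0 (default: decide per platform), 1 (always emulate), 2 (never emulate), and 3 (PVRDTSCP). A one-line example (the file path is illustrative):

# in the domU config file, e.g. /etc/xen/domu.cfg; 0 is the default
# this domain is already running with, per xend.log
tsc_mode = 0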
Dante Cinco
2010-Nov-20 00:13 UTC
Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
On Fri, Nov 19, 2010 at 2:36 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> I *think* "No domains have emulated TSC" will be printed
> if there are no domains other than dom0 currently running,
> so this may not be definitive.

The pvops domU was running when I captured that Xen console output, and I also looked at /var/log/xen/xend.log and saw 'tsc_mode 0' without explicitly setting tsc_mode in the domain's cfg file.

> Also note that tsc_mode=0 means "do the right thing for
> this hardware platform" but, if the domain is saved/restored
> or live-migrated, TSC will start being emulated. See
> tscmode.txt in xen/Documentation for more detail.

We have not done any save/restore on domU.

pvclock_clocksource_read is no longer the top symbol (it was 28% of the CPU samples) in the latest xenoprofile report. I had mistakenly attributed the huge I/O performance gain (from 119k IOPS to 209k IOPS) to the act of killing ntpd, but that was not the case. In fact, the gain came from turning off lock stat. I had enabled lock stat in the kernel to try to track down the lock-associated symbols in the profile report, and I had forgotten that I turned lock stat off (echo 0 > /proc/sys/kernel/lock_stat) just before killing ntpd. With lock stat disabled in the kernel, I get 209k IOPS without killing ntpd.

The latest xenoprofile report doesn't even have pvclock_clocksource_read in the top 10. All the I/O processing in domU (domID=1) is done in our kernel driver modules, so domain1-modules is expected to be at the top of the list.
CPU: Intel Architectural Perfmon, speed 2665.97 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name                                                   app name             symbol name
542839   17.2427  domain1-modules                                              domain1-modules      /domain1-modules
378968   12.0375  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       xen_spin_unlock
250342    7.9518  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         (same)               mutex_spin_on_owner
206585    6.5620  xen-syms-4.1-unstable                                        domain1-xen          syscall_enter
123021    3.9076  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       lock_release
103703    3.2940  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       __lock_acquire
100973    3.2073  domain1-xen-unknown                                          domain1-xen-unknown  /domain1-xen-unknown
94449     3.0001  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       hypercall_page
67145     2.1328  xen-syms-4.1-unstable                                        domain1-xen          restore_all_guest
64460     2.0475  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       xen_spin_trylock
62415     1.9825  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       xen_restore_fl_direct
51822     1.6461  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       native_read_tsc
45901     1.4580  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         (same)               pvclock_clocksource_read
44398     1.4103  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       debug_locks_off
42191     1.3402  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       find_next_bit
41913     1.3313  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       do_raw_spin_lock
41424     1.3158  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       lock_acquire
39275     1.2475  xen-syms-4.1-unstable                                        domain1-xen          do_xen_version

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
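A closing note on the spinlock symbols that now top the domU profile (xen_spin_unlock, xen_spin_trylock): with CONFIG_PARAVIRT_SPINLOCKS the guest's slow path spins only briefly and then blocks on a per-cpu Xen event channel via a SCHEDOP_poll hypercall, with the unlocker kicking any blocked waiter. The following is a self-contained user-space toy showing the shape of that protocol, not the kernel's implementation; sched_yield() stands in for the hypercall and the struct is illustrative:

/* Toy model of the Xen pv-spinlock idea: trylock fast path, brief spin,
 * then "block" (here: sched_yield(); in the kernel: SCHEDOP_poll on an
 * event channel). Unlock checks for waiters to kick, which is why
 * xen_spin_unlock is not free. */
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

struct pvlock {
	atomic_flag held;
	atomic_int  waiters;   /* the kernel tracks spinners per-cpu */
};

static bool pv_trylock(struct pvlock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void pv_lock(struct pvlock *l)
{
	for (;;) {
		int spin;

		for (spin = 0; spin < 1000; spin++)    /* brief busy-wait */
			if (pv_trylock(l))
				return;

		atomic_fetch_add(&l->waiters, 1);
		sched_yield();    /* kernel: block on the event channel */
		atomic_fetch_sub(&l->waiters, 1);
	}
}

static void pv_unlock(struct pvlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
	if (atomic_load(&l->waiters) > 0) {
		/* kernel: notify the event channel to wake a blocked vcpu */
	}
}

int main(void)
{
	struct pvlock l = { .held = ATOMIC_FLAG_INIT };

	pv_lock(&l);
	pv_unlock(&l);
	return 0;
}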