Andrew Theurer
2005-May-31 22:01 UTC
[Xen-devel] More network tests with xenoprofile this time
I had a chance to run a couple of the netperf tests with xenoprofile. I am still having some trouble with multi-domain profiles (probably user error), but I have been able to profile dom0 while running 2 types of tests. I was surprised to see as much as 50% cpu in hypervisor on these tests:

netperf tcp_stream 16k msg size, dom1 -> dom2
dom0 is on cpu0, HT thread 0; dom1 is on cpu1, HT thread 0; dom2 is on cpu1, HT thread 1.

Throughput is ~900 Mbps

xenoprofile opreport:

4914314  61.2189  vmlinux-2.6.11-xen0-up
3022609  37.6534  xen-unstable-syms
  79516   0.9906  oprofiled
   3602   0.0449  libc-2.3.3.so
   2764   0.0344  libpython2.3.so.1.0

xenoprofile opreport -l:

1656571  20.64  vmlinux-2.6.11-xen0-up  skb_copy_bits
 457043   5.69  vmlinux-2.6.11-xen0-up  net_tx_action
 361259   4.50  xen-unstable-syms       do_mmuext_op
 325335   4.05  xen-unstable-syms       find_domain_by_id
 277331   3.45  xen-unstable-syms       __copy_from_user_ll
 242850   3.03  xen-unstable-syms       do_update_va_mapping
 208405   2.60  vmlinux-2.6.11-xen0-up  kfree
 200640   2.50  xen-unstable-syms       do_mmu_update
 199645   2.49  xen-unstable-syms       get_page_from_l1e
 189219   2.36  xen-unstable-syms       put_page_from_l1e
 185831   2.31  xen-unstable-syms       get_page_type
 172362   2.15  vmlinux-2.6.11-xen0-up  make_rx_response
 171329   2.13  vmlinux-2.6.11-xen0-up  nf_iterate
 165977   2.07  vmlinux-2.6.11-xen0-up  net_rx_action
 165341   2.06  xen-unstable-syms       mod_l1_entry
 156055   1.94  vmlinux-2.6.11-xen0-up  nf_hook_slow
 116876   1.46  xen-unstable-syms       alloc_domheap_pages
 116650   1.45  xen-unstable-syms       evtchn_send
 116215   1.45  vmlinux-2.6.11-xen0-up  fdb_insert
 111314   1.39  vmlinux-2.6.11-xen0-up  make_tx_response
 108480   1.35  xen-unstable-syms       alloc_heap_pages
 107896   1.34  xen-unstable-syms       hypercall
  99013   1.23  vmlinux-2.6.11-xen0-up  netif_be_start_xmit
  91792   1.14  vmlinux-2.6.11-xen0-up  br_handle_frame

netperf tcp_stream 16k msg size, dom1 -> external host
dom0 is on cpu0, HT thread 0; dom1 is on cpu1, HT thread 1.

Throughput is ~940 Mbps, wire speed.

xenoprofile opreport:

4244562  49.9375  xen-unstable-syms
4110594  48.3614  vmlinux-2.6.11-xen0-up
 132643   1.5606  oprofiled
   4212   0.0496  libc-2.3.3.so
   2892   0.0340  libpython2.3.so.1.0

xenoprofile opreport -l:

828587  9.75  xen-unstable-syms       end_level_ioapic_irq
712035  8.38  xen-unstable-syms       mask_and_ack_level_ioapic_irq
370265  4.36  vmlinux-2.6.11-xen0-up  net_tx_action
323797  3.81  vmlinux-2.6.11-xen0-up  ohci_irq
282005  3.32  vmlinux-2.6.11-xen0-up  tg3_interrupt
273161  3.21  xen-unstable-syms       find_domain_by_id
234726  2.76  xen-unstable-syms       hypercall
206693  2.43  xen-unstable-syms       do_update_va_mapping
203758  2.40  xen-unstable-syms       __copy_from_user_ll
201665  2.37  xen-unstable-syms       do_mmuext_op
195020  2.29  vmlinux-2.6.11-xen0-up  nf_iterate
184295  2.17  vmlinux-2.6.11-xen0-up  nf_hook_slow
172110  2.02  vmlinux-2.6.11-xen0-up  tg3_rx
164337  1.93  vmlinux-2.6.11-xen0-up  net_rx_action
141999  1.67  xen-unstable-syms       do_mmu_update
139120  1.64  vmlinux-2.6.11-xen0-up  fdb_insert
122483  1.44  xen-unstable-syms       mod_l1_entry
122017  1.44  xen-unstable-syms       put_page_from_l1e
111159  1.31  xen-unstable-syms       get_page_from_l1e
109921  1.29  xen-unstable-syms       do_IRQ
 99847  1.17  vmlinux-2.6.11-xen0-up  br_handle_frame
 99709  1.17  xen-unstable-syms       get_page_type
 93613  1.10  vmlinux-2.6.11-xen0-up  kfree
 90885  1.07  vmlinux-2.6.11-xen0-up  end_pirq

-Andrew
Ian Pratt
2005-May-31 22:16 UTC
RE: [Xen-devel] More network tests with xenoprofile this time
> I had a chance to run a couple of the netperf tests with
> xenoprofile. I am still having some trouble with
> multi-domain profiles (probably user error), but I have been
> able to profile dom0 while running 2 types of tests. I was
> surprised to see as much as 50% cpu in hypervisor on these tests:
>
> netperf tcp_stream 16k msg size, dom1 -> dom2
> dom0 is on cpu0, HT thread 0; dom1 is on cpu1, HT thread 0;
> dom2 is on cpu1, HT thread 1.

Let's ignore the domU <-> domU results for the moment, as we know about the problem with lack of batching in this scenario. Let's dig into the dom1 -> external case.

First off, are these figures just for CPU 0 HT 0? i.e. just dom0, so we don't see where time goes in the domU? How is idle time on the CPU reported?

Spending 18% of the time handling interrupts in Xen is surprising (at least to me).

What interrupt rate are you observing? What are the default tg3 interrupt coalescing settings? What interrupt rate do you get on native? Also, what hypercall rate are you seeing?

(It would be good to put this in context of the rx/tx packet rates.)

Is the Ethernet NIC sharing an interrupt with the USB controller, per chance?

Seeing find_domain_by_id and copy_from_user so high up the list is pretty surprising.

Cheers,
Ian
Andrew Theurer
2005-May-31 22:38 UTC
Re: [Xen-devel] More network tests with xenoprofile this time
On Tuesday 31 May 2005 17:16, Ian Pratt wrote:
> Let's ignore the domU <-> domU results for the moment, as we know
> about the problem with lack of batching in this scenario. Let's dig
> into the dom1 -> external case.
>
> First off, are these figures just for CPU 0 HT 0? i.e. just dom0, so
> we don't see where time goes in the domU? How is idle time on the CPU
> reported?

Yes, this is just for CPU 0 HT 0. DomU is pinned to its own cpu, which is CPU 1 HT 0. I have cpu util from polling xc_domain_get_cpu_usage() for both domains, which is (an excerpt from the whole run, in 3 second intervals):

cpu0: [100.4] d0-0[100.4]   cpu2: [045.1] d1-0[045.1]
cpu0: [100.0] d0-0[100.0]   cpu2: [045.1] d1-0[045.1]
cpu0: [099.6] d0-0[099.6]   cpu2: [045.1] d1-0[045.1]
cpu0: [101.3] d0-0[101.3]   cpu2: [045.3] d1-0[045.3]
cpu0: [099.7] d0-0[099.7]   cpu2: [045.1] d1-0[045.1]
cpu0: [099.7] d0-0[099.7]   cpu2: [045.0] d1-0[045.0]

This is fairly consistent for the whole test.

> Spending 18% of the time handling interrupts in Xen is surprising
> (at least to me).
>
> What interrupt rate are you observing? What are the default tg3
> interrupt coalescing settings? What interrupt rate do you get on
> native? Also, what hypercall rate are you seeing?
>
> (It would be good to put this in context of the rx/tx packet rates.)

I don't have that data from this test, but I am queuing up another with sar, so I should have it soon. I will also queue up a test with just baremetal linux so we can compare int rates, etc.

> Is the Ethernet NIC sharing an interrupt with the USB controller, per
> chance?

Not as far as I can tell:

           CPU0
  1:         8    Phys-irq     i8042
  3:         0    Phys-irq     acpi
  4:      3031    Phys-irq     serial
 11:   6764395    Phys-irq     ohci_hcd
 12:        93    Phys-irq     i8042
 15:     38687    Phys-irq     ide1
 18:     39398    Phys-irq     qla2300
 22:     47905    Phys-irq     ioc0
 24:   6037311    Phys-irq     eth0
256:         7    Dynamic-irq  ctrl-if
257:    182396    Dynamic-irq  timer0
258:         0    Dynamic-irq  net-be-dbg
259:     83437    Dynamic-irq  blkif-backend
260:   1688517    Dynamic-irq  vif1.0

> Seeing find_domain_by_id and copy_from_user so high up the list is
> pretty surprising.

Yes.

-Andrew
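[As a side note on the interrupt-rate question: the counts in /proc/interrupts above are cumulative, so the rate has to be taken as a delta over an interval. A minimal illustrative sketch of doing that in C is below; it only sums the CPU0 column shown above, is not part of the test setup described in this thread, and its parsing is deliberately crude.]

/* Rough sketch: estimate overall interrupts/sec by summing the per-IRQ
 * counts in /proc/interrupts, sleeping, and re-reading.  Only the first
 * numeric field after the "NNN:" prefix (the CPU0 column) is counted. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static unsigned long long total_interrupts(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[512];
    unsigned long long total = 0;

    if (f == NULL)
        return 0;
    while (fgets(line, sizeof(line), f) != NULL) {
        char *colon = strchr(line, ':');
        if (colon != NULL)
            total += strtoull(colon + 1, NULL, 10);  /* CPU0 column only */
    }
    fclose(f);
    return total;
}

int main(void)
{
    unsigned long long before = total_interrupts();
    sleep(10);
    printf("%llu interrupts/sec\n", (total_interrupts() - before) / 10);
    return 0;
}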
Ian Pratt
2005-May-31 22:48 UTC
RE: [Xen-devel] More network tests with xenoprofile this time
> I have cpu util from polling xc_domain_get_cpu_usage() for
> both domains, which is (an excerpt from the whole run, in 3
> second intervals):
>
> cpu0: [100.4] d0-0[100.4]
> cpu2: [045.1] d1-0[045.1]

OK, so you're confident idle time would be reported OK if there was any.

> > Is the Ethernet NIC sharing an interrupt with the USB
> > controller, per chance?
>
> Not as far as I can tell:
>
>            CPU0
>  11:   6764395    Phys-irq     ohci_hcd
>  24:   6037311    Phys-irq     eth0
> 260:   1688517    Dynamic-irq  vif1.0

Anyone care to suggest why ohci_hcd is taking so many interrupts? Looks very fishy to me. I take it you're not using a USB Ethernet NIC? :-)

What happens if you boot 'nousb'?

> > Seeing find_domain_by_id and copy_from_user so high up the list is
> > pretty surprising.
>
> Yes.

Definitely worth looking into... Sorry for all the questions. This work is much appreciated.

Ian
Santos, Jose Renato G
2005-Jun-01 00:15 UTC
RE: [Xen-devel] More network tests with xenoprofile this time
Andrew,

You may want to take a look at the following paper, which is being presented at VEE'05 (June 11 and 12, 2005):

http://www.hpl.hp.com/research/dca/system/papers/xenoprof-vee05.pdf

It presents network performance results using xenoprof. This was done for Xen 2.0.3. The profile you reported has some similarities with our results, although the exact numbers are different. That is expected, since you are running a different version of Xen on different hardware.

We have seen that a significant amount of time was spent on handling interrupts in Xen as well. We have also seen that a significant amount of time is spent in the hypervisor (+/- 40%) for the dom1 <-> external case, measured both at dom1 and at dom0 (in our case we instrumented the receive side). When we run the benchmark on dom0, the time spent in Xen is reduced to +/- 20%. Most of this extra Xen overhead when running a guest seems to come from the page transfer between domain 0 and the guest (see table 6 and discussion in the paper).

The paper omits the complete oprofile reports for brevity. I will be happy to send you any detailed oprofile report we have generated for the paper, if you want to compare it with your results. Just let me know ...

Renato
Jon Mason
2005-Jun-01 20:03 UTC
Re: [Xen-devel] More network tests with xenoprofile this time
On Tuesday 31 May 2005 05:48 pm, Ian Pratt wrote:
> > I have cpu util from polling xc_domain_get_cpu_usage() for
> > both domains, which is (an excerpt from the whole run, in 3
> > second intervals):
> >
> > cpu0: [100.4] d0-0[100.4]
> > cpu2: [045.1] d1-0[045.1]
>
> OK, so you're confident idle time would be reported OK if there was any.
>
> > > Is the Ethernet NIC sharing an interrupt with the USB
> > > controller, per chance?
> >
> > Not as far as I can tell:
> >
> >            CPU0
> >  11:   6764395    Phys-irq     ohci_hcd
> >  24:   6037311    Phys-irq     eth0
> > 260:   1688517    Dynamic-irq  vif1.0
>
> Anyone care to suggest why ohci_hcd is taking so many interrupts? Looks
> very fishy to me. I take it you're not using a USB Ethernet NIC? :-)

The bladecenters have a shared USB connected to all the blades. I would imagine it is the keyboard/mouse or USB CDROM connected to this bus that is generating all of these interrupts.

> What happens if you boot 'nousb'?

This shouldn't hurt anything, unless Andrew needs access to kdb or cdrom.

Thanks,
Jon
Andrew Theurer
2005-Jun-01 20:21 UTC
Re: [Xen-devel] More network tests with xenoprofile this time
On Wednesday 01 June 2005 15:03, Jon Mason wrote:
> On Tuesday 31 May 2005 05:48 pm, Ian Pratt wrote:
> > Anyone care to suggest why ohci_hcd is taking so many interrupts?
> > Looks very fishy to me. I take it you're not using a USB Ethernet
> > NIC? :-)
>
> The bladecenters have a shared USB connected to all the blades. I
> would imagine it is the keyboard/mouse or USB CDROM connected to this
> bus that is generating all of these interrupts.
>
> > What happens if you boot 'nousb'?
>
> This shouldn't hurt anything, unless Andrew needs access to kdb or
> cdrom.

This is on an x336 system, P4 Xeon, not much USB really needed. I did not see any difference in performance or the profile with nousb.

I also tried disabling the locks in find_domain_by_id and saw no difference. I'm curious to see how things differ with dom0 on CPU-0 HT-0 and dom1 on CPU-0 HT-1. I will probably try that next.

FWIW, baremetal linux used about 33% of one cpu to drive the same throughput. The interrupt rate was 41k/sec for baremetal vs 59k/sec for dom0. I don't have the breakdown of ints/sec per interrupt number yet.

-Andrew
Andrew Theurer
2005-Jun-02 14:53 UTC
Re: [Xen-devel] More network tests with xenoprofile this time
On Wednesday 01 June 2005 15:21, Andrew Theurer wrote:
> This is on an x336 system, P4 Xeon, not much USB really needed. I did
> not see any difference in performance or the profile with nousb.
>
> I also tried disabling the locks in find_domain_by_id and saw no
> difference. I'm curious to see how things differ with dom0 on CPU-0
> HT-0 and dom1 on CPU-0 HT-1. I will probably try that next.
>
> FWIW, baremetal linux used about 33% of one cpu to drive the same
> throughput. The interrupt rate was 41k/sec for baremetal vs 59k/sec
> for dom0. I don't have the breakdown of ints/sec per interrupt number
> yet.

Wanted to follow up with one correction: I did not have usb disabled properly. With usb properly removed, there is a slight reduction in irq handling overhead as a result:

542129  6.2205  xen-unstable-syms       mask_and_ack_level_ioapic_irq
506060  5.8067  xen-unstable-syms       end_level_ioapic_irq
475786  5.4593  vmlinux-2.6.11-xen0-up  net_tx_action
376309  4.3179  vmlinux-2.6.11-xen0-up  tg3_interrupt
263008  3.0178  xen-unstable-syms       find_domain_by_id
239789  2.7514  xen-unstable-syms       hypercall
224547  2.5765  vmlinux-2.6.11-xen0-up  nf_iterate

...vs about 8-9% each for the top two functions before. The interrupt rate for the tg3 adapter is still very high, about 24k/sec. At that rate it does not appear to have any interrupt coalescing going on, so I am going to look into that.

-Andrew
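[For anyone wanting to check the tg3 coalescing parameters from dom0, a minimal sketch using the standard ethtool ioctl (ETHTOOL_GCOALESCE) is below. The interface name eth0 comes from the report above; whether the 2.6.11-era tg3 driver fills in every field is an assumption, not something verified in this thread.]

/* Sketch: query NIC interrupt-coalescing settings via the ethtool ioctl. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_coalesce ec;
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* NIC from the report above */
    memset(&ec, 0, sizeof(ec));
    ec.cmd = ETHTOOL_GCOALESCE;
    ifr.ifr_data = (char *)&ec;

    if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_GCOALESCE");
        return 1;
    }
    printf("rx-usecs=%u rx-frames=%u tx-usecs=%u tx-frames=%u\n",
           ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames,
           ec.tx_coalesce_usecs, ec.tx_max_coalesced_frames);
    close(fd);
    return 0;
}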
William Cohen
2005-Jun-07 21:47 UTC
Re: [Xen-devel] More network tests with xenoprofile this time
Santos, Jose Renato G wrote:
> You may want to take a look at the following paper, which is being
> presented at VEE'05 (June 11 and 12, 2005):
>
> http://www.hpl.hp.com/research/dca/system/papers/xenoprof-vee05.pdf
>
> It presents network performance results using xenoprof.

Hi Renato,

The article was an interesting application of xenoprof.

It seems like it would be useful to also have data collected using the cycle counts (GLOBAL_POWER_EVENTS on P4) to give some indication of areas with high-overhead operations. There may be some areas with a few very expensive instructions. Calling attention to those areas would help improve performance.

The increases in I-TLB and D-TLB events for Xen-domain0 shown in Figure 4 are surprising. Why would the working sets be that much larger for Xen-domain0 than regular linux, particularly for code? Is there a table similar to table 3 for I-TLB event sample locations?

Can't the VMM use a 4-MB page? The Xen-domain0 kernel shouldn't be that much larger than the regular linux kernel. How were TLB flushes ruled out as a cause? Could the PERFCOUNTER_CPU counters in perfc_defn.h be used to see if the VMM is doing a lot of TLB flushes?

Also, how much of the I-TLB and D-TLB events are due to the P4 architecture? Are the results as dramatic for Athlon or AMD64 processors?

-Will
Andrew Theurer
2005-Jun-08 19:20 UTC
Re: [Xen-devel] More network tests with xenoprofile this time
> The increases in I-TLB and D-TLB events for Xen-domain0 shown in
> Figure 4 are surprising. Why would the working sets be that much
> larger for Xen-domain0 than regular linux, particularly for code? Is
> there a table similar to table 3 for I-TLB event sample locations?
>
> Can't the VMM use a 4-MB page? The Xen-domain0 kernel shouldn't be
> that much larger than the regular linux kernel. How were TLB flushes
> ruled out as a cause? Could the PERFCOUNTER_CPU counters in
> perfc_defn.h be used to see if the VMM is doing a lot of TLB flushes?

I had the same concern as you, and IMO it seemed unlikely that the working set for dom0 would be so much larger as to cause a significant amount of TLB misses. I also suspect TLB flushes to be the problem, but I have not had a chance to look at it. I hope to very soon.

-Andrew
Santos, Jose Renato G
2005-Jun-17 19:39 UTC
RE: [Xen-devel] More network tests with xenoprofile this time
William and Andrew,

Sorry for the delay in replying. I have been traveling and did not have email access while away.

> It seems like it would be useful to also have data collected using the
> cycle counts (GLOBAL_POWER_EVENTS on P4) to give some indication of
> areas with high-overhead operations. There may be some areas with a few
> very expensive instructions. Calling attention to those areas would
> help improve performance.

Yes, you are right. We did in fact collect GLOBAL_POWER_EVENTS, but did not include them in the paper due to space limitations. I have attached oprofile results for our ttcp-like benchmark (receive side) for the case with 1 NIC (both cycle counts and instructions). As you can see, there are some functions with very expensive instructions. For example, "hypercall" adds only 0.6% additional instructions, but these consume 3.0% more clock cycles; "unmask_IO_APIC_irq" adds 0.25% instructions but consumes 5% more cycles. It would be interesting to investigate these and see if we can optimize them.

> The increases in I-TLB and D-TLB events for Xen-domain0 shown in Figure
> 4 are surprising. Why would the working sets be that much larger for
> Xen-domain0 than regular linux, particularly for code? Is there a table
> similar to table 3 for I-TLB event sample locations?

Yes, we were also surprised by these results. I have attached the complete I-TLB and D-TLB oprofile results (for the 3-NIC case). (Note these are from a different type of machine than the other two attached oprofile results.)

Aravind instrumented the macros in xen/include/asm-x86/flushtlb.h. I am not sure if he used PERFCOUNTER_CPU or if he added his own instrumentation. With this instrumentation we did not observe any TLB flushes, but I suppose we could have missed TLB flushes that did not use the macro... I think it would be a good idea to investigate this further to confirm that TLB flushes are not happening.

One additional observation is that, in general, the number of misses is NOT proportional to the size of the working set. It is possible that a small increase in the working set significantly increases the number of misses. Therefore it is possible that the increase in TLB misses is in fact due to a larger working set. But I agree we have to investigate this further to get confirmation ...

> Can't the VMM use a 4-MB page? The Xen-domain0 kernel shouldn't be
> that much larger than the regular linux kernel. How were TLB flushes
> ruled out as a cause? Could the PERFCOUNTER_CPU counters in perfc_defn.h
> be used to see if the VMM is doing a lot of TLB flushes?
>
> Also, how much of the I-TLB and D-TLB events are due to the P4
> architecture? Are the results as dramatic for Athlon or AMD64 processors?

We did not try this on any other architecture. Right now xenoprof is only supported on P4. Support for other architectures is not at the top of our priority list.

Regards,

Renato
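[A minimal sketch of the kind of instrumentation being discussed, assuming Xen's perfc machinery: the PERFCOUNTER_CPU declaration style from perfc_defn.h is named in the thread, but the increment helper (perfc_incrc()) and the flush macro name shown here are assumptions and may differ in a given tree.]

/* In xen/include/asm-x86/perfc_defn.h -- declare a per-CPU counter
 * (assumes the PERFCOUNTER_CPU declaration style mentioned above). */
PERFCOUNTER_CPU(tlb_flushes_local, "local TLB flushes")

/* In xen/include/asm-x86/flushtlb.h -- bump the counter wherever the
 * local flush is actually issued.  The wrapped macro name below is
 * illustrative; hook whichever flush path the tree really uses. */
#define local_flush_tlb_counted()               \
    do {                                        \
        perfc_incrc(tlb_flushes_local);         \
        local_flush_tlb();                      \
    } while ( 0 )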
Andrew Theurer
2005-Jun-22 02:22 UTC
Re: [Xen-devel] More network tests with xenoprofile this time
FWIW, I took a look at find_domain_by_id() and swapped out the rw lock for a spin lock for domlist_lock. The time spent in this function was reduced by 18%. This function was certainly not the "hottest" one I recorded while running netperf, but every little bit helps, I suppose. Below are before/after xenoprofile snippets:

before:

547036  6.32  xen-unstable-syms       mask_and_ack_level_ioapic_irq
510448  5.90  xen-unstable-syms       end_level_ioapic_irq
463386  5.35  vmlinux-2.6.11-xen0-up  net_tx_action
371072  4.29  tg3.ko                  tg3_interrupt
261341  3.02  xen-unstable-syms       find_domain_by_id
237601  2.75  xen-unstable-syms       hypercall
228649  2.64  vmlinux-2.6.11-xen0-up  nf_iterate
215634  2.49  xen-unstable-syms       do_update_va_mapping
214077  2.47  vmlinux-2.6.11-xen0-up  net_rx_action

after:

549276  6.35  xen-unstable-syms       mask_and_ack_level_ioapic_irq
511693  5.91  xen-unstable-syms       end_level_ioapic_irq
466873  5.39  vmlinux-2.6.11-xen0-up  net_tx_action
375702  4.34  tg3.ko                  tg3_interrupt
239219  2.76  xen-unstable-syms       hypercall
230641  2.67  vmlinux-2.6.11-xen0-up  nf_iterate
220480  2.55  xen-unstable-syms       do_update_va_mapping
217472  2.51  tg3.ko                  tg3_rx
217029  2.51  vmlinux-2.6.11-xen0-up  net_rx_action
214271  2.48  xen-unstable-syms       find_domain_by_id

Has anyone thought about using read-copy-update in Xen? I plan on looking at the two irq functions next.

-Andrew
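[For reference on the read-copy-update question: an RCU-style lookup would let readers walk the domain list without taking domlist_lock at all. The sketch below is purely illustrative, using Linux-kernel-style RCU primitives (rcu_read_lock/rcu_dereference) and hypothetical names (domain_hash, DOMAIN_HASH, next_in_hashbucket, get_domain); it is not the actual Xen code, and writers would still serialize on a lock and defer freeing via call_rcu()/synchronize_rcu().]

/* Illustrative only: a lockless, RCU-style reader for a domain lookup.
 * Structure and field names are hypothetical. */
struct domain *find_domain_by_id_rcu(domid_t dom)
{
    struct domain *d;

    rcu_read_lock();                       /* no domlist_lock on the read side */
    for ( d = rcu_dereference(domain_hash[DOMAIN_HASH(dom)]);
          d != NULL;
          d = rcu_dereference(d->next_in_hashbucket) )
    {
        if ( d->domain_id == dom )
        {
            if ( unlikely(!get_domain(d)) ) /* refcount against destruction */
                d = NULL;
            break;
        }
    }
    rcu_read_unlock();

    return d;
}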