Wei Liu
2013-Jun-12 10:14 UTC
Interesting observation with network event notification and batching
Hi all,

I'm hacking on netback trying to identify whether TLB flushes cause a heavy
performance penalty on the Tx path. The hack is quite nasty (you would not
want to know, trust me).

Basically what is doesn't is, 1) alter the network protocol to pass along
mfns instead of grant references, 2) when the backend sees a new mfn, map it
RO and cache it in its own address space.

With this hack we now have some sort of zero-copy TX path. The backend
doesn't need to issue any grant copy / map operation any more. When it sees
a new packet in the ring, it just needs to pick up the pages in its own
address space, assemble packets with those pages and pass the packets on to
the network stack.

In theory this should boost performance, but in practice it is the other way
around. This hack makes Xen networking more than 50% slower than before
(OMG). Further investigation shows that with this hack the batching ability
is gone. Before this hack, netback batches around 64 slots per interrupt
event; after this hack it only batches 3 slots per interrupt event -- that's
no batching at all, because we can expect one packet to occupy 3 slots.

Time to have some figures (iperf from DomU to Dom0).

Before the hack, doing grant copy, throughput: 7.9 Gb/s, average slots per
batch 64.

After the hack, throughput: 2.5 Gb/s, average slots per batch 3.

After the hack, adding 64 HYPERVISOR_xen_version calls (it just does a
context switch into the hypervisor) in the Tx path, throughput: 3.2 Gb/s,
average slots per batch 6.

After the hack, adding 256 HYPERVISOR_xen_version calls in the Tx path,
throughput: 5.2 Gb/s, average slots per batch 26.

After the hack, adding 512 HYPERVISOR_xen_version calls in the Tx path,
throughput: 7.9 Gb/s, average slots per batch 26.

After the hack, adding 768 HYPERVISOR_xen_version calls in the Tx path,
throughput: 5.6 Gb/s, average slots per batch 25.

After the hack, adding 1024 HYPERVISOR_xen_version calls in the Tx path,
throughput: 4.4 Gb/s, average slots per batch 25.

Average slots per batch is calculated as follows:
 1. count total_slots processed from start of day
 2. count tx_count, which is the number of times the tx_action function
    gets invoked
 3. avg_slots_per_tx = total_slots / tx_count

The counter-intuitive figures imply that there is something wrong with the
current batching mechanism. Probably we need to fine-tune the batching
behavior for network and play with event pointers in the ring (actually I'm
looking into it now). It would be good to have some input on this.

Konrad, IIRC you once mentioned you discovered something with event
notification, what's that?

To all, any thoughts?


Wei.
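[Editor's illustration: the instrumentation implied by the calculation above
is just a pair of counters in the netback Tx path. A minimal sketch follows;
the counter names total_slots and tx_count come from the description, the
helper names are mine, and this is not the actual patch.]

  /* Sketch only: counters for measuring batching as described above. */
  static unsigned long total_slots;  /* slots consumed since start of day */
  static unsigned long tx_count;     /* number of tx_action invocations   */

  static void account_batch(unsigned int slots_this_run)
  {
          tx_count++;
          total_slots += slots_this_run;
  }

  /* Read out at the end of the run: */
  static unsigned long avg_slots_per_tx(void)
  {
          return tx_count ? total_slots / tx_count : 0;
  }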
Konrad Rzeszutek Wilk
2013-Jun-14 18:53 UTC
Re: Interesting observation with network event notification and batching
On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> Hi all
>
> I'm hacking on netback trying to identify whether TLB flushes cause a
> heavy performance penalty on the Tx path. The hack is quite nasty (you
> would not want to know, trust me).
>
> Basically what is doesn't is, 1) alter the network protocol to pass along

You probably meant: "what it does" ?

> mfns instead of grant references, 2) when the backend sees a new mfn, map
> it RO and cache it in its own address space.
>
> With this hack we now have some sort of zero-copy TX path. The backend
> doesn't need to issue any grant copy / map operation any more. When it
> sees a new packet in the ring, it just needs to pick up the pages in its
> own address space, assemble packets with those pages and pass the packets
> on to the network stack.

Uh, so not sure I understand the RO part. If dom0 is mapping it won't that
trigger a PTE update? And doesn't somebody (either the guest or initial
domain) do a grant mapping to let the hypervisor know it is OK to map a
grant?

Or is dom0 actually permitted to map the MFN of any guest without using the
grants? In which case you are then using _PAGE_IOMAP somewhere and setting
up vmap entries with the MFNs that point to the foreign domain - I think?

> In theory this should boost performance, but in practice it is the other
> way around. This hack makes Xen networking more than 50% slower than
> before (OMG). Further investigation shows that with this hack the batching
> ability is gone. Before this hack, netback batches around 64 slots per

That is quite interesting.

> interrupt event; after this hack it only batches 3 slots per interrupt
> event -- that's no batching at all, because we can expect one packet to
> occupy 3 slots.

Right.

> Time to have some figures (iperf from DomU to Dom0).
> [...]
> After the hack, adding 1024 HYPERVISOR_xen_version calls in the Tx path,
> throughput: 4.4 Gb/s, average slots per batch 25.

How do you get it to do more HYPERVISOR_xen_version? Did you just add a
(for i = 1024; i > 0; i--) hypervisor_yield(); in netback?

> Average slots per batch is calculated as follows:
> [...]
> The counter-intuitive figures imply that there is something wrong with the
> current batching mechanism. Probably we need to fine-tune the batching
> behavior for network and play with event pointers in the ring (actually
> I'm looking into it now). It would be good to have some input on this.

I am still unsure I understand how your changes would incur more of the
yields.

> Konrad, IIRC you once mentioned you discovered something with event
> notification, what's that?

They were bizarre. I naively expected some form of # of physical NIC
interrupts to be around the same as the VIF or less. And I figured that the
amount of interrupts would be constant regardless of the size of the
packets. In other words #packets == #interrupts.

In reality the number of interrupts the VIF had was about the same while
for the NIC it would fluctuate. (I can't remember the details).

But it was odd and I didn't go deeper into it to figure out what was
happening. And also to figure out if for the VIF we could do something of
#packets != #interrupts. And hopefully some mechanism to adjust so that the
amount of interrupts would be lesser per packet (hand waving here).
Wei Liu
2013-Jun-16 09:54 UTC
Re: Interesting observation with network event notification and batching
On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> > [...]
> > Basically what is doesn't is, 1) alter the network protocol to pass along
>
> You probably meant: "what it does" ?
>

Oh yes! Muscle memory got me!

> > mfns instead of grant references, 2) when the backend sees a new mfn,
> > map it RO and cache it in its own address space.
> > [...]
>
> Uh, so not sure I understand the RO part. If dom0 is mapping it won't that
> trigger a PTE update? And doesn't somebody (either the guest or initial
> domain) do a grant mapping to let the hypervisor know it is OK to map a
> grant?
>

It is very easy to issue HYPERVISOR_mmu_update to alter Dom0's mapping,
because Dom0 is privileged.

> Or is dom0 actually permitted to map the MFN of any guest without using
> the grants? In which case you are then using _PAGE_IOMAP somewhere and
> setting up vmap entries with the MFNs that point to the foreign domain -
> I think?
>

Sort of, but I didn't use vmap, I used alloc_page to get actual pages. Then
I modified the underlying PTE to point to the MFN from netfront.

> > [...]
> > After the hack, adding 1024 HYPERVISOR_xen_version calls in the Tx path,
> > throughput: 4.4 Gb/s, average slots per batch 25.
>
> How do you get it to do more HYPERVISOR_xen_version? Did you just add a
> (for i = 1024; i > 0; i--) hypervisor_yield();

for (i = 0; i < X; i++)
        (void)HYPERVISOR_xen_version(0, NULL);

> in netback?
>
> > Average slots per batch is calculated as follows:
> > [...]
>
> I am still unsure I understand how your changes would incur more of the
> yields.

It's not yielding. At least that's not the purpose of that hypercall.
HYPERVISOR_xen_version(0, NULL) only does guest -> hypervisor -> guest
context switching. The original purpose of HYPERVISOR_xen_version(0, NULL)
is to force the guest to check pending events.

Since you mentioned yielding, I will also try yielding and post figures.

> > Konrad, IIRC you once mentioned you discovered something with event
> > notification, what's that?
>
> They were bizarre. I naively expected some form of # of physical NIC
> interrupts to be around the same as the VIF or less. And I figured that
> the amount of interrupts would be constant regardless of the size of the
> packets. In other words #packets == #interrupts.
>

It could be that the frontend notifies the backend for every packet it
sends. This is not desirable and I don't expect the ring to behave that way.

> In reality the number of interrupts the VIF had was about the same while
> for the NIC it would fluctuate. (I can't remember the details).
>

I'm not sure I understand you here. But for the NIC, if you see the number
of interrupts go from high to low that's expected. When the NIC has a very
high interrupt rate it turns to polling mode.

> But it was odd and I didn't go deeper into it to figure out what was
> happening. And also to figure out if for the VIF we could do something of
> #packets != #interrupts. And hopefully some mechanism to adjust so that
> the amount of interrupts would be lesser per packet (hand waving here).

I'm trying to do this now.


Wei.
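[Editor's illustration: the alloc_page-plus-PTE-rewrite approach described
above might look roughly like the sketch below. This is a reading of the
description, not Wei's actual code, and it uses HYPERVISOR_update_va_mapping
rather than a raw HYPERVISOR_mmu_update for brevity.]

  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <xen/interface/xen.h>
  #include <asm/xen/hypercall.h>
  #include <asm/xen/page.h>

  /* Sketch only: map a guest MFN read-only into dom0's address space by
   * rewriting the PTE of a locally allocated page. */
  static void *map_guest_mfn_ro(unsigned long mfn)
  {
          struct page *page = alloc_page(GFP_KERNEL);
          unsigned long va;

          if (!page)
                  return NULL;

          va = (unsigned long)page_address(page);
          /* Point the kernel mapping of this page at the foreign MFN,
           * read-only.  Dom0 is privileged, so no grant is involved. */
          if (HYPERVISOR_update_va_mapping(va,
                                           mfn_pte(mfn, PAGE_KERNEL_RO),
                                           UVMF_INVLPG)) {
                  __free_page(page);
                  return NULL;
          }
          return (void *)va;
  }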
Wei Liu
2013-Jun-16 12:46 UTC
Re: Interesting observation with network event notification and batching
On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
[...]
> How do you get it to do more HYPERVISOR_xen_version? Did you just add a
> (for i = 1024; i > 0; i--) hypervisor_yield();
>

Here are the figures with HYPERVISOR_xen_version(0, NULL) replaced by
HYPERVISOR_sched_op(SCHEDOP_yield, NULL):

64 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 5.15 Gb/s, average
slots per tx 25

128 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 7.75 Gb/s, average
slots per tx 26

512 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 1.74 Gb/s, average
slots per tx 18

1024 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 998 Mb/s, average
slots per tx 18

Please note that Dom0 and DomU run on different PCPUs.

I think this kind of behavior has something to do with the scheduler. But
down at the bottom we should really fix the notification mechanism.


Wei.
Ian Campbell
2013-Jun-17 09:38 UTC
Re: Interesting observation with network event notification and batching
On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > > Konrad, IIRC you once mentioned you discovered something with event
> > > notification, what's that?
> >
> > They were bizarre. I naively expected some form of # of physical NIC
> > interrupts to be around the same as the VIF or less. And I figured that
> > the amount of interrupts would be constant regardless of the size of the
> > packets. In other words #packets == #interrupts.
> >
>
> It could be that the frontend notifies the backend for every packet it
> sends. This is not desirable and I don't expect the ring to behave that
> way.

It is probably worth checking that things are working how we think they
should, i.e. that netback's calls to RING_FINAL_CHECK_FOR_... and netfront's
calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at suitable points to
maximise batching.

Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
loop right? This would push the req_event pointer to just after the last
request, meaning the next request enqueued by the frontend would cause a
notification -- even though the backend is actually still continuing to
process requests and would have picked up that packet without further
notification. In this case there is a fair bit of work left in the backend
for this iteration, i.e. plenty of opportunity for the frontend to queue
more requests.

The comments in ring.h say:
 * These macros will set the req_event/rsp_event field to trigger a
 * notification on the very next message that is enqueued. If you want to
 * create batches of work (i.e., only receive a notification after several
 * messages have been enqueued) then you will need to create a customised
 * version of the FINAL_CHECK macro in your own code, which sets the event
 * field appropriately.

Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop (and
other similar loops) and add a FINAL check at the very end?

> > But it was odd and I didn't go deeper into it to figure out what was
> > happening. And also to figure out if for the VIF we could do something
> > of #packets != #interrupts. And hopefully some mechanism to adjust so
> > that the amount of interrupts would be lesser per packet (hand waving
> > here).
>
> I'm trying to do this now.

What scheme do you have in mind?
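[Editor's illustration: roughly what the restructuring suggested above might
look like against the netback Tx loop. Names follow the 3.x netback (vif->tx
is the TX back ring); this is a sketch, not a tested patch.]

  /* Sketch only: consume requests without touching req_event in the hot
   * loop, and only do the FINAL check (which advances req_event) once
   * netback has genuinely run out of work. */
  static void tx_build_gops_sketch(struct xenvif *vif)
  {
          int work_to_do;

          for (;;) {
                  /* Fast path: just look at req_prod, leave req_event
                   * alone. */
                  while (RING_HAS_UNCONSUMED_REQUESTS(&vif->tx)) {
                          /* ... consume one request here (elided),
                           * advancing vif->tx.req_cons and building the
                           * grant op for it ... */
                  }

                  /* Ring looks empty: now set req_event and re-check.
                   * Only if it is still empty do we ask the frontend to
                   * notify us again. */
                  RING_FINAL_CHECK_FOR_REQUESTS(&vif->tx, work_to_do);
                  if (!work_to_do)
                          break;
          }
  }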
Andrew Bennieston
2013-Jun-17 09:56 UTC
Re: Interesting observation with network event notification and batching
On 17/06/13 10:38, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>> [...]
>> It could be that the frontend notifies the backend for every packet it
>> sends. This is not desirable and I don't expect the ring to behave that
>> way.

I have observed this kind of behaviour during network performance tests in
which I periodically checked the ring state during an iperf session. It
looked to me like the frontend was sending notifications far too often, but
that the backend was sending them very infrequently, so the Tx (from guest)
ring was mostly empty and the Rx (to guest) ring was mostly full. This has
the effect of both front- and backend having to block occasionally, waiting
for the other end to clear or fill a ring, even though there is more data
available.

My initial theory was that this was caused in part by the shared event
channel; however, I expect that Wei is testing on top of a kernel with his
split event channel features?

> It is probably worth checking that things are working how we think they
> should, i.e. that netback's calls to RING_FINAL_CHECK_FOR_... and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at suitable
> points to maximise batching.
> [...]
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop (and
> other similar loops) and add a FINAL check at the very end?
>
>> [...]
>> I'm trying to do this now.
>
> What scheme do you have in mind?

As I mentioned above, filling a ring completely appears to be almost as bad
as sending too many notifications. The ideal scheme may involve trying to
balance the ring at some "half-full" state, depending on the capacity of the
front- and backends to process requests and responses.

Andrew.
Jan Beulich
2013-Jun-17 10:06 UTC
Re: Interesting observation with network event notification and batching
>>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>> [...]
> [...]
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?

But then again the macro doesn't update req_event when there are unconsumed
requests already upon entry to the macro.

Jan
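[Editor's note: for reference, the macro in question, paraphrased from
xen's public io/ring.h -- the early break is the behaviour Jan refers to.]

  /* Paraphrased from xen/include/public/io/ring.h: req_event is only
   * advanced when the ring appears empty on entry; if there are already
   * unconsumed requests the macro returns without touching it. */
  #define RING_FINAL_CHECK_FOR_REQUESTS(_r, _work_to_do) do {     \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);           \
      if (_work_to_do) break;                                     \
      (_r)->sring->req_event = (_r)->req_cons + 1;                \
      mb();                                                       \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);           \
  } while (0)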
Ian Campbell
2013-Jun-17 10:16 UTC
Re: Interesting observation with network event notification and batching
On Mon, 2013-06-17 at 11:06 +0100, Jan Beulich wrote:
> >>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > [...]
> > Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> > (and other similar loops) and add a FINAL check at the very end?
>
> But then again the macro doesn't update req_event when there are
> unconsumed requests already upon entry to the macro.

My concern was that when we process the last request currently on the ring
we immediately set it forward, even though netback goes on to do a bunch
more work (including e.g. the grant copies) before looping back and looking
for more work. That's a potentially large window for the frontend to enqueue
and then needlessly notify a new packet. It could potentially lead to a
pathological case of notifying every packet unnecessarily.

Ian.
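[Editor's note: the frontend side of that interaction, again paraphrased
from io/ring.h -- the producer only notifies if req_event falls within the
range of requests it has just pushed, so an eagerly advanced req_event
translates directly into an extra notification.]

  /* Paraphrased from xen/include/public/io/ring.h: _notify ends up true
   * whenever req_event lies in (old_prod, new_prod], i.e. whenever the
   * consumer asked to be told about one of the requests just pushed. */
  #define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {   \
      RING_IDX __old = (_r)->sring->req_prod;                     \
      RING_IDX __new = (_r)->req_prod_pvt;                        \
      wmb(); /* backend sees requests before the producer index */ \
      (_r)->sring->req_prod = __new;                              \
      mb();  /* backend sees new requests before we read req_event */ \
      (_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <   \
                   (RING_IDX)(__new - __old));                    \
  } while (0)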
Wei Liu
2013-Jun-17 10:35 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > [...]
> [...]
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop (and
> other similar loops) and add a FINAL check at the very end?
>
> > > But it was odd and I didn't go deeper into it to figure out what was
> > > happening. And also to figure out if for the VIF we could do something
> > > of #packets != #interrupts. And hopefully some mechanism to adjust so
> > > that the amount of interrupts would be lesser per packet (hand waving
> > > here).
> >
> > I'm trying to do this now.
>
> What scheme do you have in mind?

Basically the one you mentioned above.

Playing with various event pointers now.


Wei.
Wei Liu
2013-Jun-17 10:46 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> On 17/06/13 10:38, Ian Campbell wrote:
> > [...]
>
> I have observed this kind of behaviour during network performance tests in
> which I periodically checked the ring state during an iperf session. It
> looked to me like the frontend was sending notifications far too often,
> but that the backend was sending them very infrequently, so the Tx (from
> guest) ring was mostly empty and the Rx (to guest) ring was mostly full.
> This has the effect of both front- and backend having to block
> occasionally, waiting for the other end to clear or fill a ring, even
> though there is more data available.
>
> My initial theory was that this was caused in part by the shared event
> channel; however, I expect that Wei is testing on top of a kernel with his
> split event channel features?
>

Yes, with split event channels.

And during the tests the interrupt counts differ hugely: the frontend TX
interrupt count is a six-figure number while the frontend RX count is a
two-figure number.

> [...]
>
> As I mentioned above, filling a ring completely appears to be almost as
> bad as sending too many notifications. The ideal scheme may involve trying
> to balance the ring at some "half-full" state, depending on the capacity
> of the front- and backends to process requests and responses.
>

I don't think filling the ring full causes any problem, that's conceptually
just the same as the "half-full" state if you need to throttle the ring. The
real problem is how to do notifications correctly.


Wei.

> Andrew.
Andrew Bennieston
2013-Jun-17 10:56 UTC
Re: Interesting observation with network event notification and batching
On 17/06/13 11:46, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> > [...]
> > As I mentioned above, filling a ring completely appears to be almost as
> > bad as sending too many notifications. The ideal scheme may involve
> > trying to balance the ring at some "half-full" state, depending on the
> > capacity of the front- and backends to process requests and responses.
>
> I don't think filling the ring full causes any problem, that's
> conceptually just the same as the "half-full" state if you need to
> throttle the ring.

My understanding was that filling the ring will cause the producer to sleep
until slots become available (i.e. until the consumer notifies it that it
has removed something from the ring).

I'm just concerned that overly aggressive batching may lead to a situation
where the consumer is sitting idle, waiting for a notification that the
producer hasn't yet sent because it can still fill more slots on the ring.
When the ring is completely full, the producer would have to wait for the
ring to partially empty. At this point, the consumer would hold off
notifying because it can still batch more processing, so the producer is
left waiting. (Repeat as required.) It would be better to have both producer
and consumer running concurrently.

I mention this mainly so that we don't end up with a swing to the polar
opposite of what we have now, which (to my mind) is just as bad. Clearly
this is an edge case, but if there's a reason I'm missing that this can't
happen (e.g. after a period of inactivity) then don't hesitate to point it
out :)

(Perhaps "half-full" was misleading... the optimal state may be "just enough
room for one more packet", or something along those lines...)

Andrew
Ian Campbell
2013-Jun-17 11:08 UTC
Re: Interesting observation with network event notification and batching
On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
> On 17/06/13 11:46, Wei Liu wrote:
> > [...]
> > I don't think filling the ring full causes any problem, that's
> > conceptually just the same as the "half-full" state if you need to
> > throttle the ring.
>
> My understanding was that filling the ring will cause the producer to
> sleep until slots become available (i.e. until the consumer notifies it
> that it has removed something from the ring).
>
> I'm just concerned that overly aggressive batching may lead to a situation
> where the consumer is sitting idle, waiting for a notification that the
> producer hasn't yet sent because it can still fill more slots on the ring.
> When the ring is completely full, the producer would have to wait for the
> ring to partially empty. At this point, the consumer would hold off
> notifying because it can still batch more processing, so the producer is
> left waiting. (Repeat as required.) It would be better to have both
> producer and consumer running concurrently.
>
> I mention this mainly so that we don't end up with a swing to the polar
> opposite of what we have now, which (to my mind) is just as bad. Clearly
> this is an edge case, but if there's a reason I'm missing that this can't
> happen (e.g. after a period of inactivity) then don't hesitate to point it
> out :)

Doesn't the separation between req_event and rsp_event help here?

So if the producer fills the ring, it will sleep, but set rsp_event
appropriately so that when the backend completes some (but not all) work it
will be woken up and can put extra stuff on the ring.

It shouldn't need to wait for the backend to process the whole batch for
this.

> (Perhaps "half-full" was misleading... the optimal state may be "just
> enough room for one more packet", or something along those lines...)
>
> Andrew
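[Editor's illustration: netfront's TX completion path already does something
along these lines. Paraphrased from memory of xennet_tx_buf_gc(), not an
exact copy: after consuming TX responses it sets rsp_event roughly half-way
into the still-outstanding requests rather than at the very next response.]

  RING_IDX cons, prod;

  do {
          prod = np->tx.sring->rsp_prod;
          rmb();  /* ensure we see responses up to prod */

          for (cons = np->tx.rsp_cons; cons != prod; cons++) {
                  /* ... free the skb / grant for this TX response ... */
          }
          np->tx.rsp_cons = prod;

          /* Ask to be notified again only once about half of the
           * in-flight requests have completed. */
          np->tx.sring->rsp_event =
                  prod + ((np->tx.sring->req_prod - prod) >> 1) + 1;
          mb();
  } while ((cons == prod) && (prod != np->tx.sring->rsp_prod));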
annie li
2013-Jun-17 11:34 UTC
Re: Interesting observation with network event notification and batching
On 2013-6-17 18:35, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
>> [...]
>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>> loop right? This would push the req_event pointer to just after the last
>> request, meaning the next request enqueued by the frontend would cause a
>> notification -- even though the backend is actually still continuing to
>> process requests and would have picked up that packet without further
>> notification. In this case there is a fair bit of work left in the
>> backend for this iteration, i.e. plenty of opportunity for the frontend
>> to queue more requests.
>> [...]
>> What scheme do you have in mind?
>
> Basically the one you mentioned above.
>
> Playing with various event pointers now.

Did you collect data on how many requests netback processes when req_event
is updated in RING_FINAL_CHECK_FOR_REQUESTS? I assume this value is pretty
small from your test results.

How about not updating req_event every time there is no unconsumed request
in RING_FINAL_CHECK_FOR_REQUESTS?

Thanks
Annie
Andrew Bennieston
2013-Jun-17 11:55 UTC
Re: Interesting observation with network event notification and batching
On 17/06/13 12:08, Ian Campbell wrote:
> On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
>> [...]
>> I mention this mainly so that we don't end up with a swing to the polar
>> opposite of what we have now, which (to my mind) is just as bad. Clearly
>> this is an edge case, but if there's a reason I'm missing that this can't
>> happen (e.g. after a period of inactivity) then don't hesitate to point
>> it out :)
>
> Doesn't the separation between req_event and rsp_event help here?
>
> So if the producer fills the ring, it will sleep, but set rsp_event
> appropriately so that when the backend completes some (but not all) work
> it will be woken up and can put extra stuff on the ring.
>
> It shouldn't need to wait for the backend to process the whole batch for
> this.

Right. As long as this logic doesn't get inadvertently changed in an attempt
to improve batching of events!
Wei Liu
2013-Jun-28 16:15 UTC
Re: Interesting observation with network event notification and batching
Hi all,

After collecting more stats and comparing the copying / mapping cases, I now
have some more interesting findings, which might contradict what I said
before.

I tuned the runes I used for the benchmark to make sure iperf and netperf
generate large packets (~64K). Here are the runes I use:

  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k    (see note)
  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072

                          COPY      MAP
 iperf   Tput:            6.5Gb/s   14Gb/s (was 2.5Gb/s)
         PPI              2.90      1.07
         SPI              37.75     13.69
         PPN              2.90      1.07
         SPN              37.75     13.69
         tx_count         31808     174769
         nr_napi_schedule 31805     174697
         total_packets    92354     187408
         total_reqs       1200793   2392614

 netperf Tput:            5.8Gb/s   10.5Gb/s
         PPI              2.13      1.00
         SPI              36.70     16.73
         PPN              2.13      1.31
         SPN              36.70     16.75
         tx_count         57635     205599
         nr_napi_schedule 57633     205311
         total_packets    122800    270254
         total_reqs       2115068   3439751

 PPI: packets processed per interrupt
 SPI: slots processed per interrupt
 PPN: packets processed per napi schedule
 SPN: slots processed per napi schedule
 tx_count: interrupt count
 total_reqs: total slots used during test

* Notification and batching

Are notification and batching really a problem? I'm not so sure now. My
first thought, before I measured PPI / PPN / SPI / SPN in the copying case,
was that "in that case netback *must* have better batching", which turned
out not to be very true -- copying mode makes netback slower, but the
batching gained is not huge.

Ideally we still want to batch as much as possible. Possible ways include
playing with the 'weight' parameter in NAPI. But as the figures show,
batching seems not to be very important for throughput, at least for now.
If the NAPI framework and netfront / netback are doing their jobs as
designed we might not need to worry about this now.

Andrew, do you have any thoughts on this? You found out that NAPI didn't
scale well with multi-threaded iperf in DomU, do you have any handle on how
that can happen?

* Thoughts on zero-copy TX

With this hack we are able to achieve 10Gb/s with a single stream, which is
good. But with the classic XenoLinux kernel, which has zero-copy TX, we
weren't able to achieve this. I also developed another zero-copy netback
prototype one year ago with Ian's out-of-tree skb frag destructor patch
series. That prototype couldn't achieve 10Gb/s either (IIRC the performance
was more or less the same as copying mode, about 6~7Gb/s).

My hack maps all necessary pages permanently, there is no unmap; we skip
lots of page table manipulation and TLB flushes. So my basic conclusion is
that page table manipulation and TLB flushes do incur a heavy performance
penalty.

This hack cannot be upstreamed in any way. If we're to re-introduce
zero-copy TX, we would need to implement some sort of lazy flushing
mechanism. I haven't thought this through. Presumably this mechanism would
also benefit blk somehow? I'm not sure yet.

Could persistent mapping (with the to-be-developed reclaim / MRU list
mechanism) be useful here? So that we can unify the blk and net drivers?

* Changes required to introduce zero-copy TX

1. SKB frag destructor series: to track the life cycle of SKB frags. This
   is not yet upstreamed.

2. Mechanism to negotiate the max slots the frontend can use: mapping
   requires the backend's MAX_SKB_FRAGS >= the frontend's MAX_SKB_FRAGS.

3. Lazy flushing mechanism or persistent grants: ???


Wei.

* Note

In my previous tests I only ran iperf and didn't have the right rune to
generate large packets. Iperf seems to have a behavior of increasing the
packet size as time goes by. In the copying case the packet size was
increased to 64K eventually, while in the mapping case something odd
happened (I believe that must be due to a bug in my hack :-/) -- the packet
size was always the default size (8K). Adding '-l 131072' to iperf makes
sure that the packet is always 64K.
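[Editor's aside on the metric definitions above: the derived figures are
plain ratios of the raw counters. A quick standalone check, using the
COPY/iperf column (this snippet is illustrative only and not from the
thread):]

  #include <stdio.h>

  int main(void)
  {
          /* Raw counters from the COPY/iperf column of the table above. */
          double tx_count = 31808, nr_napi_schedule = 31805;
          double total_packets = 92354, total_reqs = 1200793;

          /* Prints: PPI 2.90 SPI 37.75 PPN 2.90 SPN 37.75 */
          printf("PPI %.2f SPI %.2f PPN %.2f SPN %.2f\n",
                 total_packets / tx_count, total_reqs / tx_count,
                 total_packets / nr_napi_schedule,
                 total_reqs / nr_napi_schedule);
          return 0;
  }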
annie li
2013-Jul-01 07:48 UTC
Re: Interesting observation with network event notification and batching
On 2013-6-29 0:15, Wei Liu wrote:> Hi all, > > After collecting more stats and comparing copying / mapping cases, I now > have some more interesting finds, which might contradict what I said > before. > > I tuned the runes I used for benchmark to make sure iperf and netperf > generate large packets (~64K). Here are the runes I use: > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > COPY MAP > iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets?> PPI 2.90 1.07 > SPI 37.75 13.69 > PPN 2.90 1.07 > SPN 37.75 13.69 > tx_count 31808 174769Seems interrupt count does not affect the performance at all with -l 131072 -w 128k.> nr_napi_schedule 31805 174697 > total_packets 92354 187408 > total_reqs 1200793 2392614 > > netperf Tput: 5.8Gb/s 10.5Gb/s > PPI 2.13 1.00 > SPI 36.70 16.73 > PPN 2.13 1.31 > SPN 36.70 16.75 > tx_count 57635 205599 > nr_napi_schedule 57633 205311 > total_packets 122800 270254 > total_reqs 2115068 3439751 > > PPI: packets processed per interrupt > SPI: slots processed per interrupt > PPN: packets processed per napi schedule > SPN: slots processed per napi schedule > tx_count: interrupt count > total_reqs: total slots used during test > > * Notification and batching > > Is notification and batching really a problem? I''m not so sure now. My > first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > case was that "in that case netback *must* have better batching" which > turned out not very true -- copying mode makes netback slower, however > the batching gained is not hugh. > > Ideally we still want to batch as much as possible. Possible way > includes playing with the ''weight'' parameter in NAPI. But as the figures > show batching seems not to be very important for throughput, at least > for now. If the NAPI framework and netfront / netback are doing their > jobs as designed we might not need to worry about this now. > > Andrew, do you have any thought on this? You found out that NAPI didn''t > scale well with multi-threaded iperf in DomU, do you have any handle how > that can happen? > > * Thoughts on zero-copy TX > > With this hack we are able to achieve 10Gb/s single stream, which is > good. But, with classic XenoLinux kernel which has zero copy TX we > didn''t able to achieve this. I also developed another zero copy netback > prototype one year ago with Ian''s out-of-tree skb frag destructor patch > series. That prototype couldn''t achieve 10Gb/s either (IIRC the > performance was more or less the same as copying mode, about 6~7Gb/s). > > My hack maps all necessary pages permantently, there is no unmap, we > skip lots of page table manipulation and TLB flushes. So my basic > conclusion is that page table manipulation and TLB flushes do incur > heavy performance penalty. > > This hack can be upstreamed in no way. If we''re to re-introduce > zero-copy TX, we would need to implement some sort of lazy flushing > mechanism. I haven''t thought this through. Presumably this mechanism > would also benefit blk somehow? I''m not sure yet. > > Could persistent mapping (with the to-be-developed reclaim / MRU list > mechanism) be useful here? So that we can unify blk and net drivers? > > * Changes required to introduce zero-copy TX > > 1. SKB frag destructor series: to track life cycle of SKB frags. 
This is > not yet upstreamed. Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?> > 2. Mechanism to negotiate max slots frontend can use: mapping requires > backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > 3. Lazy flushing mechanism or persistent grants: ??? I did some tests with persistent grants before; they did not show better performance than grant copy. But I was using the default params of netperf and did not try large packet sizes. Your results remind me that persistent grants might get similar results with larger packet sizes too. Thanks Annie
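For reference, the PPI / SPI / PPN / SPN figures in the table quoted above follow directly from the raw counters. A minimal Python sketch of that arithmetic (a reconstruction for illustration, not code from this thread):

def batching_stats(total_packets, total_reqs, tx_count, nr_napi_schedule):
    # Packets / slots per interrupt and per NAPI schedule, as defined above.
    return {
        "PPI": total_packets / tx_count,
        "SPI": total_reqs / tx_count,
        "PPN": total_packets / nr_napi_schedule,
        "SPN": total_reqs / nr_napi_schedule,
    }

for name, counters in (
    ("iperf COPY", dict(total_packets=92354, total_reqs=1200793,
                        tx_count=31808, nr_napi_schedule=31805)),
    ("iperf MAP", dict(total_packets=187408, total_reqs=2392614,
                       tx_count=174769, nr_napi_schedule=174697)),
):
    stats = batching_stats(**counters)
    print(name, " ".join(f"{k}={v:.2f}" for k, v in stats.items()))
# Reproduces the iperf columns of the table above, to within rounding in
# the last digit.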
Wei Liu
2013-Jul-01 08:54 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:> > On 2013-6-29 0:15, Wei Liu wrote: > >Hi all, > > > >After collecting more stats and comparing copying / mapping cases, I now > >have some more interesting finds, which might contradict what I said > >before. > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > >generate large packets (~64K). Here are the runes I use: > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > COPY MAP > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > So with default iperf setting, copy is about 7.9G, and map is about > 2.5G? How about the result of netperf without large packets? >First question, yes. Second question, 5.8Gb/s. And I believe for the copying scheme without large packet the throuput is more or less the same.> > PPI 2.90 1.07 > > SPI 37.75 13.69 > > PPN 2.90 1.07 > > SPN 37.75 13.69 > > tx_count 31808 174769 > > Seems interrupt count does not affect the performance at all with -l > 131072 -w 128k. >Right.> > nr_napi_schedule 31805 174697 > > total_packets 92354 187408 > > total_reqs 1200793 2392614 > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > PPI 2.13 1.00 > > SPI 36.70 16.73 > > PPN 2.13 1.31 > > SPN 36.70 16.75 > > tx_count 57635 205599 > > nr_napi_schedule 57633 205311 > > total_packets 122800 270254 > > total_reqs 2115068 3439751 > > > > PPI: packets processed per interrupt > > SPI: slots processed per interrupt > > PPN: packets processed per napi schedule > > SPN: slots processed per napi schedule > > tx_count: interrupt count > > total_reqs: total slots used during test > > > >* Notification and batching > > > >Is notification and batching really a problem? I''m not so sure now. My > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > >case was that "in that case netback *must* have better batching" which > >turned out not very true -- copying mode makes netback slower, however > >the batching gained is not hugh. > > > >Ideally we still want to batch as much as possible. Possible way > >includes playing with the ''weight'' parameter in NAPI. But as the figures > >show batching seems not to be very important for throughput, at least > >for now. If the NAPI framework and netfront / netback are doing their > >jobs as designed we might not need to worry about this now. > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > >scale well with multi-threaded iperf in DomU, do you have any handle how > >that can happen? > > > >* Thoughts on zero-copy TX > > > >With this hack we are able to achieve 10Gb/s single stream, which is > >good. But, with classic XenoLinux kernel which has zero copy TX we > >didn''t able to achieve this. I also developed another zero copy netback > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > >performance was more or less the same as copying mode, about 6~7Gb/s). > > > >My hack maps all necessary pages permantently, there is no unmap, we > >skip lots of page table manipulation and TLB flushes. So my basic > >conclusion is that page table manipulation and TLB flushes do incur > >heavy performance penalty. > > > >This hack can be upstreamed in no way. If we''re to re-introduce > >zero-copy TX, we would need to implement some sort of lazy flushing > >mechanism. I haven''t thought this through. Presumably this mechanism > >would also benefit blk somehow? I''m not sure yet. 
> > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > >mechanism) be useful here? So that we can unify blk and net drivers? > > > >* Changes required to introduce zero-copy TX > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > >not yet upstreamed. > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> >Yes. But I believe there''s been several versions posted. The link you have is not the latest version.> > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > >3. Lazy flushing mechanism or persistent grants: ??? > > I did some test with persistent grants before, it did not show > better performance than grant copy. But I was using the default > params of netperf, and not tried large packet size. Your results > reminds me that maybe persistent grants would get similar results > with larger packet size too. >"No better performance" -- that''s because both mechanisms are copying? However I presume persistent grant can scale better? From an earlier email last week, I read that copying is done by the guest so that this mechanism scales much better than hypervisor copying in blk''s case. Wei.> Thanks > Annie >
Stefano Stabellini
2013-Jul-01 14:19 UTC
Re: Interesting observation with network event notification and batching
Could you please use plain text emails in the future? On Mon, 1 Jul 2013, annie li wrote:> On 2013-6-29 0:15, Wei Liu wrote: > > Hi all, > > After collecting more stats and comparing copying / mapping cases, I now > have some more interesting finds, which might contradict what I said > before. > > I tuned the runes I used for benchmark to make sure iperf and netperf > generate large packets (~64K). Here are the runes I use: > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > COPY MAP > iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets? > > PPI 2.90 1.07 > SPI 37.75 13.69 > PPN 2.90 1.07 > SPN 37.75 13.69 > tx_count 31808 174769 > > > Seems interrupt count does not affect the performance at all with -l 131072 -w 128k. > > nr_napi_schedule 31805 174697 > total_packets 92354 187408 > total_reqs 1200793 2392614 > > netperf Tput: 5.8Gb/s 10.5Gb/s > PPI 2.13 1.00 > SPI 36.70 16.73 > PPN 2.13 1.31 > SPN 36.70 16.75 > tx_count 57635 205599 > nr_napi_schedule 57633 205311 > total_packets 122800 270254 > total_reqs 2115068 3439751 > > PPI: packets processed per interrupt > SPI: slots processed per interrupt > PPN: packets processed per napi schedule > SPN: slots processed per napi schedule > tx_count: interrupt count > total_reqs: total slots used during test > > * Notification and batching > > Is notification and batching really a problem? I''m not so sure now. My > first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > case was that "in that case netback *must* have better batching" which > turned out not very true -- copying mode makes netback slower, however > the batching gained is not hugh. > > Ideally we still want to batch as much as possible. Possible way > includes playing with the ''weight'' parameter in NAPI. But as the figures > show batching seems not to be very important for throughput, at least > for now. If the NAPI framework and netfront / netback are doing their > jobs as designed we might not need to worry about this now. > > Andrew, do you have any thought on this? You found out that NAPI didn''t > scale well with multi-threaded iperf in DomU, do you have any handle how > that can happen? > > * Thoughts on zero-copy TX > > With this hack we are able to achieve 10Gb/s single stream, which is > good. But, with classic XenoLinux kernel which has zero copy TX we > didn''t able to achieve this. I also developed another zero copy netback > prototype one year ago with Ian''s out-of-tree skb frag destructor patch > series. That prototype couldn''t achieve 10Gb/s either (IIRC the > performance was more or less the same as copying mode, about 6~7Gb/s). > > My hack maps all necessary pages permantently, there is no unmap, we > skip lots of page table manipulation and TLB flushes. So my basic > conclusion is that page table manipulation and TLB flushes do incur > heavy performance penalty. > > This hack can be upstreamed in no way. If we''re to re-introduce > zero-copy TX, we would need to implement some sort of lazy flushing > mechanism. I haven''t thought this through. Presumably this mechanism > would also benefit blk somehow? I''m not sure yet. > > Could persistent mapping (with the to-be-developed reclaim / MRU list > mechanism) be useful here? So that we can unify blk and net drivers? > > * Changes required to introduce zero-copy TX > > 1. 
SKB frag destructor series: to track life cycle of SKB frags. This is > not yet upstreamed. > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > 2. Mechanism to negotiate max slots frontend can use: mapping requires > backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > 3. Lazy flushing mechanism or persistent grants: ??? > > > I did some test with persistent grants before, it did not show better performance than grant copy. But I was using the default > params of netperf, and not tried large packet size. Your results reminds me that maybe persistent grants would get similar > results with larger packet size too. > > Thanks > Annie > > >
Stefano Stabellini
2013-Jul-01 14:29 UTC
Re: Interesting observation with network event notification and batching
On Mon, 1 Jul 2013, Wei Liu wrote:> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: > > > > On 2013-6-29 0:15, Wei Liu wrote: > > >Hi all, > > > > > >After collecting more stats and comparing copying / mapping cases, I now > > >have some more interesting finds, which might contradict what I said > > >before. > > > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > > >generate large packets (~64K). Here are the runes I use: > > > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > > > COPY MAP > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > > So with default iperf setting, copy is about 7.9G, and map is about > > 2.5G? How about the result of netperf without large packets? > > > > First question, yes. > > Second question, 5.8Gb/s. And I believe for the copying scheme without > large packet the throuput is more or less the same. > > > > PPI 2.90 1.07 > > > SPI 37.75 13.69 > > > PPN 2.90 1.07 > > > SPN 37.75 13.69 > > > tx_count 31808 174769 > > > > Seems interrupt count does not affect the performance at all with -l > > 131072 -w 128k. > > > > Right. > > > > nr_napi_schedule 31805 174697 > > > total_packets 92354 187408 > > > total_reqs 1200793 2392614 > > > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > > PPI 2.13 1.00 > > > SPI 36.70 16.73 > > > PPN 2.13 1.31 > > > SPN 36.70 16.75 > > > tx_count 57635 205599 > > > nr_napi_schedule 57633 205311 > > > total_packets 122800 270254 > > > total_reqs 2115068 3439751 > > > > > > PPI: packets processed per interrupt > > > SPI: slots processed per interrupt > > > PPN: packets processed per napi schedule > > > SPN: slots processed per napi schedule > > > tx_count: interrupt count > > > total_reqs: total slots used during test > > > > > >* Notification and batching > > > > > >Is notification and batching really a problem? I''m not so sure now. My > > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > > >case was that "in that case netback *must* have better batching" which > > >turned out not very true -- copying mode makes netback slower, however > > >the batching gained is not hugh. > > > > > >Ideally we still want to batch as much as possible. Possible way > > >includes playing with the ''weight'' parameter in NAPI. But as the figures > > >show batching seems not to be very important for throughput, at least > > >for now. If the NAPI framework and netfront / netback are doing their > > >jobs as designed we might not need to worry about this now. > > > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > > >scale well with multi-threaded iperf in DomU, do you have any handle how > > >that can happen? > > > > > >* Thoughts on zero-copy TX > > > > > >With this hack we are able to achieve 10Gb/s single stream, which is > > >good. But, with classic XenoLinux kernel which has zero copy TX we > > >didn''t able to achieve this. I also developed another zero copy netback > > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > > >performance was more or less the same as copying mode, about 6~7Gb/s). > > > > > >My hack maps all necessary pages permantently, there is no unmap, we > > >skip lots of page table manipulation and TLB flushes. So my basic > > >conclusion is that page table manipulation and TLB flushes do incur > > >heavy performance penalty. > > > > > >This hack can be upstreamed in no way. 
If we''re to re-introduce > > >zero-copy TX, we would need to implement some sort of lazy flushing > > >mechanism. I haven''t thought this through. Presumably this mechanism > > >would also benefit blk somehow? I''m not sure yet. > > > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > > >mechanism) be useful here? So that we can unify blk and net drivers? > > > > > >* Changes required to introduce zero-copy TX > > > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > > >not yet upstreamed. > > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > > > > Yes. But I believe there''s been several versions posted. The link you > have is not the latest version. > > > > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > > > >3. Lazy flushing mechanism or persistent grants: ??? > > > > I did some test with persistent grants before, it did not show > > better performance than grant copy. But I was using the default > > params of netperf, and not tried large packet size. Your results > > reminds me that maybe persistent grants would get similar results > > with larger packet size too. > > > > "No better performance" -- that''s because both mechanisms are copying? > However I presume persistent grant can scale better? From an earlier > email last week, I read that copying is done by the guest so that this > mechanism scales much better than hypervisor copying in blk''s case. Yes, I always expected persistent grants to be faster than gnttab_copy but I was very surprised by the difference in performance: http://marc.info/?l=xen-devel&m=137234605929944 I think it''s worth trying persistent grants on PV network, although it''s very unlikely that they are going to improve the throughput by 5 Gb/s. Also once we have both PV block and network using persistent grants, we might run into the grant table limit, see this email: http://marc.info/?l=xen-devel&m=137183474618974
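To put the grant table concern in perspective, here is a back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (8-byte v1 grant entries, 4 KiB grant frames, a default of 32 grant frames per domain, single-page rings with one persistently granted page per slot); none of the figures are taken from this thread or the linked emails:

GRANT_ENTRY_SIZE = 8                              # bytes, grant_entry_v1 (assumed)
ENTRIES_PER_FRAME = 4096 // GRANT_ENTRY_SIZE      # 512 entries per 4 KiB frame
DEFAULT_GRANT_FRAMES = 32                         # assumed hypervisor default
total_grants = ENTRIES_PER_FRAME * DEFAULT_GRANT_FRAMES   # 16384 per domain

# Assumed cost per device if every ring slot keeps its page granted for good:
net_grants_per_vif = 256 + 256                    # single-page TX + RX rings
blk_grants_per_vbd = 32 * 11                      # 32 in-flight reqs x 11 segments

vifs, vbds = 4, 4                                 # hypothetical guest
used = vifs * net_grants_per_vif + vbds * blk_grants_per_vbd
print(f"{used} of {total_grants} grant entries pinned persistently "
      f"({100.0 * used / total_grants:.1f}%)")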
Wei Liu
2013-Jul-01 14:39 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:> On Mon, 1 Jul 2013, Wei Liu wrote: > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: > > > > > > On 2013-6-29 0:15, Wei Liu wrote: > > > >Hi all, > > > > > > > >After collecting more stats and comparing copying / mapping cases, I now > > > >have some more interesting finds, which might contradict what I said > > > >before. > > > > > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > > > >generate large packets (~64K). Here are the runes I use: > > > > > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > > > > > COPY MAP > > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > > > > So with default iperf setting, copy is about 7.9G, and map is about > > > 2.5G? How about the result of netperf without large packets? > > > > > > > First question, yes. > > > > Second question, 5.8Gb/s. And I believe for the copying scheme without > > large packet the throuput is more or less the same. > > > > > > PPI 2.90 1.07 > > > > SPI 37.75 13.69 > > > > PPN 2.90 1.07 > > > > SPN 37.75 13.69 > > > > tx_count 31808 174769 > > > > > > Seems interrupt count does not affect the performance at all with -l > > > 131072 -w 128k. > > > > > > > Right. > > > > > > nr_napi_schedule 31805 174697 > > > > total_packets 92354 187408 > > > > total_reqs 1200793 2392614 > > > > > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > > > PPI 2.13 1.00 > > > > SPI 36.70 16.73 > > > > PPN 2.13 1.31 > > > > SPN 36.70 16.75 > > > > tx_count 57635 205599 > > > > nr_napi_schedule 57633 205311 > > > > total_packets 122800 270254 > > > > total_reqs 2115068 3439751 > > > > > > > > PPI: packets processed per interrupt > > > > SPI: slots processed per interrupt > > > > PPN: packets processed per napi schedule > > > > SPN: slots processed per napi schedule > > > > tx_count: interrupt count > > > > total_reqs: total slots used during test > > > > > > > >* Notification and batching > > > > > > > >Is notification and batching really a problem? I''m not so sure now. My > > > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > > > >case was that "in that case netback *must* have better batching" which > > > >turned out not very true -- copying mode makes netback slower, however > > > >the batching gained is not hugh. > > > > > > > >Ideally we still want to batch as much as possible. Possible way > > > >includes playing with the ''weight'' parameter in NAPI. But as the figures > > > >show batching seems not to be very important for throughput, at least > > > >for now. If the NAPI framework and netfront / netback are doing their > > > >jobs as designed we might not need to worry about this now. > > > > > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > > > >scale well with multi-threaded iperf in DomU, do you have any handle how > > > >that can happen? > > > > > > > >* Thoughts on zero-copy TX > > > > > > > >With this hack we are able to achieve 10Gb/s single stream, which is > > > >good. But, with classic XenoLinux kernel which has zero copy TX we > > > >didn''t able to achieve this. I also developed another zero copy netback > > > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > > > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > > > >performance was more or less the same as copying mode, about 6~7Gb/s). 
> > > > > > > >My hack maps all necessary pages permantently, there is no unmap, we > > > >skip lots of page table manipulation and TLB flushes. So my basic > > > >conclusion is that page table manipulation and TLB flushes do incur > > > >heavy performance penalty. > > > > > > > >This hack can be upstreamed in no way. If we''re to re-introduce > > > >zero-copy TX, we would need to implement some sort of lazy flushing > > > >mechanism. I haven''t thought this through. Presumably this mechanism > > > >would also benefit blk somehow? I''m not sure yet. > > > > > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > > > >mechanism) be useful here? So that we can unify blk and net drivers? > > > > > > > >* Changes required to introduce zero-copy TX > > > > > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > > > >not yet upstreamed. > > > > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > > > > > > > Yes. But I believe there''s been several versions posted. The link you > > have is not the latest version. > > > > > > > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > > > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > > > > > >3. Lazy flushing mechanism or persistent grants: ??? > > > > > > I did some test with persistent grants before, it did not show > > > better performance than grant copy. But I was using the default > > > params of netperf, and not tried large packet size. Your results > > > reminds me that maybe persistent grants would get similar results > > > with larger packet size too. > > > > > > > "No better performance" -- that''s because both mechanisms are copying? > > However I presume persistent grant can scale better? From an earlier > > email last week, I read that copying is done by the guest so that this > > mechanism scales much better than hypervisor copying in blk''s case. > > Yes, I always expected persistent grants to be faster then > gnttab_copy but I was very surprised by the difference in performances: > > http://marc.info/?l=xen-devel&m=137234605929944 > > I think it''s worth trying persistent grants on PV network, although it''s > very unlikely that they are going to improve the throughput by 5 Gb/s. >I think it can improve aggregated throughput, however its not likely to improve single stream throughput.> Also once we have both PV block and network using persistent grants, > we might incur the grant table limit, see this email: > > http://marc.info/?l=xen-devel&m=137183474618974Yes, indeed.
Stefano Stabellini
2013-Jul-01 14:54 UTC
Re: Interesting observation with network event notification and batching
On Mon, 1 Jul 2013, Wei Liu wrote:> On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote: > > On Mon, 1 Jul 2013, Wei Liu wrote: > > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: > > > > > > > > On 2013-6-29 0:15, Wei Liu wrote: > > > > >Hi all, > > > > > > > > > >After collecting more stats and comparing copying / mapping cases, I now > > > > >have some more interesting finds, which might contradict what I said > > > > >before. > > > > > > > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > > > > >generate large packets (~64K). Here are the runes I use: > > > > > > > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > > > > > > > COPY MAP > > > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > > > > > > So with default iperf setting, copy is about 7.9G, and map is about > > > > 2.5G? How about the result of netperf without large packets? > > > > > > > > > > First question, yes. > > > > > > Second question, 5.8Gb/s. And I believe for the copying scheme without > > > large packet the throuput is more or less the same. > > > > > > > > PPI 2.90 1.07 > > > > > SPI 37.75 13.69 > > > > > PPN 2.90 1.07 > > > > > SPN 37.75 13.69 > > > > > tx_count 31808 174769 > > > > > > > > Seems interrupt count does not affect the performance at all with -l > > > > 131072 -w 128k. > > > > > > > > > > Right. > > > > > > > > nr_napi_schedule 31805 174697 > > > > > total_packets 92354 187408 > > > > > total_reqs 1200793 2392614 > > > > > > > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > > > > PPI 2.13 1.00 > > > > > SPI 36.70 16.73 > > > > > PPN 2.13 1.31 > > > > > SPN 36.70 16.75 > > > > > tx_count 57635 205599 > > > > > nr_napi_schedule 57633 205311 > > > > > total_packets 122800 270254 > > > > > total_reqs 2115068 3439751 > > > > > > > > > > PPI: packets processed per interrupt > > > > > SPI: slots processed per interrupt > > > > > PPN: packets processed per napi schedule > > > > > SPN: slots processed per napi schedule > > > > > tx_count: interrupt count > > > > > total_reqs: total slots used during test > > > > > > > > > >* Notification and batching > > > > > > > > > >Is notification and batching really a problem? I''m not so sure now. My > > > > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > > > > >case was that "in that case netback *must* have better batching" which > > > > >turned out not very true -- copying mode makes netback slower, however > > > > >the batching gained is not hugh. > > > > > > > > > >Ideally we still want to batch as much as possible. Possible way > > > > >includes playing with the ''weight'' parameter in NAPI. But as the figures > > > > >show batching seems not to be very important for throughput, at least > > > > >for now. If the NAPI framework and netfront / netback are doing their > > > > >jobs as designed we might not need to worry about this now. > > > > > > > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > > > > >scale well with multi-threaded iperf in DomU, do you have any handle how > > > > >that can happen? > > > > > > > > > >* Thoughts on zero-copy TX > > > > > > > > > >With this hack we are able to achieve 10Gb/s single stream, which is > > > > >good. But, with classic XenoLinux kernel which has zero copy TX we > > > > >didn''t able to achieve this. 
I also developed another zero copy netback > > > > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > > > > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > > > > >performance was more or less the same as copying mode, about 6~7Gb/s). > > > > > > > > > >My hack maps all necessary pages permantently, there is no unmap, we > > > > >skip lots of page table manipulation and TLB flushes. So my basic > > > > >conclusion is that page table manipulation and TLB flushes do incur > > > > >heavy performance penalty. > > > > > > > > > >This hack can be upstreamed in no way. If we''re to re-introduce > > > > >zero-copy TX, we would need to implement some sort of lazy flushing > > > > >mechanism. I haven''t thought this through. Presumably this mechanism > > > > >would also benefit blk somehow? I''m not sure yet. > > > > > > > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > > > > >mechanism) be useful here? So that we can unify blk and net drivers? > > > > > > > > > >* Changes required to introduce zero-copy TX > > > > > > > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > > > > >not yet upstreamed. > > > > > > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > > > > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > > > > > > > > > > Yes. But I believe there''s been several versions posted. The link you > > > have is not the latest version. > > > > > > > > > > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > > > > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > > > > > > > >3. Lazy flushing mechanism or persistent grants: ??? > > > > > > > > I did some test with persistent grants before, it did not show > > > > better performance than grant copy. But I was using the default > > > > params of netperf, and not tried large packet size. Your results > > > > reminds me that maybe persistent grants would get similar results > > > > with larger packet size too. > > > > > > > > > > "No better performance" -- that''s because both mechanisms are copying? > > > However I presume persistent grant can scale better? From an earlier > > > email last week, I read that copying is done by the guest so that this > > > mechanism scales much better than hypervisor copying in blk''s case. > > > > Yes, I always expected persistent grants to be faster then > > gnttab_copy but I was very surprised by the difference in performances: > > > > http://marc.info/?l=xen-devel&m=137234605929944 > > > > I think it''s worth trying persistent grants on PV network, although it''s > > very unlikely that they are going to improve the throughput by 5 Gb/s. > > > > I think it can improve aggregated throughput, however its not likely to > improve single stream throughput.you are probably right
annie li
2013-Jul-01 15:59 UTC
Re: Interesting observation with network event notification and batching
On 2013-7-1 16:54, Wei Liu wrote:> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: >> On 2013-6-29 0:15, Wei Liu wrote: >>> Hi all, >>> >>> After collecting more stats and comparing copying / mapping cases, I now >>> have some more interesting finds, which might contradict what I said >>> before. >>> >>> I tuned the runes I used for benchmark to make sure iperf and netperf >>> generate large packets (~64K). Here are the runes I use: >>> >>> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) >>> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 >>> >>> COPY MAP >>> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) >> So with default iperf setting, copy is about 7.9G, and map is about >> 2.5G? How about the result of netperf without large packets? >> > First question, yes. > > Second question, 5.8Gb/s. And I believe for the copying scheme without > large packet the throuput is more or less the same. > >>> PPI 2.90 1.07 >>> SPI 37.75 13.69 >>> PPN 2.90 1.07 >>> SPN 37.75 13.69 >>> tx_count 31808 174769 >> Seems interrupt count does not affect the performance at all with -l >> 131072 -w 128k. >> > Right. > >>> nr_napi_schedule 31805 174697 >>> total_packets 92354 187408 >>> total_reqs 1200793 2392614 >>> >>> netperf Tput: 5.8Gb/s 10.5Gb/s >>> PPI 2.13 1.00 >>> SPI 36.70 16.73 >>> PPN 2.13 1.31 >>> SPN 36.70 16.75 >>> tx_count 57635 205599 >>> nr_napi_schedule 57633 205311 >>> total_packets 122800 270254 >>> total_reqs 2115068 3439751 >>> >>> PPI: packets processed per interrupt >>> SPI: slots processed per interrupt >>> PPN: packets processed per napi schedule >>> SPN: slots processed per napi schedule >>> tx_count: interrupt count >>> total_reqs: total slots used during test >>> >>> * Notification and batching >>> >>> Is notification and batching really a problem? I''m not so sure now. My >>> first thought when I didn''t measure PPI / PPN / SPI / SPN in copying >>> case was that "in that case netback *must* have better batching" which >>> turned out not very true -- copying mode makes netback slower, however >>> the batching gained is not hugh. >>> >>> Ideally we still want to batch as much as possible. Possible way >>> includes playing with the ''weight'' parameter in NAPI. But as the figures >>> show batching seems not to be very important for throughput, at least >>> for now. If the NAPI framework and netfront / netback are doing their >>> jobs as designed we might not need to worry about this now. >>> >>> Andrew, do you have any thought on this? You found out that NAPI didn''t >>> scale well with multi-threaded iperf in DomU, do you have any handle how >>> that can happen? >>> >>> * Thoughts on zero-copy TX >>> >>> With this hack we are able to achieve 10Gb/s single stream, which is >>> good. But, with classic XenoLinux kernel which has zero copy TX we >>> didn''t able to achieve this. I also developed another zero copy netback >>> prototype one year ago with Ian''s out-of-tree skb frag destructor patch >>> series. That prototype couldn''t achieve 10Gb/s either (IIRC the >>> performance was more or less the same as copying mode, about 6~7Gb/s). >>> >>> My hack maps all necessary pages permantently, there is no unmap, we >>> skip lots of page table manipulation and TLB flushes. So my basic >>> conclusion is that page table manipulation and TLB flushes do incur >>> heavy performance penalty. >>> >>> This hack can be upstreamed in no way. If we''re to re-introduce >>> zero-copy TX, we would need to implement some sort of lazy flushing >>> mechanism. 
I haven''t thought this through. Presumably this mechanism >>> would also benefit blk somehow? I''m not sure yet. >>> >>> Could persistent mapping (with the to-be-developed reclaim / MRU list >>> mechanism) be useful here? So that we can unify blk and net drivers? >>> >>> * Changes required to introduce zero-copy TX >>> >>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is >>> not yet upstreamed. >> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? >> >> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> >> > Yes. But I believe there''s been several versions posted. The link you > have is not the latest version. > >>> 2. Mechanism to negotiate max slots frontend can use: mapping requires >>> backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. >>> >>> 3. Lazy flushing mechanism or persistent grants: ??? >> I did some test with persistent grants before, it did not show >> better performance than grant copy. But I was using the default >> params of netperf, and not tried large packet size. Your results >> reminds me that maybe persistent grants would get similar results >> with larger packet size too. >> > "No better performance" -- that''s because both mechanisms are copying? > However I presume persistent grant can scale better? From an earlier > email last week, I read that copying is done by the guest so that this > mechanism scales much better than hypervisor copying in blk''s case. The original persistent grant patch does memcpy on both the netback and netfront sides. I am thinking the performance might improve if the memcpy is removed from netfront. Moreover, I also have a feeling that our persistent grant results were based on tests with default netperf params, just like Wei''s hack, which does not show better performance without large packets. So let me try some tests with large packets. Thanks Annie
annie li
2013-Jul-01 15:59 UTC
Re: Interesting observation with network event notification and batching
On 2013-7-1 22:19, Stefano Stabellini wrote:> Could you please use plain text emails in the future?Sure, sorry about that. Thanks Annie> > On Mon, 1 Jul 2013, annie li wrote: >> On 2013-6-29 0:15, Wei Liu wrote: >> >> Hi all, >> >> After collecting more stats and comparing copying / mapping cases, I now >> have some more interesting finds, which might contradict what I said >> before. >> >> I tuned the runes I used for benchmark to make sure iperf and netperf >> generate large packets (~64K). Here are the runes I use: >> >> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) >> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 >> >> COPY MAP >> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) >> >> >> So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets? >> >> PPI 2.90 1.07 >> SPI 37.75 13.69 >> PPN 2.90 1.07 >> SPN 37.75 13.69 >> tx_count 31808 174769 >> >> >> Seems interrupt count does not affect the performance at all with -l 131072 -w 128k. >> >> nr_napi_schedule 31805 174697 >> total_packets 92354 187408 >> total_reqs 1200793 2392614 >> >> netperf Tput: 5.8Gb/s 10.5Gb/s >> PPI 2.13 1.00 >> SPI 36.70 16.73 >> PPN 2.13 1.31 >> SPN 36.70 16.75 >> tx_count 57635 205599 >> nr_napi_schedule 57633 205311 >> total_packets 122800 270254 >> total_reqs 2115068 3439751 >> >> PPI: packets processed per interrupt >> SPI: slots processed per interrupt >> PPN: packets processed per napi schedule >> SPN: slots processed per napi schedule >> tx_count: interrupt count >> total_reqs: total slots used during test >> >> * Notification and batching >> >> Is notification and batching really a problem? I''m not so sure now. My >> first thought when I didn''t measure PPI / PPN / SPI / SPN in copying >> case was that "in that case netback *must* have better batching" which >> turned out not very true -- copying mode makes netback slower, however >> the batching gained is not hugh. >> >> Ideally we still want to batch as much as possible. Possible way >> includes playing with the ''weight'' parameter in NAPI. But as the figures >> show batching seems not to be very important for throughput, at least >> for now. If the NAPI framework and netfront / netback are doing their >> jobs as designed we might not need to worry about this now. >> >> Andrew, do you have any thought on this? You found out that NAPI didn''t >> scale well with multi-threaded iperf in DomU, do you have any handle how >> that can happen? >> >> * Thoughts on zero-copy TX >> >> With this hack we are able to achieve 10Gb/s single stream, which is >> good. But, with classic XenoLinux kernel which has zero copy TX we >> didn''t able to achieve this. I also developed another zero copy netback >> prototype one year ago with Ian''s out-of-tree skb frag destructor patch >> series. That prototype couldn''t achieve 10Gb/s either (IIRC the >> performance was more or less the same as copying mode, about 6~7Gb/s). >> >> My hack maps all necessary pages permantently, there is no unmap, we >> skip lots of page table manipulation and TLB flushes. So my basic >> conclusion is that page table manipulation and TLB flushes do incur >> heavy performance penalty. >> >> This hack can be upstreamed in no way. If we''re to re-introduce >> zero-copy TX, we would need to implement some sort of lazy flushing >> mechanism. I haven''t thought this through. Presumably this mechanism >> would also benefit blk somehow? I''m not sure yet. 
>> >> Could persistent mapping (with the to-be-developed reclaim / MRU list >> mechanism) be useful here? So that we can unify blk and net drivers? >> >> * Changes required to introduce zero-copy TX >> >> 1. SKB frag destructor series: to track life cycle of SKB frags. This is >> not yet upstreamed. >> >> >> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? >> >> >> 2. Mechanism to negotiate max slots frontend can use: mapping requires >> backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. >> >> 3. Lazy flushing mechanism or persistent grants: ??? >> >> >> I did some test with persistent grants before, it did not show better performance than grant copy. But I was using the default >> params of netperf, and not tried large packet size. Your results reminds me that maybe persistent grants would get similar >> results with larger packet size too. >> >> Thanks >> Annie >> >> >>
Wei Liu
2013-Jul-01 16:06 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote: [...]> >>>1. SKB frag destructor series: to track life cycle of SKB frags. This is > >>>not yet upstreamed. > >>Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > >> > >><http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > >> > >Yes. But I believe there''s been several versions posted. The link you > >have is not the latest version. > > > >>>2. Mechanism to negotiate max slots frontend can use: mapping requires > >>>backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > >>> > >>>3. Lazy flushing mechanism or persistent grants: ??? > >>I did some test with persistent grants before, it did not show > >>better performance than grant copy. But I was using the default > >>params of netperf, and not tried large packet size. Your results > >>reminds me that maybe persistent grants would get similar results > >>with larger packet size too. > >> > >"No better performance" -- that''s because both mechanisms are copying? > >However I presume persistent grant can scale better? From an earlier > >email last week, I read that copying is done by the guest so that this > >mechanism scales much better than hypervisor copying in blk''s case. > > The original persistent patch does memcpy in both netback and > netfront side. I am thinking maybe the performance can become better > if removing the memcpy from netfront. I would say that removing copy in netback can scale better.> Moreover, I also have a feeling that we got persistent grant > performance based on default netperf params test, just like wei''s > hack which does not get better performance without large packets. So > let me try some test with large packets though. >Sadly enough, I found out today that this sort of test seems to be quite inconsistent. On an Intel 10G NIC the throughput is actually higher without forcing iperf / netperf to generate large packets. Wei.> Thanks > Annie
Andrew Bennieston
2013-Jul-01 16:53 UTC
Re: Interesting observation with network event notification and batching
On 01/07/13 17:06, Wei Liu wrote:> On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote: > [...] >>>>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is >>>>> not yet upstreamed. >>>> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? >>>> >>>> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> >>>> >>> Yes. But I believe there''s been several versions posted. The link you >>> have is not the latest version. >>> >>>>> 2. Mechanism to negotiate max slots frontend can use: mapping requires >>>>> backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. >>>>> >>>>> 3. Lazy flushing mechanism or persistent grants: ??? >>>> I did some test with persistent grants before, it did not show >>>> better performance than grant copy. But I was using the default >>>> params of netperf, and not tried large packet size. Your results >>>> reminds me that maybe persistent grants would get similar results >>>> with larger packet size too. >>>> >>> "No better performance" -- that''s because both mechanisms are copying? >>> However I presume persistent grant can scale better? From an earlier >>> email last week, I read that copying is done by the guest so that this >>> mechanism scales much better than hypervisor copying in blk''s case. >> >> The original persistent patch does memcpy in both netback and >> netfront side. I am thinking maybe the performance can become better >> if removing the memcpy from netfront. > > I would say that removing copy in netback can scale better. > >> Moreover, I also have a feeling that we got persistent grant >> performance based on default netperf params test, just like wei''s >> hack which does not get better performance without large packets. So >> let me try some test with large packets though. >> > > Sadly enough, I found out today these sort of test seems to be quite > inconsistent. On a Intel 10G Nic the throughput is actually higher > without enforcing iperf / netperf to generate large packets.When I have made performance measurements using iperf, I found that for a given point in the parameter space (e.g. for a fixed number of guests, interfaces, fixed parameters to iperf, fixed test run duration, etc.) the variation was typically _smaller than_ +/- 1 Gbit/s on a 10G NIC. I notice that your results don''t include any error bars or indication of standard deviation... With this sort of data (or, really, any data) measuring at least 5 times will help to get an idea of the fluctuations present (i.e. a measure of statistical uncertainty) by quoting a mean +/- standard deviation. Having the standard deviation (or other estimator for the uncertainty in the results) allows us to better determine how significant this difference in results really is. For example, is the high throughput you quoted (~ 14 Gbit/s) an upward fluctuation, and the low value (~6) a downward fluctuation? Having a mean and standard deviation would allow us to determine just how (in)compatible these values are. Assuming a Gaussian distribution (and when sampled sufficient times, "everything" tends to a Gaussian) you have an almost 5% chance that a result lies more than 2 standard deviations from the mean (and a 0.3% chance that it lies more than 3 s.d. from the mean!). Results that appear "high" or "low" may, therefore, not be entirely unexpected. 
Having a measure of the standard deviation provides some basis against which to determine how likely it is that a measured value is just statistical fluctuation, or whether it is a significant result. Another thing I noticed is that you''re running the iperf test for only 5 seconds. I have found in the past that iperf (or, more likely, TCP) takes a while to "ramp up" (even with all parameters fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes or more (e.g. "-t 120") give much more stable results. Andrew.> > > Wei. > >> Thanks >> Annie
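A minimal sketch of the summary Andrew is describing: quote each configuration as mean +/- sample standard deviation, express an individual run as a distance in standard deviations from that mean, and check the Gaussian tail probabilities mentioned above. The five throughput values are hypothetical, not measurements from this thread:

from math import erfc, sqrt
from statistics import mean, stdev

# Hypothetical five repeats of a single test configuration (not real data):
runs_gbps = [13.9, 14.3, 14.1, 14.4, 14.2]
m, s = mean(runs_gbps), stdev(runs_gbps)
print(f"throughput: {m:.2f} +/- {s:.2f} Gb/s (mean +/- sample s.d.)")

# How surprising would a single 13.5 Gb/s run be under this distribution?
z = abs(13.5 - m) / s
print(f"a 13.5 Gb/s run lies {z:.1f} s.d. from the mean")

# The Gaussian tail probabilities quoted above:
for k in (2, 3):
    print(f"P(|Z| > {k} s.d.) = {erfc(k / sqrt(2)):.4f}")   # ~0.0455 and ~0.0027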
Wei Liu
2013-Jul-01 17:55 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote: [...]> > > >Sadly enough, I found out today these sort of test seems to be quite > >inconsistent. On a Intel 10G Nic the throughput is actually higher > >without enforcing iperf / netperf to generate large packets. > > When I have made performance measurements using iperf, I found that > for a given point in the parameter space (e.g. for a fixed number of > guests, interfaces, fixed parameters to iperf, fixed test run > duration, etc.) the variation was typically _smaller than_ +/- 1 > Gbit/s on a 10G NIC. >I was talking about virtual interfaces vs. real hardware. The parameters that maximize throughput for one case don''t seem to work for the other. The deviation for a specific interface is rather small.> I notice that your results don''t include any error bars or > indication of standard deviation... > > With this sort of data (or, really, any data) measuring at least 5 > times will help to get an idea of the fluctuations present (i.e. a > measure of statistical uncertainty) by quoting a mean +/- standard > deviation. Having the standard deviation (or other estimator for the > uncertainty in the results) allows us to better determine how > significant this difference in results really is. > > For example, is the high throughput you quoted (~ 14 Gbit/s) an > upward fluctuation, and the low value (~6) a downward fluctuation? > Having a mean and standard deviation would allow us to determine > just how (in)compatible these values are. >I ran those tests several times and picked the number that appeared most often. Anyway, I will try to come up with better visualized graphs.> Assuming a Gaussian distribution (and when sampled sufficient times, > "everything" tends to a Gaussian) you have an almost 5% chance that > a result lies more than 2 standard deviations from the mean (and a > 0.3% chance that it lies more than 3 s.d. from the mean!). Results > that appear "high" or "low" may, therefore, not be entirely > unexpected. Having a measure of the standard deviation provides some > basis against which to determine how likely it is that a measured > value is just statistical fluctuation, or whether it is a > significant result. > > Another thing I noticed is that you''re running the iperf test for > only 5 seconds. I have found in the past that iperf (or, more > likely, TCP) takes a while to "ramp up" (even with all parameters > fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes > or more (e.g. "-t 120") give much more stable results. >Hmm... for me the length of the test doesn''t make much difference, that''s why I''ve chosen such a short time. Since you mention it, I intend to run the tests a bit longer.> Andrew. > > > > > >Wei. > > > >>Thanks > >>Annie
Wei Liu
2013-Jul-03 15:18 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote: [...]> >I would say that removing copy in netback can scale better. > > > >>Moreover, I also have a feeling that we got persistent grant > >>performance based on default netperf params test, just like wei''s > >>hack which does not get better performance without large packets. So > >>let me try some test with large packets though. > >> > > > >Sadly enough, I found out today these sort of test seems to be quite > >inconsistent. On a Intel 10G Nic the throughput is actually higher > >without enforcing iperf / netperf to generate large packets. > > When I have made performance measurements using iperf, I found that > for a given point in the parameter space (e.g. for a fixed number of > guests, interfaces, fixed parameters to iperf, fixed test run > duration, etc.) the variation was typically _smaller than_ +/- 1 > Gbit/s on a 10G NIC. > > I notice that your results don''t include any error bars or > indication of standard deviation... > > With this sort of data (or, really, any data) measuring at least 5 > times will help to get an idea of the fluctuations present (i.e. a > measure of statistical uncertainty) by quoting a mean +/- standard > deviation. Having the standard deviation (or other estimator for the > uncertainty in the results) allows us to better determine how > significant this difference in results really is. > > For example, is the high throughput you quoted (~ 14 Gbit/s) an > upward fluctuation, and the low value (~6) a downward fluctuation? > Having a mean and standard deviation would allow us to determine > just how (in)compatible these values are. > > Assuming a Gaussian distribution (and when sampled sufficient times, > "everything" tends to a Gaussian) you have an almost 5% chance that > a result lies more than 2 standard deviations from the mean (and a > 0.3% chance that it lies more than 3 s.d. from the mean!). Results > that appear "high" or "low" may, therefore, not be entirely > unexpected. Having a measure of the standard deviation provides some > basis against which to determine how likely it is that a measured > value is just statistical fluctuation, or whether it is a > significant result. > > Another thing I noticed is that you''re running the iperf test for > only 5 seconds. I have found in the past that iperf (or, more > likely, TCP) takes a while to "ramp up" (even with all parameters > fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes > or more (e.g. "-t 120") give much more stable results. > > Andrew. >Here you go, results for the new conducted benchmarks. Was about to do graph but found out not really worth it because it''s only single stream. For iperf tests unit is Gb/s, for netperf tests unit is Mb/s. COPY SCHEME iperf -c 10.80.237.127 -t 120 6.19 6.23 6.26 6.25 6.27 mean 6.24 s.d. 0.031622776601759 iperf -c 10.80.237.127 -t 120 -l 131072 6.07 6.07 6.03 6.06 6.06 mean 6.058 s.d. 0.016431676725514 netperf -H 10.80.237.127 -l120 -f m 5662.55 5636.6 5641.52 5631.39 5630.98 mean 5640.608 s.d. 13.0001642297036 netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072 5831.19 5833.03 5829.54 5838.89 5830.5 mean 5832.63 s.d. 3.72512415992628 PERMANENT MAP SCHEME "iperf -c 10.80.237.127 -t 120 2.42 2.41 2.41 2.42 2.43 mean 2.418 s.d. 0.00836660026531 iperf -c 10.80.237.127 -t 120 -l 131072 14.3 14.2 14.2 14.4 14.3 mean 14.28 s.d. 0.083666002653234 netperf -H 10.80.237.127 -l120 -f m 4632.27 4630.08 4633.18 4641.25 4632.23 mean 4633.802 s.d. 
4.31656924013371 netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072 10556.04 10532.89 10541.83 10552.77 10546.77 mean 10546.06 s.d. 9.17156475133789 Short run of iperf / netperf was conducted before each test run so that the system was "warmed-up". The results show that the single stream performance is quite stable. Also there''s not much difference between running tests for 5s or 120s. Wei.> > > > > >Wei. > > > >>Thanks > >>Annie
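For what it is worth, the mean / s.d. figures above can be reproduced with the sample standard deviation (n - 1 in the denominator); for example, taking the copy-scheme default iperf runs and the map-scheme iperf -l 131072 runs:

from statistics import mean, stdev

copy_iperf_default = [6.19, 6.23, 6.26, 6.25, 6.27]   # Gb/s, copy scheme, default iperf
map_iperf_128k = [14.3, 14.2, 14.2, 14.4, 14.3]       # Gb/s, map scheme, -l 131072

for name, runs in (("COPY iperf", copy_iperf_default),
                   ("MAP iperf -l 131072", map_iperf_128k)):
    print(f"{name}: mean {mean(runs):.3f}  s.d. {stdev(runs):.4f}")
# -> COPY iperf: mean 6.240  s.d. 0.0316
# -> MAP iperf -l 131072: mean 14.280  s.d. 0.0837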