Wei Liu
2013-Jun-12 10:14 UTC
Interesting observation with network event notification and batching
Hi all,

I'm hacking on netback trying to identify whether TLB flushes cause a heavy
performance penalty on the Tx path. The hack is quite nasty (you would not
want to know, trust me).

Basically what is doesn't is, 1) alter the network protocol to pass along
mfns instead of grant references, 2) when the backend sees a new mfn, map it
RO and cache it in its own address space.

With this hack we now have some sort of zero-copy TX path. The backend
doesn't need to issue any grant copy / map operation any more. When it sees
a new packet in the ring, it just needs to pick up the pages in its own
address space, assemble packets with those pages and pass the packets on to
the network stack.

In theory this should boost performance, but in practice it is the other way
around. This hack makes Xen networking more than 50% slower than before
(OMG). Further investigation shows that with this hack the batching ability
is gone. Before this hack, netback batches around 64 slots per interrupt
event; after this hack it only batches 3 slots per interrupt event -- that's
no batching at all, because we can expect one packet to occupy 3 slots.

Time to have some figures (iperf from DomU to Dom0).

Before the hack, doing grant copy, throughput: 7.9 Gb/s, average slots per
batch 64.

After the hack, throughput: 2.5 Gb/s, average slots per batch 3.

After the hack, adding 64 HYPERVISOR_xen_version calls (it just does a
context switch into the hypervisor) in the Tx path, throughput: 3.2 Gb/s,
average slots per batch 6.

After the hack, adding 256 HYPERVISOR_xen_version calls in the Tx path,
throughput: 5.2 Gb/s, average slots per batch 26.

After the hack, adding 512 HYPERVISOR_xen_version calls in the Tx path,
throughput: 7.9 Gb/s, average slots per batch 26.

After the hack, adding 768 HYPERVISOR_xen_version calls in the Tx path,
throughput: 5.6 Gb/s, average slots per batch 25.

After the hack, adding 1024 HYPERVISOR_xen_version calls in the Tx path,
throughput: 4.4 Gb/s, average slots per batch 25.

Average slots per batch is calculated as follows:
 1. count total_slots processed from start of day
 2. count tx_count, which is the number of times the tx_action function
    gets invoked
 3. avg_slots_per_tx = total_slots / tx_count

The counter-intuitive figures imply that there is something wrong with the
current batching mechanism. Probably we need to fine-tune the batching
behavior for network and play with event pointers in the ring (actually I'm
looking into it now). It would be good to have some input on this.

Konrad, IIRC you once mentioned you discovered something with event
notification, what's that?

To all, any thoughts?


Wei.
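[Editor's illustration: the instrumentation implied by the calculation above
is just a pair of counters in the netback Tx path. A minimal sketch follows;
the counter names total_slots and tx_count come from the description, the
helper names are mine, and this is not the actual patch.]

  /* Sketch only: counters for measuring batching as described above. */
  static unsigned long total_slots;  /* slots consumed since start of day */
  static unsigned long tx_count;     /* number of tx_action invocations   */

  static void account_batch(unsigned int slots_this_run)
  {
          tx_count++;
          total_slots += slots_this_run;
  }

  /* Read out at the end of the run: */
  static unsigned long avg_slots_per_tx(void)
  {
          return tx_count ? total_slots / tx_count : 0;
  }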
Konrad Rzeszutek Wilk
2013-Jun-14 18:53 UTC
Re: Interesting observation with network event notification and batching
On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> Hi all
>
> I'm hacking on netback trying to identify whether TLB flushes cause a
> heavy performance penalty on the Tx path. The hack is quite nasty (you
> would not want to know, trust me).
>
> Basically what is doesn't is, 1) alter the network protocol to pass along

You probably meant: "what it does" ?

> mfns instead of grant references, 2) when the backend sees a new mfn, map
> it RO and cache it in its own address space.
>
> With this hack we now have some sort of zero-copy TX path. The backend
> doesn't need to issue any grant copy / map operation any more. When it
> sees a new packet in the ring, it just needs to pick up the pages in its
> own address space, assemble packets with those pages and pass the packets
> on to the network stack.

Uh, so not sure I understand the RO part. If dom0 is mapping it won't that
trigger a PTE update? And doesn't somebody (either the guest or initial
domain) do a grant mapping to let the hypervisor know it is OK to map a
grant?

Or is dom0 actually permitted to map the MFN of any guest without using the
grants? In which case you are then using _PAGE_IOMAP somewhere and setting
up vmap entries with the MFNs that point to the foreign domain - I think?

> In theory this should boost performance, but in practice it is the other
> way around. This hack makes Xen networking more than 50% slower than
> before (OMG). Further investigation shows that with this hack the batching
> ability is gone. Before this hack, netback batches around 64 slots per

That is quite interesting.

> interrupt event; after this hack it only batches 3 slots per interrupt
> event -- that's no batching at all, because we can expect one packet to
> occupy 3 slots.

Right.

> Time to have some figures (iperf from DomU to Dom0).
> [...]
> After the hack, adding 1024 HYPERVISOR_xen_version calls in the Tx path,
> throughput: 4.4 Gb/s, average slots per batch 25.

How do you get it to do more HYPERVISOR_xen_version? Did you just add a
(for i = 1024; i > 0; i--) hypervisor_yield(); in netback?

> Average slots per batch is calculated as follows:
> [...]
> The counter-intuitive figures imply that there is something wrong with the
> current batching mechanism. Probably we need to fine-tune the batching
> behavior for network and play with event pointers in the ring (actually
> I'm looking into it now). It would be good to have some input on this.

I am still unsure I understand how your changes would incur more of the
yields.

> Konrad, IIRC you once mentioned you discovered something with event
> notification, what's that?

They were bizarre. I naively expected some form of # of physical NIC
interrupts to be around the same as the VIF or less. And I figured that the
amount of interrupts would be constant regardless of the size of the
packets. In other words #packets == #interrupts.

In reality the number of interrupts the VIF had was about the same while
for the NIC it would fluctuate. (I can't remember the details).

But it was odd and I didn't go deeper into it to figure out what was
happening. And also to figure out if for the VIF we could do something of
#packets != #interrupts. And hopefully some mechanism to adjust so that the
amount of interrupts would be lesser per packet (hand waving here).
Wei Liu
2013-Jun-16 09:54 UTC
Re: Interesting observation with network event notification and batching
On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> > [...]
> > Basically what is doesn't is, 1) alter the network protocol to pass along
>
> You probably meant: "what it does" ?
>

Oh yes! Muscle memory got me!

> > mfns instead of grant references, 2) when the backend sees a new mfn,
> > map it RO and cache it in its own address space.
> > [...]
>
> Uh, so not sure I understand the RO part. If dom0 is mapping it won't that
> trigger a PTE update? And doesn't somebody (either the guest or initial
> domain) do a grant mapping to let the hypervisor know it is OK to map a
> grant?
>

It is very easy to issue HYPERVISOR_mmu_update to alter Dom0's mapping,
because Dom0 is privileged.

> Or is dom0 actually permitted to map the MFN of any guest without using
> the grants? In which case you are then using _PAGE_IOMAP somewhere and
> setting up vmap entries with the MFNs that point to the foreign domain -
> I think?
>

Sort of, but I didn't use vmap, I used alloc_page to get actual pages. Then
I modified the underlying PTE to point to the MFN from netfront.

> > [...]
> > After the hack, adding 1024 HYPERVISOR_xen_version calls in the Tx path,
> > throughput: 4.4 Gb/s, average slots per batch 25.
>
> How do you get it to do more HYPERVISOR_xen_version? Did you just add a
> (for i = 1024; i > 0; i--) hypervisor_yield();

for (i = 0; i < X; i++)
        (void)HYPERVISOR_xen_version(0, NULL);

> in netback?
>
> > Average slots per batch is calculated as follows:
> > [...]
>
> I am still unsure I understand how your changes would incur more of the
> yields.

It's not yielding. At least that's not the purpose of that hypercall.
HYPERVISOR_xen_version(0, NULL) only does guest -> hypervisor -> guest
context switching. The original purpose of HYPERVISOR_xen_version(0, NULL)
is to force the guest to check pending events.

Since you mentioned yielding, I will also try yielding and post figures.

> > Konrad, IIRC you once mentioned you discovered something with event
> > notification, what's that?
>
> They were bizarre. I naively expected some form of # of physical NIC
> interrupts to be around the same as the VIF or less. And I figured that
> the amount of interrupts would be constant regardless of the size of the
> packets. In other words #packets == #interrupts.
>

It could be that the frontend notifies the backend for every packet it
sends. This is not desirable and I don't expect the ring to behave that way.

> In reality the number of interrupts the VIF had was about the same while
> for the NIC it would fluctuate. (I can't remember the details).
>

I'm not sure I understand you here. But for the NIC, if you see the number
of interrupts go from high to low that's expected. When the NIC has a very
high interrupt rate it turns to polling mode.

> But it was odd and I didn't go deeper into it to figure out what was
> happening. And also to figure out if for the VIF we could do something of
> #packets != #interrupts. And hopefully some mechanism to adjust so that
> the amount of interrupts would be lesser per packet (hand waving here).

I'm trying to do this now.


Wei.
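[Editor's illustration: the alloc_page-plus-PTE-rewrite approach described
above might look roughly like the sketch below. This is a reading of the
description, not Wei's actual code, and it uses HYPERVISOR_update_va_mapping
rather than a raw HYPERVISOR_mmu_update for brevity.]

  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <xen/interface/xen.h>
  #include <asm/xen/hypercall.h>
  #include <asm/xen/page.h>

  /* Sketch only: map a guest MFN read-only into dom0's address space by
   * rewriting the PTE of a locally allocated page. */
  static void *map_guest_mfn_ro(unsigned long mfn)
  {
          struct page *page = alloc_page(GFP_KERNEL);
          unsigned long va;

          if (!page)
                  return NULL;

          va = (unsigned long)page_address(page);
          /* Point the kernel mapping of this page at the foreign MFN,
           * read-only.  Dom0 is privileged, so no grant is involved. */
          if (HYPERVISOR_update_va_mapping(va,
                                           mfn_pte(mfn, PAGE_KERNEL_RO),
                                           UVMF_INVLPG)) {
                  __free_page(page);
                  return NULL;
          }
          return (void *)va;
  }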
Wei Liu
2013-Jun-16 12:46 UTC
Re: Interesting observation with network event notification and batching
On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
[...]
> How do you get it to do more HYPERVISOR_xen_version? Did you just add a
> (for i = 1024; i > 0; i--) hypervisor_yield();
>

Here are the figures with HYPERVISOR_xen_version(0, NULL) replaced by
HYPERVISOR_sched_op(SCHEDOP_yield, NULL):

64 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 5.15 Gb/s, average
slots per tx 25

128 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 7.75 Gb/s, average
slots per tx 26

512 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 1.74 Gb/s, average
slots per tx 18

1024 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 998 Mb/s, average
slots per tx 18

Please note that Dom0 and DomU run on different PCPUs.

I think this kind of behavior has something to do with the scheduler. But
down at the bottom we should really fix the notification mechanism.


Wei.
Ian Campbell
2013-Jun-17 09:38 UTC
Re: Interesting observation with network event notification and batching
On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > > Konrad, IIRC you once mentioned you discovered something with event
> > > notification, what's that?
> >
> > They were bizarre. I naively expected some form of # of physical NIC
> > interrupts to be around the same as the VIF or less. And I figured that
> > the amount of interrupts would be constant regardless of the size of the
> > packets. In other words #packets == #interrupts.
> >
>
> It could be that the frontend notifies the backend for every packet it
> sends. This is not desirable and I don't expect the ring to behave that
> way.

It is probably worth checking that things are working how we think they
should, i.e. that netback's calls to RING_FINAL_CHECK_FOR_... and netfront's
calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at suitable points to
maximise batching.

Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
loop right? This would push the req_event pointer to just after the last
request, meaning the next request enqueued by the frontend would cause a
notification -- even though the backend is actually still continuing to
process requests and would have picked up that packet without further
notification. In this case there is a fair bit of work left in the backend
for this iteration, i.e. plenty of opportunity for the frontend to queue
more requests.

The comments in ring.h say:
 * These macros will set the req_event/rsp_event field to trigger a
 * notification on the very next message that is enqueued. If you want to
 * create batches of work (i.e., only receive a notification after several
 * messages have been enqueued) then you will need to create a customised
 * version of the FINAL_CHECK macro in your own code, which sets the event
 * field appropriately.

Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop (and
other similar loops) and add a FINAL check at the very end?

> > But it was odd and I didn't go deeper into it to figure out what was
> > happening. And also to figure out if for the VIF we could do something
> > of #packets != #interrupts. And hopefully some mechanism to adjust so
> > that the amount of interrupts would be lesser per packet (hand waving
> > here).
>
> I'm trying to do this now.

What scheme do you have in mind?
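[Editor's illustration: roughly what the restructuring suggested above might
look like against the netback Tx loop. Names follow the 3.x netback (vif->tx
is the TX back ring); this is a sketch, not a tested patch.]

  /* Sketch only: consume requests without touching req_event in the hot
   * loop, and only do the FINAL check (which advances req_event) once
   * netback has genuinely run out of work. */
  static void tx_build_gops_sketch(struct xenvif *vif)
  {
          int work_to_do;

          for (;;) {
                  /* Fast path: just look at req_prod, leave req_event
                   * alone. */
                  while (RING_HAS_UNCONSUMED_REQUESTS(&vif->tx)) {
                          /* ... consume one request here (elided),
                           * advancing vif->tx.req_cons and building the
                           * grant op for it ... */
                  }

                  /* Ring looks empty: now set req_event and re-check.
                   * Only if it is still empty do we ask the frontend to
                   * notify us again. */
                  RING_FINAL_CHECK_FOR_REQUESTS(&vif->tx, work_to_do);
                  if (!work_to_do)
                          break;
          }
  }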
Andrew Bennieston
2013-Jun-17 09:56 UTC
Re: Interesting observation with network event notification and batching
On 17/06/13 10:38, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>> [...]
>> It could be that the frontend notifies the backend for every packet it
>> sends. This is not desirable and I don't expect the ring to behave that
>> way.

I have observed this kind of behaviour during network performance tests in
which I periodically checked the ring state during an iperf session. It
looked to me like the frontend was sending notifications far too often, but
that the backend was sending them very infrequently, so the Tx (from guest)
ring was mostly empty and the Rx (to guest) ring was mostly full. This has
the effect of both front- and backend having to block occasionally, waiting
for the other end to clear or fill a ring, even though there is more data
available.

My initial theory was that this was caused in part by the shared event
channel; however, I expect that Wei is testing on top of a kernel with his
split event channel features?

> It is probably worth checking that things are working how we think they
> should, i.e. that netback's calls to RING_FINAL_CHECK_FOR_... and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at suitable
> points to maximise batching.
> [...]
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop (and
> other similar loops) and add a FINAL check at the very end?
>
>> [...]
>> I'm trying to do this now.
>
> What scheme do you have in mind?

As I mentioned above, filling a ring completely appears to be almost as bad
as sending too many notifications. The ideal scheme may involve trying to
balance the ring at some "half-full" state, depending on the capacity of the
front- and backends to process requests and responses.

Andrew.
Jan Beulich
2013-Jun-17 10:06 UTC
Re: Interesting observation with network event notification and batching
>>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>> [...]
> [...]
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?

But then again the macro doesn't update req_event when there are unconsumed
requests already upon entry to the macro.

Jan
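[Editor's note: for reference, the macro in question, paraphrased from
xen's public io/ring.h -- the early break is the behaviour Jan refers to.]

  /* Paraphrased from xen/include/public/io/ring.h: req_event is only
   * advanced when the ring appears empty on entry; if there are already
   * unconsumed requests the macro returns without touching it. */
  #define RING_FINAL_CHECK_FOR_REQUESTS(_r, _work_to_do) do {     \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);           \
      if (_work_to_do) break;                                     \
      (_r)->sring->req_event = (_r)->req_cons + 1;                \
      mb();                                                       \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);           \
  } while (0)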
Ian Campbell
2013-Jun-17 10:16 UTC
Re: Interesting observation with network event notification and batching
On Mon, 2013-06-17 at 11:06 +0100, Jan Beulich wrote:
> >>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > [...]
> > Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> > (and other similar loops) and add a FINAL check at the very end?
>
> But then again the macro doesn't update req_event when there are
> unconsumed requests already upon entry to the macro.

My concern was that when we process the last request currently on the ring
we immediately set it forward, even though netback goes on to do a bunch
more work (including e.g. the grant copies) before looping back and looking
for more work. That's a potentially large window for the frontend to enqueue
and then needlessly notify a new packet. It could potentially lead to a
pathological case of notifying every packet unnecessarily.

Ian.
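[Editor's note: the frontend side of that interaction, again paraphrased
from io/ring.h -- the producer only notifies if req_event falls within the
range of requests it has just pushed, so an eagerly advanced req_event
translates directly into an extra notification.]

  /* Paraphrased from xen/include/public/io/ring.h: _notify ends up true
   * whenever req_event lies in (old_prod, new_prod], i.e. whenever the
   * consumer asked to be told about one of the requests just pushed. */
  #define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {   \
      RING_IDX __old = (_r)->sring->req_prod;                     \
      RING_IDX __new = (_r)->req_prod_pvt;                        \
      wmb(); /* backend sees requests before the producer index */ \
      (_r)->sring->req_prod = __new;                              \
      mb();  /* backend sees new requests before we read req_event */ \
      (_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <   \
                   (RING_IDX)(__new - __old));                    \
  } while (0)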
Wei Liu
2013-Jun-17 10:35 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > [...]
> [...]
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop (and
> other similar loops) and add a FINAL check at the very end?
>
> > > But it was odd and I didn't go deeper into it to figure out what was
> > > happening. And also to figure out if for the VIF we could do something
> > > of #packets != #interrupts. And hopefully some mechanism to adjust so
> > > that the amount of interrupts would be lesser per packet (hand waving
> > > here).
> >
> > I'm trying to do this now.
>
> What scheme do you have in mind?

Basically the one you mentioned above.

Playing with various event pointers now.


Wei.
Wei Liu
2013-Jun-17 10:46 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> On 17/06/13 10:38, Ian Campbell wrote:
> > [...]
>
> I have observed this kind of behaviour during network performance tests in
> which I periodically checked the ring state during an iperf session. It
> looked to me like the frontend was sending notifications far too often,
> but that the backend was sending them very infrequently, so the Tx (from
> guest) ring was mostly empty and the Rx (to guest) ring was mostly full.
> This has the effect of both front- and backend having to block
> occasionally, waiting for the other end to clear or fill a ring, even
> though there is more data available.
>
> My initial theory was that this was caused in part by the shared event
> channel; however, I expect that Wei is testing on top of a kernel with his
> split event channel features?
>

Yes, with split event channels.

And during the tests the interrupt counts differ hugely: the frontend TX
interrupt count is a six-figure number while the frontend RX count is a
two-figure number.

> [...]
>
> As I mentioned above, filling a ring completely appears to be almost as
> bad as sending too many notifications. The ideal scheme may involve trying
> to balance the ring at some "half-full" state, depending on the capacity
> of the front- and backends to process requests and responses.
>

I don't think filling the ring full causes any problem, that's conceptually
just the same as the "half-full" state if you need to throttle the ring. The
real problem is how to do notifications correctly.


Wei.

> Andrew.
Andrew Bennieston
2013-Jun-17 10:56 UTC
Re: Interesting observation with network event notification and batching
On 17/06/13 11:46, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> > [...]
> > As I mentioned above, filling a ring completely appears to be almost as
> > bad as sending too many notifications. The ideal scheme may involve
> > trying to balance the ring at some "half-full" state, depending on the
> > capacity of the front- and backends to process requests and responses.
>
> I don't think filling the ring full causes any problem, that's
> conceptually just the same as the "half-full" state if you need to
> throttle the ring.

My understanding was that filling the ring will cause the producer to sleep
until slots become available (i.e. until the consumer notifies it that it
has removed something from the ring).

I'm just concerned that overly aggressive batching may lead to a situation
where the consumer is sitting idle, waiting for a notification that the
producer hasn't yet sent because it can still fill more slots on the ring.
When the ring is completely full, the producer would have to wait for the
ring to partially empty. At this point, the consumer would hold off
notifying because it can still batch more processing, so the producer is
left waiting. (Repeat as required.) It would be better to have both producer
and consumer running concurrently.

I mention this mainly so that we don't end up with a swing to the polar
opposite of what we have now, which (to my mind) is just as bad. Clearly
this is an edge case, but if there's a reason I'm missing that this can't
happen (e.g. after a period of inactivity) then don't hesitate to point it
out :)

(Perhaps "half-full" was misleading... the optimal state may be "just enough
room for one more packet", or something along those lines...)

Andrew
Ian Campbell
2013-Jun-17 11:08 UTC
Re: Interesting observation with network event notification and batching
On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
> On 17/06/13 11:46, Wei Liu wrote:
> > [...]
> > I don't think filling the ring full causes any problem, that's
> > conceptually just the same as the "half-full" state if you need to
> > throttle the ring.
>
> My understanding was that filling the ring will cause the producer to
> sleep until slots become available (i.e. until the consumer notifies it
> that it has removed something from the ring).
>
> I'm just concerned that overly aggressive batching may lead to a situation
> where the consumer is sitting idle, waiting for a notification that the
> producer hasn't yet sent because it can still fill more slots on the ring.
> When the ring is completely full, the producer would have to wait for the
> ring to partially empty. At this point, the consumer would hold off
> notifying because it can still batch more processing, so the producer is
> left waiting. (Repeat as required.) It would be better to have both
> producer and consumer running concurrently.
>
> I mention this mainly so that we don't end up with a swing to the polar
> opposite of what we have now, which (to my mind) is just as bad. Clearly
> this is an edge case, but if there's a reason I'm missing that this can't
> happen (e.g. after a period of inactivity) then don't hesitate to point it
> out :)

Doesn't the separation between req_event and rsp_event help here?

So if the producer fills the ring, it will sleep, but set rsp_event
appropriately so that when the backend completes some (but not all) work it
will be woken up and can put extra stuff on the ring.

It shouldn't need to wait for the backend to process the whole batch for
this.

> (Perhaps "half-full" was misleading... the optimal state may be "just
> enough room for one more packet", or something along those lines...)
>
> Andrew
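[Editor's illustration: netfront's TX completion path already does something
along these lines. Paraphrased from memory of xennet_tx_buf_gc(), not an
exact copy: after consuming TX responses it sets rsp_event roughly half-way
into the still-outstanding requests rather than at the very next response.]

  RING_IDX cons, prod;

  do {
          prod = np->tx.sring->rsp_prod;
          rmb();  /* ensure we see responses up to prod */

          for (cons = np->tx.rsp_cons; cons != prod; cons++) {
                  /* ... free the skb / grant for this TX response ... */
          }
          np->tx.rsp_cons = prod;

          /* Ask to be notified again only once about half of the
           * in-flight requests have completed. */
          np->tx.sring->rsp_event =
                  prod + ((np->tx.sring->req_prod - prod) >> 1) + 1;
          mb();
  } while ((cons == prod) && (prod != np->tx.sring->rsp_prod));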
annie li
2013-Jun-17 11:34 UTC
Re: Interesting observation with network event notification and batching
On 2013-6-17 18:35, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
>> [...]
>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>> loop right? This would push the req_event pointer to just after the last
>> request, meaning the next request enqueued by the frontend would cause a
>> notification -- even though the backend is actually still continuing to
>> process requests and would have picked up that packet without further
>> notification. In this case there is a fair bit of work left in the
>> backend for this iteration, i.e. plenty of opportunity for the frontend
>> to queue more requests.
>> [...]
>> What scheme do you have in mind?
>
> Basically the one you mentioned above.
>
> Playing with various event pointers now.

Did you collect data on how many requests netback processes when req_event
is updated in RING_FINAL_CHECK_FOR_REQUESTS? I assume this value is pretty
small from your test results.

How about not updating req_event every time there is no unconsumed request
in RING_FINAL_CHECK_FOR_REQUESTS?

Thanks
Annie
Andrew Bennieston
2013-Jun-17 11:55 UTC
Re: Interesting observation with network event notification and batching
On 17/06/13 12:08, Ian Campbell wrote:
> On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
>> [...]
>> I mention this mainly so that we don't end up with a swing to the polar
>> opposite of what we have now, which (to my mind) is just as bad. Clearly
>> this is an edge case, but if there's a reason I'm missing that this can't
>> happen (e.g. after a period of inactivity) then don't hesitate to point
>> it out :)
>
> Doesn't the separation between req_event and rsp_event help here?
>
> So if the producer fills the ring, it will sleep, but set rsp_event
> appropriately so that when the backend completes some (but not all) work
> it will be woken up and can put extra stuff on the ring.
>
> It shouldn't need to wait for the backend to process the whole batch for
> this.

Right. As long as this logic doesn't get inadvertently changed in an attempt
to improve batching of events!
Wei Liu
2013-Jun-28 16:15 UTC
Re: Interesting observation with network event notification and batching
Hi all,

After collecting more stats and comparing the copying / mapping cases, I now
have some more interesting findings, which might contradict what I said
before.

I tuned the runes I used for the benchmark to make sure iperf and netperf
generate large packets (~64K). Here are the runes I use:

  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k    (see note)
  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072

                          COPY      MAP
 iperf   Tput:            6.5Gb/s   14Gb/s (was 2.5Gb/s)
         PPI              2.90      1.07
         SPI              37.75     13.69
         PPN              2.90      1.07
         SPN              37.75     13.69
         tx_count         31808     174769
         nr_napi_schedule 31805     174697
         total_packets    92354     187408
         total_reqs       1200793   2392614

 netperf Tput:            5.8Gb/s   10.5Gb/s
         PPI              2.13      1.00
         SPI              36.70     16.73
         PPN              2.13      1.31
         SPN              36.70     16.75
         tx_count         57635     205599
         nr_napi_schedule 57633     205311
         total_packets    122800    270254
         total_reqs       2115068   3439751

 PPI: packets processed per interrupt
 SPI: slots processed per interrupt
 PPN: packets processed per napi schedule
 SPN: slots processed per napi schedule
 tx_count: interrupt count
 total_reqs: total slots used during test

* Notification and batching

Are notification and batching really a problem? I'm not so sure now. My
first thought, before I measured PPI / PPN / SPI / SPN in the copying case,
was that "in that case netback *must* have better batching", which turned
out not to be very true -- copying mode makes netback slower, but the
batching gained is not huge.

Ideally we still want to batch as much as possible. Possible ways include
playing with the 'weight' parameter in NAPI. But as the figures show,
batching seems not to be very important for throughput, at least for now.
If the NAPI framework and netfront / netback are doing their jobs as
designed we might not need to worry about this now.

Andrew, do you have any thoughts on this? You found out that NAPI didn't
scale well with multi-threaded iperf in DomU, do you have any handle on how
that can happen?

* Thoughts on zero-copy TX

With this hack we are able to achieve 10Gb/s with a single stream, which is
good. But with the classic XenoLinux kernel, which has zero-copy TX, we
weren't able to achieve this. I also developed another zero-copy netback
prototype one year ago with Ian's out-of-tree skb frag destructor patch
series. That prototype couldn't achieve 10Gb/s either (IIRC the performance
was more or less the same as copying mode, about 6~7Gb/s).

My hack maps all necessary pages permanently, there is no unmap; we skip
lots of page table manipulation and TLB flushes. So my basic conclusion is
that page table manipulation and TLB flushes do incur a heavy performance
penalty.

This hack cannot be upstreamed in any way. If we're to re-introduce
zero-copy TX, we would need to implement some sort of lazy flushing
mechanism. I haven't thought this through. Presumably this mechanism would
also benefit blk somehow? I'm not sure yet.

Could persistent mapping (with the to-be-developed reclaim / MRU list
mechanism) be useful here? So that we can unify the blk and net drivers?

* Changes required to introduce zero-copy TX

1. SKB frag destructor series: to track the life cycle of SKB frags. This
   is not yet upstreamed.

2. Mechanism to negotiate the max slots the frontend can use: mapping
   requires the backend's MAX_SKB_FRAGS >= the frontend's MAX_SKB_FRAGS.

3. Lazy flushing mechanism or persistent grants: ???


Wei.

* Note

In my previous tests I only ran iperf and didn't have the right rune to
generate large packets. Iperf seems to have a behavior of increasing the
packet size as time goes by. In the copying case the packet size was
increased to 64K eventually, while in the mapping case something odd
happened (I believe that must be due to a bug in my hack :-/) -- the packet
size was always the default size (8K). Adding '-l 131072' to iperf makes
sure that the packet is always 64K.
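[Editor's aside on the metric definitions above: the derived figures are
plain ratios of the raw counters. A quick standalone check, using the
COPY/iperf column (this snippet is illustrative only and not from the
thread):]

  #include <stdio.h>

  int main(void)
  {
          /* Raw counters from the COPY/iperf column of the table above. */
          double tx_count = 31808, nr_napi_schedule = 31805;
          double total_packets = 92354, total_reqs = 1200793;

          /* Prints: PPI 2.90 SPI 37.75 PPN 2.90 SPN 37.75 */
          printf("PPI %.2f SPI %.2f PPN %.2f SPN %.2f\n",
                 total_packets / tx_count, total_reqs / tx_count,
                 total_packets / nr_napi_schedule,
                 total_reqs / nr_napi_schedule);
          return 0;
  }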
annie li
2013-Jul-01 07:48 UTC
Re: Interesting observation with network event notification and batching
On 2013-6-29 0:15, Wei Liu wrote:> Hi all, > > After collecting more stats and comparing copying / mapping cases, I now > have some more interesting finds, which might contradict what I said > before. > > I tuned the runes I used for benchmark to make sure iperf and netperf > generate large packets (~64K). Here are the runes I use: > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > COPY MAP > iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets?> PPI 2.90 1.07 > SPI 37.75 13.69 > PPN 2.90 1.07 > SPN 37.75 13.69 > tx_count 31808 174769Seems interrupt count does not affect the performance at all with -l 131072 -w 128k.> nr_napi_schedule 31805 174697 > total_packets 92354 187408 > total_reqs 1200793 2392614 > > netperf Tput: 5.8Gb/s 10.5Gb/s > PPI 2.13 1.00 > SPI 36.70 16.73 > PPN 2.13 1.31 > SPN 36.70 16.75 > tx_count 57635 205599 > nr_napi_schedule 57633 205311 > total_packets 122800 270254 > total_reqs 2115068 3439751 > > PPI: packets processed per interrupt > SPI: slots processed per interrupt > PPN: packets processed per napi schedule > SPN: slots processed per napi schedule > tx_count: interrupt count > total_reqs: total slots used during test > > * Notification and batching > > Is notification and batching really a problem? I''m not so sure now. My > first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > case was that "in that case netback *must* have better batching" which > turned out not very true -- copying mode makes netback slower, however > the batching gained is not hugh. > > Ideally we still want to batch as much as possible. Possible way > includes playing with the ''weight'' parameter in NAPI. But as the figures > show batching seems not to be very important for throughput, at least > for now. If the NAPI framework and netfront / netback are doing their > jobs as designed we might not need to worry about this now. > > Andrew, do you have any thought on this? You found out that NAPI didn''t > scale well with multi-threaded iperf in DomU, do you have any handle how > that can happen? > > * Thoughts on zero-copy TX > > With this hack we are able to achieve 10Gb/s single stream, which is > good. But, with classic XenoLinux kernel which has zero copy TX we > didn''t able to achieve this. I also developed another zero copy netback > prototype one year ago with Ian''s out-of-tree skb frag destructor patch > series. That prototype couldn''t achieve 10Gb/s either (IIRC the > performance was more or less the same as copying mode, about 6~7Gb/s). > > My hack maps all necessary pages permantently, there is no unmap, we > skip lots of page table manipulation and TLB flushes. So my basic > conclusion is that page table manipulation and TLB flushes do incur > heavy performance penalty. > > This hack can be upstreamed in no way. If we''re to re-introduce > zero-copy TX, we would need to implement some sort of lazy flushing > mechanism. I haven''t thought this through. Presumably this mechanism > would also benefit blk somehow? I''m not sure yet. > > Could persistent mapping (with the to-be-developed reclaim / MRU list > mechanism) be useful here? So that we can unify blk and net drivers? > > * Changes required to introduce zero-copy TX > > 1. SKB frag destructor series: to track life cycle of SKB frags. 
This is > not yet upstreamed. Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?> > 2. Mechanism to negotiate max slots frontend can use: mapping requires > backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > 3. Lazy flushing mechanism or persistent grants: ??? I did some tests with persistent grants before; they did not show better performance than grant copy. But I was using the default params of netperf and did not try large packet sizes. Your results remind me that persistent grants might get similar results with larger packet sizes too. Thanks Annie
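For reference, the PPI / SPI / PPN / SPN figures in the table quoted above follow directly from the raw counters. A minimal Python sketch of that arithmetic (a reconstruction for illustration, not code from this thread):

def batching_stats(total_packets, total_reqs, tx_count, nr_napi_schedule):
    # Packets / slots per interrupt and per NAPI schedule, as defined above.
    return {
        "PPI": total_packets / tx_count,
        "SPI": total_reqs / tx_count,
        "PPN": total_packets / nr_napi_schedule,
        "SPN": total_reqs / nr_napi_schedule,
    }

for name, counters in (
    ("iperf COPY", dict(total_packets=92354, total_reqs=1200793,
                        tx_count=31808, nr_napi_schedule=31805)),
    ("iperf MAP", dict(total_packets=187408, total_reqs=2392614,
                       tx_count=174769, nr_napi_schedule=174697)),
):
    stats = batching_stats(**counters)
    print(name, " ".join(f"{k}={v:.2f}" for k, v in stats.items()))
# Reproduces the iperf columns of the table above, to within rounding in
# the last digit.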
Wei Liu
2013-Jul-01 08:54 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:> > On 2013-6-29 0:15, Wei Liu wrote: > >Hi all, > > > >After collecting more stats and comparing copying / mapping cases, I now > >have some more interesting finds, which might contradict what I said > >before. > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > >generate large packets (~64K). Here are the runes I use: > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > COPY MAP > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > So with default iperf setting, copy is about 7.9G, and map is about > 2.5G? How about the result of netperf without large packets? >First question, yes. Second question, 5.8Gb/s. And I believe for the copying scheme without large packet the throuput is more or less the same.> > PPI 2.90 1.07 > > SPI 37.75 13.69 > > PPN 2.90 1.07 > > SPN 37.75 13.69 > > tx_count 31808 174769 > > Seems interrupt count does not affect the performance at all with -l > 131072 -w 128k. >Right.> > nr_napi_schedule 31805 174697 > > total_packets 92354 187408 > > total_reqs 1200793 2392614 > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > PPI 2.13 1.00 > > SPI 36.70 16.73 > > PPN 2.13 1.31 > > SPN 36.70 16.75 > > tx_count 57635 205599 > > nr_napi_schedule 57633 205311 > > total_packets 122800 270254 > > total_reqs 2115068 3439751 > > > > PPI: packets processed per interrupt > > SPI: slots processed per interrupt > > PPN: packets processed per napi schedule > > SPN: slots processed per napi schedule > > tx_count: interrupt count > > total_reqs: total slots used during test > > > >* Notification and batching > > > >Is notification and batching really a problem? I''m not so sure now. My > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > >case was that "in that case netback *must* have better batching" which > >turned out not very true -- copying mode makes netback slower, however > >the batching gained is not hugh. > > > >Ideally we still want to batch as much as possible. Possible way > >includes playing with the ''weight'' parameter in NAPI. But as the figures > >show batching seems not to be very important for throughput, at least > >for now. If the NAPI framework and netfront / netback are doing their > >jobs as designed we might not need to worry about this now. > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > >scale well with multi-threaded iperf in DomU, do you have any handle how > >that can happen? > > > >* Thoughts on zero-copy TX > > > >With this hack we are able to achieve 10Gb/s single stream, which is > >good. But, with classic XenoLinux kernel which has zero copy TX we > >didn''t able to achieve this. I also developed another zero copy netback > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > >performance was more or less the same as copying mode, about 6~7Gb/s). > > > >My hack maps all necessary pages permantently, there is no unmap, we > >skip lots of page table manipulation and TLB flushes. So my basic > >conclusion is that page table manipulation and TLB flushes do incur > >heavy performance penalty. > > > >This hack can be upstreamed in no way. If we''re to re-introduce > >zero-copy TX, we would need to implement some sort of lazy flushing > >mechanism. I haven''t thought this through. Presumably this mechanism > >would also benefit blk somehow? I''m not sure yet. 
> > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > >mechanism) be useful here? So that we can unify blk and net drivers? > > > >* Changes required to introduce zero-copy TX > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > >not yet upstreamed. > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> >Yes. But I believe there''s been several versions posted. The link you have is not the latest version.> > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > >3. Lazy flushing mechanism or persistent grants: ??? > > I did some test with persistent grants before, it did not show > better performance than grant copy. But I was using the default > params of netperf, and not tried large packet size. Your results > reminds me that maybe persistent grants would get similar results > with larger packet size too. >"No better performance" -- that''s because both mechanisms are copying? However I presume persistent grant can scale better? From an earlier email last week, I read that copying is done by the guest so that this mechanism scales much better than hypervisor copying in blk''s case. Wei.> Thanks > Annie >
Stefano Stabellini
2013-Jul-01 14:19 UTC
Re: Interesting observation with network event notification and batching
Could you please use plain text emails in the future? On Mon, 1 Jul 2013, annie li wrote:> On 2013-6-29 0:15, Wei Liu wrote: > > Hi all, > > After collecting more stats and comparing copying / mapping cases, I now > have some more interesting finds, which might contradict what I said > before. > > I tuned the runes I used for benchmark to make sure iperf and netperf > generate large packets (~64K). Here are the runes I use: > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > COPY MAP > iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets? > > PPI 2.90 1.07 > SPI 37.75 13.69 > PPN 2.90 1.07 > SPN 37.75 13.69 > tx_count 31808 174769 > > > Seems interrupt count does not affect the performance at all with -l 131072 -w 128k. > > nr_napi_schedule 31805 174697 > total_packets 92354 187408 > total_reqs 1200793 2392614 > > netperf Tput: 5.8Gb/s 10.5Gb/s > PPI 2.13 1.00 > SPI 36.70 16.73 > PPN 2.13 1.31 > SPN 36.70 16.75 > tx_count 57635 205599 > nr_napi_schedule 57633 205311 > total_packets 122800 270254 > total_reqs 2115068 3439751 > > PPI: packets processed per interrupt > SPI: slots processed per interrupt > PPN: packets processed per napi schedule > SPN: slots processed per napi schedule > tx_count: interrupt count > total_reqs: total slots used during test > > * Notification and batching > > Is notification and batching really a problem? I''m not so sure now. My > first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > case was that "in that case netback *must* have better batching" which > turned out not very true -- copying mode makes netback slower, however > the batching gained is not hugh. > > Ideally we still want to batch as much as possible. Possible way > includes playing with the ''weight'' parameter in NAPI. But as the figures > show batching seems not to be very important for throughput, at least > for now. If the NAPI framework and netfront / netback are doing their > jobs as designed we might not need to worry about this now. > > Andrew, do you have any thought on this? You found out that NAPI didn''t > scale well with multi-threaded iperf in DomU, do you have any handle how > that can happen? > > * Thoughts on zero-copy TX > > With this hack we are able to achieve 10Gb/s single stream, which is > good. But, with classic XenoLinux kernel which has zero copy TX we > didn''t able to achieve this. I also developed another zero copy netback > prototype one year ago with Ian''s out-of-tree skb frag destructor patch > series. That prototype couldn''t achieve 10Gb/s either (IIRC the > performance was more or less the same as copying mode, about 6~7Gb/s). > > My hack maps all necessary pages permantently, there is no unmap, we > skip lots of page table manipulation and TLB flushes. So my basic > conclusion is that page table manipulation and TLB flushes do incur > heavy performance penalty. > > This hack can be upstreamed in no way. If we''re to re-introduce > zero-copy TX, we would need to implement some sort of lazy flushing > mechanism. I haven''t thought this through. Presumably this mechanism > would also benefit blk somehow? I''m not sure yet. > > Could persistent mapping (with the to-be-developed reclaim / MRU list > mechanism) be useful here? So that we can unify blk and net drivers? > > * Changes required to introduce zero-copy TX > > 1. 
SKB frag destructor series: to track life cycle of SKB frags. This is > not yet upstreamed. > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > 2. Mechanism to negotiate max slots frontend can use: mapping requires > backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > 3. Lazy flushing mechanism or persistent grants: ??? > > > I did some test with persistent grants before, it did not show better performance than grant copy. But I was using the default > params of netperf, and not tried large packet size. Your results reminds me that maybe persistent grants would get similar > results with larger packet size too. > > Thanks > Annie > > >
Stefano Stabellini
2013-Jul-01 14:29 UTC
Re: Interesting observation with network event notification and batching
On Mon, 1 Jul 2013, Wei Liu wrote:> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: > > > > On 2013-6-29 0:15, Wei Liu wrote: > > >Hi all, > > > > > >After collecting more stats and comparing copying / mapping cases, I now > > >have some more interesting finds, which might contradict what I said > > >before. > > > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > > >generate large packets (~64K). Here are the runes I use: > > > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > > > COPY MAP > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > > So with default iperf setting, copy is about 7.9G, and map is about > > 2.5G? How about the result of netperf without large packets? > > > > First question, yes. > > Second question, 5.8Gb/s. And I believe for the copying scheme without > large packet the throuput is more or less the same. > > > > PPI 2.90 1.07 > > > SPI 37.75 13.69 > > > PPN 2.90 1.07 > > > SPN 37.75 13.69 > > > tx_count 31808 174769 > > > > Seems interrupt count does not affect the performance at all with -l > > 131072 -w 128k. > > > > Right. > > > > nr_napi_schedule 31805 174697 > > > total_packets 92354 187408 > > > total_reqs 1200793 2392614 > > > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > > PPI 2.13 1.00 > > > SPI 36.70 16.73 > > > PPN 2.13 1.31 > > > SPN 36.70 16.75 > > > tx_count 57635 205599 > > > nr_napi_schedule 57633 205311 > > > total_packets 122800 270254 > > > total_reqs 2115068 3439751 > > > > > > PPI: packets processed per interrupt > > > SPI: slots processed per interrupt > > > PPN: packets processed per napi schedule > > > SPN: slots processed per napi schedule > > > tx_count: interrupt count > > > total_reqs: total slots used during test > > > > > >* Notification and batching > > > > > >Is notification and batching really a problem? I''m not so sure now. My > > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > > >case was that "in that case netback *must* have better batching" which > > >turned out not very true -- copying mode makes netback slower, however > > >the batching gained is not hugh. > > > > > >Ideally we still want to batch as much as possible. Possible way > > >includes playing with the ''weight'' parameter in NAPI. But as the figures > > >show batching seems not to be very important for throughput, at least > > >for now. If the NAPI framework and netfront / netback are doing their > > >jobs as designed we might not need to worry about this now. > > > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > > >scale well with multi-threaded iperf in DomU, do you have any handle how > > >that can happen? > > > > > >* Thoughts on zero-copy TX > > > > > >With this hack we are able to achieve 10Gb/s single stream, which is > > >good. But, with classic XenoLinux kernel which has zero copy TX we > > >didn''t able to achieve this. I also developed another zero copy netback > > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > > >performance was more or less the same as copying mode, about 6~7Gb/s). > > > > > >My hack maps all necessary pages permantently, there is no unmap, we > > >skip lots of page table manipulation and TLB flushes. So my basic > > >conclusion is that page table manipulation and TLB flushes do incur > > >heavy performance penalty. > > > > > >This hack can be upstreamed in no way. 
If we''re to re-introduce > > >zero-copy TX, we would need to implement some sort of lazy flushing > > >mechanism. I haven''t thought this through. Presumably this mechanism > > >would also benefit blk somehow? I''m not sure yet. > > > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > > >mechanism) be useful here? So that we can unify blk and net drivers? > > > > > >* Changes required to introduce zero-copy TX > > > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > > >not yet upstreamed. > > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > > > > Yes. But I believe there''s been several versions posted. The link you > have is not the latest version. > > > > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > > > >3. Lazy flushing mechanism or persistent grants: ??? > > > > I did some test with persistent grants before, it did not show > > better performance than grant copy. But I was using the default > > params of netperf, and not tried large packet size. Your results > > reminds me that maybe persistent grants would get similar results > > with larger packet size too. > > > > "No better performance" -- that''s because both mechanisms are copying? > However I presume persistent grant can scale better? From an earlier > email last week, I read that copying is done by the guest so that this > mechanism scales much better than hypervisor copying in blk''s case. Yes, I always expected persistent grants to be faster than gnttab_copy but I was very surprised by the difference in performance: http://marc.info/?l=xen-devel&m=137234605929944 I think it''s worth trying persistent grants on PV network, although it''s very unlikely that they are going to improve the throughput by 5 Gb/s. Also once we have both PV block and network using persistent grants, we might run into the grant table limit, see this email: http://marc.info/?l=xen-devel&m=137183474618974
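To put the grant table concern in perspective, here is a back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (8-byte v1 grant entries, 4 KiB grant frames, a default of 32 grant frames per domain, single-page rings with one persistently granted page per slot); none of the figures are taken from this thread or the linked emails:

GRANT_ENTRY_SIZE = 8                              # bytes, grant_entry_v1 (assumed)
ENTRIES_PER_FRAME = 4096 // GRANT_ENTRY_SIZE      # 512 entries per 4 KiB frame
DEFAULT_GRANT_FRAMES = 32                         # assumed hypervisor default
total_grants = ENTRIES_PER_FRAME * DEFAULT_GRANT_FRAMES   # 16384 per domain

# Assumed cost per device if every ring slot keeps its page granted for good:
net_grants_per_vif = 256 + 256                    # single-page TX + RX rings
blk_grants_per_vbd = 32 * 11                      # 32 in-flight reqs x 11 segments

vifs, vbds = 4, 4                                 # hypothetical guest
used = vifs * net_grants_per_vif + vbds * blk_grants_per_vbd
print(f"{used} of {total_grants} grant entries pinned persistently "
      f"({100.0 * used / total_grants:.1f}%)")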
Wei Liu
2013-Jul-01 14:39 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:> On Mon, 1 Jul 2013, Wei Liu wrote: > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: > > > > > > On 2013-6-29 0:15, Wei Liu wrote: > > > >Hi all, > > > > > > > >After collecting more stats and comparing copying / mapping cases, I now > > > >have some more interesting finds, which might contradict what I said > > > >before. > > > > > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > > > >generate large packets (~64K). Here are the runes I use: > > > > > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > > > > > COPY MAP > > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > > > > So with default iperf setting, copy is about 7.9G, and map is about > > > 2.5G? How about the result of netperf without large packets? > > > > > > > First question, yes. > > > > Second question, 5.8Gb/s. And I believe for the copying scheme without > > large packet the throuput is more or less the same. > > > > > > PPI 2.90 1.07 > > > > SPI 37.75 13.69 > > > > PPN 2.90 1.07 > > > > SPN 37.75 13.69 > > > > tx_count 31808 174769 > > > > > > Seems interrupt count does not affect the performance at all with -l > > > 131072 -w 128k. > > > > > > > Right. > > > > > > nr_napi_schedule 31805 174697 > > > > total_packets 92354 187408 > > > > total_reqs 1200793 2392614 > > > > > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > > > PPI 2.13 1.00 > > > > SPI 36.70 16.73 > > > > PPN 2.13 1.31 > > > > SPN 36.70 16.75 > > > > tx_count 57635 205599 > > > > nr_napi_schedule 57633 205311 > > > > total_packets 122800 270254 > > > > total_reqs 2115068 3439751 > > > > > > > > PPI: packets processed per interrupt > > > > SPI: slots processed per interrupt > > > > PPN: packets processed per napi schedule > > > > SPN: slots processed per napi schedule > > > > tx_count: interrupt count > > > > total_reqs: total slots used during test > > > > > > > >* Notification and batching > > > > > > > >Is notification and batching really a problem? I''m not so sure now. My > > > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > > > >case was that "in that case netback *must* have better batching" which > > > >turned out not very true -- copying mode makes netback slower, however > > > >the batching gained is not hugh. > > > > > > > >Ideally we still want to batch as much as possible. Possible way > > > >includes playing with the ''weight'' parameter in NAPI. But as the figures > > > >show batching seems not to be very important for throughput, at least > > > >for now. If the NAPI framework and netfront / netback are doing their > > > >jobs as designed we might not need to worry about this now. > > > > > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > > > >scale well with multi-threaded iperf in DomU, do you have any handle how > > > >that can happen? > > > > > > > >* Thoughts on zero-copy TX > > > > > > > >With this hack we are able to achieve 10Gb/s single stream, which is > > > >good. But, with classic XenoLinux kernel which has zero copy TX we > > > >didn''t able to achieve this. I also developed another zero copy netback > > > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > > > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > > > >performance was more or less the same as copying mode, about 6~7Gb/s). 
> > > > > > > >My hack maps all necessary pages permantently, there is no unmap, we > > > >skip lots of page table manipulation and TLB flushes. So my basic > > > >conclusion is that page table manipulation and TLB flushes do incur > > > >heavy performance penalty. > > > > > > > >This hack can be upstreamed in no way. If we''re to re-introduce > > > >zero-copy TX, we would need to implement some sort of lazy flushing > > > >mechanism. I haven''t thought this through. Presumably this mechanism > > > >would also benefit blk somehow? I''m not sure yet. > > > > > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > > > >mechanism) be useful here? So that we can unify blk and net drivers? > > > > > > > >* Changes required to introduce zero-copy TX > > > > > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > > > >not yet upstreamed. > > > > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > > > > > > > Yes. But I believe there''s been several versions posted. The link you > > have is not the latest version. > > > > > > > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > > > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > > > > > >3. Lazy flushing mechanism or persistent grants: ??? > > > > > > I did some test with persistent grants before, it did not show > > > better performance than grant copy. But I was using the default > > > params of netperf, and not tried large packet size. Your results > > > reminds me that maybe persistent grants would get similar results > > > with larger packet size too. > > > > > > > "No better performance" -- that''s because both mechanisms are copying? > > However I presume persistent grant can scale better? From an earlier > > email last week, I read that copying is done by the guest so that this > > mechanism scales much better than hypervisor copying in blk''s case. > > Yes, I always expected persistent grants to be faster then > gnttab_copy but I was very surprised by the difference in performances: > > http://marc.info/?l=xen-devel&m=137234605929944 > > I think it''s worth trying persistent grants on PV network, although it''s > very unlikely that they are going to improve the throughput by 5 Gb/s. >I think it can improve aggregated throughput, however its not likely to improve single stream throughput.> Also once we have both PV block and network using persistent grants, > we might incur the grant table limit, see this email: > > http://marc.info/?l=xen-devel&m=137183474618974Yes, indeed.
Stefano Stabellini
2013-Jul-01 14:54 UTC
Re: Interesting observation with network event notification and batching
On Mon, 1 Jul 2013, Wei Liu wrote:> On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote: > > On Mon, 1 Jul 2013, Wei Liu wrote: > > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: > > > > > > > > On 2013-6-29 0:15, Wei Liu wrote: > > > > >Hi all, > > > > > > > > > >After collecting more stats and comparing copying / mapping cases, I now > > > > >have some more interesting finds, which might contradict what I said > > > > >before. > > > > > > > > > >I tuned the runes I used for benchmark to make sure iperf and netperf > > > > >generate large packets (~64K). Here are the runes I use: > > > > > > > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) > > > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 > > > > > > > > > > COPY MAP > > > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) > > > > > > > > So with default iperf setting, copy is about 7.9G, and map is about > > > > 2.5G? How about the result of netperf without large packets? > > > > > > > > > > First question, yes. > > > > > > Second question, 5.8Gb/s. And I believe for the copying scheme without > > > large packet the throuput is more or less the same. > > > > > > > > PPI 2.90 1.07 > > > > > SPI 37.75 13.69 > > > > > PPN 2.90 1.07 > > > > > SPN 37.75 13.69 > > > > > tx_count 31808 174769 > > > > > > > > Seems interrupt count does not affect the performance at all with -l > > > > 131072 -w 128k. > > > > > > > > > > Right. > > > > > > > > nr_napi_schedule 31805 174697 > > > > > total_packets 92354 187408 > > > > > total_reqs 1200793 2392614 > > > > > > > > > >netperf Tput: 5.8Gb/s 10.5Gb/s > > > > > PPI 2.13 1.00 > > > > > SPI 36.70 16.73 > > > > > PPN 2.13 1.31 > > > > > SPN 36.70 16.75 > > > > > tx_count 57635 205599 > > > > > nr_napi_schedule 57633 205311 > > > > > total_packets 122800 270254 > > > > > total_reqs 2115068 3439751 > > > > > > > > > > PPI: packets processed per interrupt > > > > > SPI: slots processed per interrupt > > > > > PPN: packets processed per napi schedule > > > > > SPN: slots processed per napi schedule > > > > > tx_count: interrupt count > > > > > total_reqs: total slots used during test > > > > > > > > > >* Notification and batching > > > > > > > > > >Is notification and batching really a problem? I''m not so sure now. My > > > > >first thought when I didn''t measure PPI / PPN / SPI / SPN in copying > > > > >case was that "in that case netback *must* have better batching" which > > > > >turned out not very true -- copying mode makes netback slower, however > > > > >the batching gained is not hugh. > > > > > > > > > >Ideally we still want to batch as much as possible. Possible way > > > > >includes playing with the ''weight'' parameter in NAPI. But as the figures > > > > >show batching seems not to be very important for throughput, at least > > > > >for now. If the NAPI framework and netfront / netback are doing their > > > > >jobs as designed we might not need to worry about this now. > > > > > > > > > >Andrew, do you have any thought on this? You found out that NAPI didn''t > > > > >scale well with multi-threaded iperf in DomU, do you have any handle how > > > > >that can happen? > > > > > > > > > >* Thoughts on zero-copy TX > > > > > > > > > >With this hack we are able to achieve 10Gb/s single stream, which is > > > > >good. But, with classic XenoLinux kernel which has zero copy TX we > > > > >didn''t able to achieve this. 
I also developed another zero copy netback > > > > >prototype one year ago with Ian''s out-of-tree skb frag destructor patch > > > > >series. That prototype couldn''t achieve 10Gb/s either (IIRC the > > > > >performance was more or less the same as copying mode, about 6~7Gb/s). > > > > > > > > > >My hack maps all necessary pages permantently, there is no unmap, we > > > > >skip lots of page table manipulation and TLB flushes. So my basic > > > > >conclusion is that page table manipulation and TLB flushes do incur > > > > >heavy performance penalty. > > > > > > > > > >This hack can be upstreamed in no way. If we''re to re-introduce > > > > >zero-copy TX, we would need to implement some sort of lazy flushing > > > > >mechanism. I haven''t thought this through. Presumably this mechanism > > > > >would also benefit blk somehow? I''m not sure yet. > > > > > > > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list > > > > >mechanism) be useful here? So that we can unify blk and net drivers? > > > > > > > > > >* Changes required to introduce zero-copy TX > > > > > > > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is > > > > >not yet upstreamed. > > > > > > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > > > > > > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > > > > > > > > > > Yes. But I believe there''s been several versions posted. The link you > > > have is not the latest version. > > > > > > > > > > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires > > > > >backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > > > > > > > > > >3. Lazy flushing mechanism or persistent grants: ??? > > > > > > > > I did some test with persistent grants before, it did not show > > > > better performance than grant copy. But I was using the default > > > > params of netperf, and not tried large packet size. Your results > > > > reminds me that maybe persistent grants would get similar results > > > > with larger packet size too. > > > > > > > > > > "No better performance" -- that''s because both mechanisms are copying? > > > However I presume persistent grant can scale better? From an earlier > > > email last week, I read that copying is done by the guest so that this > > > mechanism scales much better than hypervisor copying in blk''s case. > > > > Yes, I always expected persistent grants to be faster then > > gnttab_copy but I was very surprised by the difference in performances: > > > > http://marc.info/?l=xen-devel&m=137234605929944 > > > > I think it''s worth trying persistent grants on PV network, although it''s > > very unlikely that they are going to improve the throughput by 5 Gb/s. > > > > I think it can improve aggregated throughput, however its not likely to > improve single stream throughput.you are probably right
annie li
2013-Jul-01 15:59 UTC
Re: Interesting observation with network event notification and batching
On 2013-7-1 16:54, Wei Liu wrote:> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote: >> On 2013-6-29 0:15, Wei Liu wrote: >>> Hi all, >>> >>> After collecting more stats and comparing copying / mapping cases, I now >>> have some more interesting finds, which might contradict what I said >>> before. >>> >>> I tuned the runes I used for benchmark to make sure iperf and netperf >>> generate large packets (~64K). Here are the runes I use: >>> >>> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) >>> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 >>> >>> COPY MAP >>> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) >> So with default iperf setting, copy is about 7.9G, and map is about >> 2.5G? How about the result of netperf without large packets? >> > First question, yes. > > Second question, 5.8Gb/s. And I believe for the copying scheme without > large packet the throuput is more or less the same. > >>> PPI 2.90 1.07 >>> SPI 37.75 13.69 >>> PPN 2.90 1.07 >>> SPN 37.75 13.69 >>> tx_count 31808 174769 >> Seems interrupt count does not affect the performance at all with -l >> 131072 -w 128k. >> > Right. > >>> nr_napi_schedule 31805 174697 >>> total_packets 92354 187408 >>> total_reqs 1200793 2392614 >>> >>> netperf Tput: 5.8Gb/s 10.5Gb/s >>> PPI 2.13 1.00 >>> SPI 36.70 16.73 >>> PPN 2.13 1.31 >>> SPN 36.70 16.75 >>> tx_count 57635 205599 >>> nr_napi_schedule 57633 205311 >>> total_packets 122800 270254 >>> total_reqs 2115068 3439751 >>> >>> PPI: packets processed per interrupt >>> SPI: slots processed per interrupt >>> PPN: packets processed per napi schedule >>> SPN: slots processed per napi schedule >>> tx_count: interrupt count >>> total_reqs: total slots used during test >>> >>> * Notification and batching >>> >>> Is notification and batching really a problem? I''m not so sure now. My >>> first thought when I didn''t measure PPI / PPN / SPI / SPN in copying >>> case was that "in that case netback *must* have better batching" which >>> turned out not very true -- copying mode makes netback slower, however >>> the batching gained is not hugh. >>> >>> Ideally we still want to batch as much as possible. Possible way >>> includes playing with the ''weight'' parameter in NAPI. But as the figures >>> show batching seems not to be very important for throughput, at least >>> for now. If the NAPI framework and netfront / netback are doing their >>> jobs as designed we might not need to worry about this now. >>> >>> Andrew, do you have any thought on this? You found out that NAPI didn''t >>> scale well with multi-threaded iperf in DomU, do you have any handle how >>> that can happen? >>> >>> * Thoughts on zero-copy TX >>> >>> With this hack we are able to achieve 10Gb/s single stream, which is >>> good. But, with classic XenoLinux kernel which has zero copy TX we >>> didn''t able to achieve this. I also developed another zero copy netback >>> prototype one year ago with Ian''s out-of-tree skb frag destructor patch >>> series. That prototype couldn''t achieve 10Gb/s either (IIRC the >>> performance was more or less the same as copying mode, about 6~7Gb/s). >>> >>> My hack maps all necessary pages permantently, there is no unmap, we >>> skip lots of page table manipulation and TLB flushes. So my basic >>> conclusion is that page table manipulation and TLB flushes do incur >>> heavy performance penalty. >>> >>> This hack can be upstreamed in no way. If we''re to re-introduce >>> zero-copy TX, we would need to implement some sort of lazy flushing >>> mechanism. 
I haven''t thought this through. Presumably this mechanism >>> would also benefit blk somehow? I''m not sure yet. >>> >>> Could persistent mapping (with the to-be-developed reclaim / MRU list >>> mechanism) be useful here? So that we can unify blk and net drivers? >>> >>> * Changes required to introduce zero-copy TX >>> >>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is >>> not yet upstreamed. >> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? >> >> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> >> > Yes. But I believe there''s been several versions posted. The link you > have is not the latest version. > >>> 2. Mechanism to negotiate max slots frontend can use: mapping requires >>> backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. >>> >>> 3. Lazy flushing mechanism or persistent grants: ??? >> I did some test with persistent grants before, it did not show >> better performance than grant copy. But I was using the default >> params of netperf, and not tried large packet size. Your results >> reminds me that maybe persistent grants would get similar results >> with larger packet size too. >> > "No better performance" -- that''s because both mechanisms are copying? > However I presume persistent grant can scale better? From an earlier > email last week, I read that copying is done by the guest so that this > mechanism scales much better than hypervisor copying in blk''s case. The original persistent grant patch does memcpy on both the netback and netfront sides. I am thinking the performance might improve if the memcpy is removed from netfront. Moreover, I also have a feeling that our persistent grant results were based on tests with default netperf params, just like Wei''s hack, which does not show better performance without large packets. So let me try some tests with large packets. Thanks Annie
annie li
2013-Jul-01 15:59 UTC
Re: Interesting observation with network event notification and batching
On 2013-7-1 22:19, Stefano Stabellini wrote:> Could you please use plain text emails in the future?Sure, sorry about that. Thanks Annie> > On Mon, 1 Jul 2013, annie li wrote: >> On 2013-6-29 0:15, Wei Liu wrote: >> >> Hi all, >> >> After collecting more stats and comparing copying / mapping cases, I now >> have some more interesting finds, which might contradict what I said >> before. >> >> I tuned the runes I used for benchmark to make sure iperf and netperf >> generate large packets (~64K). Here are the runes I use: >> >> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note) >> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072 >> >> COPY MAP >> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s) >> >> >> So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets? >> >> PPI 2.90 1.07 >> SPI 37.75 13.69 >> PPN 2.90 1.07 >> SPN 37.75 13.69 >> tx_count 31808 174769 >> >> >> Seems interrupt count does not affect the performance at all with -l 131072 -w 128k. >> >> nr_napi_schedule 31805 174697 >> total_packets 92354 187408 >> total_reqs 1200793 2392614 >> >> netperf Tput: 5.8Gb/s 10.5Gb/s >> PPI 2.13 1.00 >> SPI 36.70 16.73 >> PPN 2.13 1.31 >> SPN 36.70 16.75 >> tx_count 57635 205599 >> nr_napi_schedule 57633 205311 >> total_packets 122800 270254 >> total_reqs 2115068 3439751 >> >> PPI: packets processed per interrupt >> SPI: slots processed per interrupt >> PPN: packets processed per napi schedule >> SPN: slots processed per napi schedule >> tx_count: interrupt count >> total_reqs: total slots used during test >> >> * Notification and batching >> >> Is notification and batching really a problem? I''m not so sure now. My >> first thought when I didn''t measure PPI / PPN / SPI / SPN in copying >> case was that "in that case netback *must* have better batching" which >> turned out not very true -- copying mode makes netback slower, however >> the batching gained is not hugh. >> >> Ideally we still want to batch as much as possible. Possible way >> includes playing with the ''weight'' parameter in NAPI. But as the figures >> show batching seems not to be very important for throughput, at least >> for now. If the NAPI framework and netfront / netback are doing their >> jobs as designed we might not need to worry about this now. >> >> Andrew, do you have any thought on this? You found out that NAPI didn''t >> scale well with multi-threaded iperf in DomU, do you have any handle how >> that can happen? >> >> * Thoughts on zero-copy TX >> >> With this hack we are able to achieve 10Gb/s single stream, which is >> good. But, with classic XenoLinux kernel which has zero copy TX we >> didn''t able to achieve this. I also developed another zero copy netback >> prototype one year ago with Ian''s out-of-tree skb frag destructor patch >> series. That prototype couldn''t achieve 10Gb/s either (IIRC the >> performance was more or less the same as copying mode, about 6~7Gb/s). >> >> My hack maps all necessary pages permantently, there is no unmap, we >> skip lots of page table manipulation and TLB flushes. So my basic >> conclusion is that page table manipulation and TLB flushes do incur >> heavy performance penalty. >> >> This hack can be upstreamed in no way. If we''re to re-introduce >> zero-copy TX, we would need to implement some sort of lazy flushing >> mechanism. I haven''t thought this through. Presumably this mechanism >> would also benefit blk somehow? I''m not sure yet. 
>> >> Could persistent mapping (with the to-be-developed reclaim / MRU list >> mechanism) be useful here? So that we can unify blk and net drivers? >> >> * Changes required to introduce zero-copy TX >> >> 1. SKB frag destructor series: to track life cycle of SKB frags. This is >> not yet upstreamed. >> >> >> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? >> >> >> 2. Mechanism to negotiate max slots frontend can use: mapping requires >> backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. >> >> 3. Lazy flushing mechanism or persistent grants: ??? >> >> >> I did some test with persistent grants before, it did not show better performance than grant copy. But I was using the default >> params of netperf, and not tried large packet size. Your results reminds me that maybe persistent grants would get similar >> results with larger packet size too. >> >> Thanks >> Annie >> >> >>
Wei Liu
2013-Jul-01 16:06 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote: [...]> >>>1. SKB frag destructor series: to track life cycle of SKB frags. This is > >>>not yet upstreamed. > >>Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? > >> > >><http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> > >> > >Yes. But I believe there''s been several versions posted. The link you > >have is not the latest version. > > > >>>2. Mechanism to negotiate max slots frontend can use: mapping requires > >>>backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. > >>> > >>>3. Lazy flushing mechanism or persistent grants: ??? > >>I did some test with persistent grants before, it did not show > >>better performance than grant copy. But I was using the default > >>params of netperf, and not tried large packet size. Your results > >>reminds me that maybe persistent grants would get similar results > >>with larger packet size too. > >> > >"No better performance" -- that''s because both mechanisms are copying? > >However I presume persistent grant can scale better? From an earlier > >email last week, I read that copying is done by the guest so that this > >mechanism scales much better than hypervisor copying in blk''s case. > > The original persistent patch does memcpy in both netback and > netfront side. I am thinking maybe the performance can become better > if removing the memcpy from netfront. I would say that removing copy in netback can scale better.> Moreover, I also have a feeling that we got persistent grant > performance based on default netperf params test, just like wei''s > hack which does not get better performance without large packets. So > let me try some test with large packets though. >Sadly enough, I found out today that this sort of test seems to be quite inconsistent. On an Intel 10G NIC the throughput is actually higher without forcing iperf / netperf to generate large packets. Wei.> Thanks > Annie
Andrew Bennieston
2013-Jul-01 16:53 UTC
Re: Interesting observation with network event notification and batching
On 01/07/13 17:06, Wei Liu wrote:> On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote: > [...] >>>>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is >>>>> not yet upstreamed. >>>> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html? >>>> >>>> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> >>>> >>> Yes. But I believe there''s been several versions posted. The link you >>> have is not the latest version. >>> >>>>> 2. Mechanism to negotiate max slots frontend can use: mapping requires >>>>> backend''s MAX_SKB_FRAGS >= frontend''s MAX_SKB_FRAGS. >>>>> >>>>> 3. Lazy flushing mechanism or persistent grants: ??? >>>> I did some test with persistent grants before, it did not show >>>> better performance than grant copy. But I was using the default >>>> params of netperf, and not tried large packet size. Your results >>>> reminds me that maybe persistent grants would get similar results >>>> with larger packet size too. >>>> >>> "No better performance" -- that''s because both mechanisms are copying? >>> However I presume persistent grant can scale better? From an earlier >>> email last week, I read that copying is done by the guest so that this >>> mechanism scales much better than hypervisor copying in blk''s case. >> >> The original persistent patch does memcpy in both netback and >> netfront side. I am thinking maybe the performance can become better >> if removing the memcpy from netfront. > > I would say that removing copy in netback can scale better. > >> Moreover, I also have a feeling that we got persistent grant >> performance based on default netperf params test, just like wei''s >> hack which does not get better performance without large packets. So >> let me try some test with large packets though. >> > > Sadly enough, I found out today these sort of test seems to be quite > inconsistent. On a Intel 10G Nic the throughput is actually higher > without enforcing iperf / netperf to generate large packets.When I have made performance measurements using iperf, I found that for a given point in the parameter space (e.g. for a fixed number of guests, interfaces, fixed parameters to iperf, fixed test run duration, etc.) the variation was typically _smaller than_ +/- 1 Gbit/s on a 10G NIC. I notice that your results don''t include any error bars or indication of standard deviation... With this sort of data (or, really, any data) measuring at least 5 times will help to get an idea of the fluctuations present (i.e. a measure of statistical uncertainty) by quoting a mean +/- standard deviation. Having the standard deviation (or other estimator for the uncertainty in the results) allows us to better determine how significant this difference in results really is. For example, is the high throughput you quoted (~ 14 Gbit/s) an upward fluctuation, and the low value (~6) a downward fluctuation? Having a mean and standard deviation would allow us to determine just how (in)compatible these values are. Assuming a Gaussian distribution (and when sampled sufficient times, "everything" tends to a Gaussian) you have an almost 5% chance that a result lies more than 2 standard deviations from the mean (and a 0.3% chance that it lies more than 3 s.d. from the mean!). Results that appear "high" or "low" may, therefore, not be entirely unexpected. 
Having a measure of the standard deviation provides some basis against which to determine how likely it is that a measured value is just statistical fluctuation, or whether it is a significant result. Another thing I noticed is that you''re running the iperf test for only 5 seconds. I have found in the past that iperf (or, more likely, TCP) takes a while to "ramp up" (even with all parameters fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes or more (e.g. "-t 120") give much more stable results. Andrew.> > > Wei. > >> Thanks >> Annie
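A minimal sketch of the summary Andrew is describing: quote each configuration as mean +/- sample standard deviation, express an individual run as a distance in standard deviations from that mean, and check the Gaussian tail probabilities mentioned above. The five throughput values are hypothetical, not measurements from this thread:

from math import erfc, sqrt
from statistics import mean, stdev

# Hypothetical five repeats of a single test configuration (not real data):
runs_gbps = [13.9, 14.3, 14.1, 14.4, 14.2]
m, s = mean(runs_gbps), stdev(runs_gbps)
print(f"throughput: {m:.2f} +/- {s:.2f} Gb/s (mean +/- sample s.d.)")

# How surprising would a single 13.5 Gb/s run be under this distribution?
z = abs(13.5 - m) / s
print(f"a 13.5 Gb/s run lies {z:.1f} s.d. from the mean")

# The Gaussian tail probabilities quoted above:
for k in (2, 3):
    print(f"P(|Z| > {k} s.d.) = {erfc(k / sqrt(2)):.4f}")   # ~0.0455 and ~0.0027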
Wei Liu
2013-Jul-01 17:55 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote: [...]> > > >Sadly enough, I found out today these sort of test seems to be quite > >inconsistent. On a Intel 10G Nic the throughput is actually higher > >without enforcing iperf / netperf to generate large packets. > > When I have made performance measurements using iperf, I found that > for a given point in the parameter space (e.g. for a fixed number of > guests, interfaces, fixed parameters to iperf, fixed test run > duration, etc.) the variation was typically _smaller than_ +/- 1 > Gbit/s on a 10G NIC. >I was talking about virtual interfaces vs. real hardware. The parameters that maximize throughput for one case don''t seem to work for the other. The deviation for a specific interface is rather small.> I notice that your results don''t include any error bars or > indication of standard deviation... > > With this sort of data (or, really, any data) measuring at least 5 > times will help to get an idea of the fluctuations present (i.e. a > measure of statistical uncertainty) by quoting a mean +/- standard > deviation. Having the standard deviation (or other estimator for the > uncertainty in the results) allows us to better determine how > significant this difference in results really is. > > For example, is the high throughput you quoted (~ 14 Gbit/s) an > upward fluctuation, and the low value (~6) a downward fluctuation? > Having a mean and standard deviation would allow us to determine > just how (in)compatible these values are. >I ran those tests several times and picked the number that appeared most often. Anyway, I will try to come up with better visualized graphs.> Assuming a Gaussian distribution (and when sampled sufficient times, > "everything" tends to a Gaussian) you have an almost 5% chance that > a result lies more than 2 standard deviations from the mean (and a > 0.3% chance that it lies more than 3 s.d. from the mean!). Results > that appear "high" or "low" may, therefore, not be entirely > unexpected. Having a measure of the standard deviation provides some > basis against which to determine how likely it is that a measured > value is just statistical fluctuation, or whether it is a > significant result. > > Another thing I noticed is that you''re running the iperf test for > only 5 seconds. I have found in the past that iperf (or, more > likely, TCP) takes a while to "ramp up" (even with all parameters > fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes > or more (e.g. "-t 120") give much more stable results. >Hmm... for me the length of the test doesn''t make much difference, that''s why I''ve chosen such a short time. Since you mention it, I intend to run the tests a bit longer.> Andrew. > > > > > >Wei. > > > >>Thanks > >>Annie
Wei Liu
2013-Jul-03 15:18 UTC
Re: Interesting observation with network event notification and batching
On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote: [...]> >I would say that removing copy in netback can scale better. > > > >>Moreover, I also have a feeling that we got persistent grant > >>performance based on default netperf params test, just like wei''s > >>hack which does not get better performance without large packets. So > >>let me try some test with large packets though. > >> > > > >Sadly enough, I found out today these sort of test seems to be quite > >inconsistent. On a Intel 10G Nic the throughput is actually higher > >without enforcing iperf / netperf to generate large packets. > > When I have made performance measurements using iperf, I found that > for a given point in the parameter space (e.g. for a fixed number of > guests, interfaces, fixed parameters to iperf, fixed test run > duration, etc.) the variation was typically _smaller than_ +/- 1 > Gbit/s on a 10G NIC. > > I notice that your results don''t include any error bars or > indication of standard deviation... > > With this sort of data (or, really, any data) measuring at least 5 > times will help to get an idea of the fluctuations present (i.e. a > measure of statistical uncertainty) by quoting a mean +/- standard > deviation. Having the standard deviation (or other estimator for the > uncertainty in the results) allows us to better determine how > significant this difference in results really is. > > For example, is the high throughput you quoted (~ 14 Gbit/s) an > upward fluctuation, and the low value (~6) a downward fluctuation? > Having a mean and standard deviation would allow us to determine > just how (in)compatible these values are. > > Assuming a Gaussian distribution (and when sampled sufficient times, > "everything" tends to a Gaussian) you have an almost 5% chance that > a result lies more than 2 standard deviations from the mean (and a > 0.3% chance that it lies more than 3 s.d. from the mean!). Results > that appear "high" or "low" may, therefore, not be entirely > unexpected. Having a measure of the standard deviation provides some > basis against which to determine how likely it is that a measured > value is just statistical fluctuation, or whether it is a > significant result. > > Another thing I noticed is that you''re running the iperf test for > only 5 seconds. I have found in the past that iperf (or, more > likely, TCP) takes a while to "ramp up" (even with all parameters > fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes > or more (e.g. "-t 120") give much more stable results. > > Andrew. >Here you go, results for the new conducted benchmarks. Was about to do graph but found out not really worth it because it''s only single stream. For iperf tests unit is Gb/s, for netperf tests unit is Mb/s. COPY SCHEME iperf -c 10.80.237.127 -t 120 6.19 6.23 6.26 6.25 6.27 mean 6.24 s.d. 0.031622776601759 iperf -c 10.80.237.127 -t 120 -l 131072 6.07 6.07 6.03 6.06 6.06 mean 6.058 s.d. 0.016431676725514 netperf -H 10.80.237.127 -l120 -f m 5662.55 5636.6 5641.52 5631.39 5630.98 mean 5640.608 s.d. 13.0001642297036 netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072 5831.19 5833.03 5829.54 5838.89 5830.5 mean 5832.63 s.d. 3.72512415992628 PERMANENT MAP SCHEME "iperf -c 10.80.237.127 -t 120 2.42 2.41 2.41 2.42 2.43 mean 2.418 s.d. 0.00836660026531 iperf -c 10.80.237.127 -t 120 -l 131072 14.3 14.2 14.2 14.4 14.3 mean 14.28 s.d. 0.083666002653234 netperf -H 10.80.237.127 -l120 -f m 4632.27 4630.08 4633.18 4641.25 4632.23 mean 4633.802 s.d. 
4.31656924013371 netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072 10556.04 10532.89 10541.83 10552.77 10546.77 mean 10546.06 s.d. 9.17156475133789 Short run of iperf / netperf was conducted before each test run so that the system was "warmed-up". The results show that the single stream performance is quite stable. Also there''s not much difference between running tests for 5s or 120s. Wei.> > > > > >Wei. > > > >>Thanks > >>Annie
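For what it is worth, the mean / s.d. figures above can be reproduced with the sample standard deviation (n - 1 in the denominator); for example, taking the copy-scheme default iperf runs and the map-scheme iperf -l 131072 runs:

from statistics import mean, stdev

copy_iperf_default = [6.19, 6.23, 6.26, 6.25, 6.27]   # Gb/s, copy scheme, default iperf
map_iperf_128k = [14.3, 14.2, 14.2, 14.4, 14.3]       # Gb/s, map scheme, -l 131072

for name, runs in (("COPY iperf", copy_iperf_default),
                   ("MAP iperf -l 131072", map_iperf_128k)):
    print(f"{name}: mean {mean(runs):.3f}  s.d. {stdev(runs):.4f}")
# -> COPY iperf: mean 6.240  s.d. 0.0316
# -> MAP iperf -l 131072: mean 14.280  s.d. 0.0837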