As I'm new to qemu/kvm, to figure out how networking performance can be improved, I
went over the code and took some notes. As I did this, I tried to record ideas
from recent discussions and ideas that came up on improving performance. Thus
this list.

This includes a partial overview of networking code in a virtual environment, with
focus on performance: I'm only interested in sending and receiving packets,
ignoring configuration etc.

I have likely missed a ton of clever ideas and older discussions, and probably
misunderstood some code. Please pipe up with corrections, additions, etc. And
please don't take offence if I didn't attribute an idea correctly - most of
them are marked mst but I don't claim they are original. Just let me know.

And there are a couple of trivial questions on the code - I'll
add answers here as they become available.

I put up a copy at http://www.linux-kvm.org/page/Networking_Performance as
well, and intend to dump updates there from time to time.

Thanks,
MST

---

There are many ways to set up networking in a virtual machine. Here's one:
linux guest -> virtio-net -> virtio-pci -> qemu+kvm -> tap -> bridge.
Let's take a look at this one. Virtio is the guest side of things.

Guest kernel virtio-net:

TX:
- Guest kernel allocates a packet (skb) in guest kernel memory, fills it in
  with data, and passes it to the networking stack.
- The skb is passed on to the guest network driver (hard_start_xmit).
- skbs in flight are kept in a send queue linked list, so that we can flush
  them when the device is removed
  [ mst: optimization idea: virtqueue already tracks posted buffers.
    Add a flush/purge operation and use that instead? ]
- skb is reformatted to scattergather format
  [ mst: idea to try: this does a copy for the skb head, which might be
    costly especially for small/linear packets. Try to avoid this?
    Might need to tweak the virtio interface. ]
- network driver adds the packet buffer on the TX ring
- network driver does a kick which causes a VM exit
  [ mst: any way to mitigate # of VM exits here?
    Possibly could be done on the host side as well. ]
  [ markmc: All of our efforts there have been on the host side, I think
    that's preferable to trying to do anything on the guest side. ]
- Full queue:
  we keep a single extra skb around:
  if we fail to transmit, we queue it
  [ mst: idea to try: what does it do to performance if we queue more
    packets? ]
  if we already have 1 outstanding packet, we stop the queue
  and discard the new packet
  [ mst: optimization idea: might be better to discard the old packet and
    queue the new one, e.g. with TCP the old one might have timed out
    already ]
  [ markmc: the queue might soon be going away:
    200905292346.04815.rusty at rustcorp.com.au
    http://archive.netbsd.se/?ml=linux-netdev&a=2009-05&m=10788575 ]
- We get each buffer from the host as it is completed and free it
- TX interrupts are only enabled when the queue is stopped, and when it is
  originally created (we disable them on completion)
  [ mst: idea: the second part is probably unintentional.
    todo: we probably should disable interrupts when the device is
    created. ]
- We poll for buffer completions:
  1. Before each TX
  2. On a timer tasklet (unless 3 is supported)
  3. When the host sends us an interrupt telling us that the queue is empty
  [ mst: idea to try: instead of empty, enable send interrupts on xmit when
    the buffer is almost full (e.g. at least half empty): we are running out
    of buffers, it's important to free them ASAP. Can be done from host or
    from guest. ]
  [ Rusty proposing that we don't need (2) or (3) if the skbs are orphaned
    before start_xmit(). See subj "net: skb_orphan on dev_hard_start_xmit". ]
  [ rusty also seems to be suggesting that disabling VIRTIO_F_NOTIFY_ON_EMPTY
    on the host should help the case where the host out-paces the guest ]
  4. When the queue is stopped, or when the first packet was sent after the
     device was created (interrupts are enabled then)
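To make the TX flow above concrete, here is a rough, self-contained sketch of
the reap-completed / post / kick sequence. All names and types are invented
for illustration; this is not the virtio_net driver code.

#include <stdio.h>
#include <stdlib.h>

#define RING_SIZE 256                   /* power of two, like a virtqueue */

struct toy_ring {
    void *data[RING_SIZE];              /* cookie per outstanding buffer   */
    unsigned add_idx;                   /* next slot the guest fills       */
    unsigned used_idx;                  /* next slot the host has finished */
};

static void *get_completed(struct toy_ring *r)
{
    if (r->used_idx == r->add_idx)      /* nothing completed */
        return NULL;
    return r->data[r->used_idx++ % RING_SIZE];
}

static int ring_full(const struct toy_ring *r)
{
    return r->add_idx - r->used_idx == RING_SIZE;
}

static void kick(const struct toy_ring *r)
{
    /* in the real driver this is a pio write that causes a VM exit */
    printf("kick at add_idx=%u\n", r->add_idx);
}

static int toy_start_xmit(struct toy_ring *r, void *skb)
{
    void *done;

    /* 1. reap buffers the host already consumed and free them */
    while ((done = get_completed(r)) != NULL)
        free(done);

    /* 2. if the ring is still full, tell the stack to back off
          (cf. the "keep a single extra skb around" logic above) */
    if (ring_full(r))
        return -1;

    /* 3. post the new buffer and notify the host */
    r->data[r->add_idx++ % RING_SIZE] = skb;
    kick(r);
    return 0;
}

int main(void)
{
    struct toy_ring r = { .add_idx = 0, .used_idx = 0 };

    /* pretend this is an skb; it would be freed by a later reap once the
       host marks it completed */
    return toy_start_xmit(&r, malloc(64)) ? 1 : 0;
}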
RX:
- There are really 2 mostly separate code paths: with mergeable rx buffers
  support in the host and without. I focus on mergeable buffers here since
  this is the default in recent qemu.
  [ mst: optimization idea: mark mergeable_rx_bufs as likely() then? ]
- Each skb has a 128 byte buffer at head and a single page for data.
  Only full pages are passed to virtio buffers.
  [ mst: for large packets, managing the 128 head buffers is wasted effort.
    Try allocating skbs on the rcv path when needed. ]
  [ mst: to clarify the previous suggestion: I am talking about merging
    here. We currently allocate skbs and pages for them. If a packet spans
    multiple pages, we discard the extra skbs. Instead, let's allocate pages
    but not skbs. Allocate and fill skbs on the receive path. ]
  Pages are allocated from our private pool before falling back to
  alloc_page. See below.
- Buffers are replenished after a packet is received, when the number of
  buffers becomes low (below 1/2 max). This serves to reduce the number of
  kicks (VM exits) for RX.
  [ mst: code might become simpler if we add buffers immediately, but don't
    kick until later ]
  [ markmc: possibly. batching this buffer allocation might be introducing
    more unpredictability to benchmarks too - i.e. there isn't a fixed
    per-packet overhead, some packets randomly have a higher overhead ]
  On failure to allocate in atomic context we simply stop and try again on
  the next recv packet.
  [ mst: there's a fixme that this fails if we completely run out of
    buffers, should be handled by a timer. could be a thread as well
    (allocate with GFP_KERNEL). idea: might be good for performance
    anyway. ]
  After adding buffers, we do a kick.
  [ mst: test whether this optimization works: recv kicks should be rare ]
  Outstanding buffers are kept on a recv linked list.
  [ mst: optimization idea: virtqueue already tracks posted buffers.
    Add a flush operation and use that instead. ]
- recv is done with napi: on recv interrupt, disable interrupts, poll until
  the queue is empty, enable when it's empty (see the sketch after this
  section)
  [ mst: test how well this works. should get 1 interrupt per N packets.
    what is N? ]
  [ mst: idea: implement interrupt coalescing? ]
- when a recv packet is polled, the first 128 bytes are copied out, the rest
  is collected in the array of frags. if the packet spans multiple buffers,
  unused skbs are discarded. If the packet is < 128 bytes, the page is added
  to the pool, see below. The packet is then sent up the networking stack.
- we have a pool of pages (LIFO) which are left unused at the tail of the
  buffer for short packets (< 128)
  [ mst: test how common it is for the pool to be nonempty. ]
  [ mst: for short skbs, the new buffer we will allocate and re-add is
    identical to the old one. try just copying the sg over instead of
    re-formatting. ]
  [ mst: try using a circular buffer instead of a linked list for the pool ]
  [ mst: is it a good idea to limit pool size? ]
  [ mst: need to measure: for large messages, the pool might become empty
    fast. replenish it from thread context with GFP_KERNEL pages? ]
  [ mst: some architectures (with expensive unaligned DMA) override
    NET_IP_ALIGN. since we don't really do DMA, we probably should use an
    alignment of 2 always ]
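The NAPI receive pattern mentioned above (interrupt -> disable interrupts ->
poll until empty -> re-enable, then re-check for the race) looks roughly like
this as a standalone toy; the helpers are invented, not the real driver API.

#include <stdbool.h>
#include <stdio.h>

static int  pending = 5;                 /* packets waiting in the ring */
static bool irq_enabled = true;

static bool ring_has_packet(void) { return pending > 0; }
static void consume_packet(void)  { pending--; printf("packet delivered\n"); }

static int rx_poll(int budget)
{
    int done = 0;

    while (done < budget && ring_has_packet()) {
        consume_packet();
        done++;
    }

    if (done < budget) {
        /* ring looked empty: re-enable interrupts, then re-check to close
           the race where a packet arrived just before we re-enabled */
        irq_enabled = true;
        if (ring_has_packet()) {
            irq_enabled = false;
            /* real code would reschedule the poll here */
        }
    }
    return done;
}

static void rx_interrupt(void)
{
    irq_enabled = false;                 /* no more interrupts while polling */
    rx_poll(64);                         /* 64 = a typical NAPI budget */
}

int main(void)
{
    rx_interrupt();                      /* one interrupt drains 5 packets */
    printf("interrupts re-enabled: %s\n", irq_enabled ? "yes" : "no");
    return 0;
}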
Guest kernel virtio-ring:

Adding a buffer: the ring keeps a LIFO free list of ring entries
  [ mst: idea to try: it should be pretty common for entries to complete
    in-order. use a circular buffer to optimize for that case, and fall back
    on the free list if not. ]
  [ mst: question: there's a FIXME to avoid the modulus in the math.
    since num is a power of 2, isn't this just & (num - 1)?
    (see the example below) ]

Polling a buffer: we look at the vq index and use that to find the next
completed buffer; the pointer to the data (skb) is retrieved and returned to
the user
  [ mst: clearing data is only needed for debugging.
    try removing this write - the cache will be cleaner? ]
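On the modulus question: assuming num really is always a power of two, the
masking form is equivalent, as this trivial standalone check illustrates.

#include <assert.h>
#include <stdio.h>

int main(void)
{
    const unsigned num = 256;            /* virtqueue sizes are powers of two */

    for (unsigned i = 0; i < 4 * num; i++)
        assert(i % num == (i & (num - 1)));
    printf("i %% num == (i & (num - 1)) holds for power-of-two num\n");
    return 0;
}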
Guest kernel virtio-pci:

notify (kick): to notify the host of ring activity, we perform a pio write
  [ mst: hypercalls are reported to be slightly cheaper ... ]
interrupt: on interrupt, we invoke the callback for the relevant vq.
  for regular interrupts, we clear the interrupt, and scan the list of vqs
  invoking callbacks
  [ mst: test whether msi-x/msi works better ]

Host qemu:

TX:
We poll for TX packets in 2 ways
- On a timer event (see below)
- When we get a kick from the guest
At this point, we disable further notifications, and start a timer.
Notifications are reenabled after this. This is designed to reduce the
number of VM exits due to TX.
  [ markmc: tried removing the timer. It seems to really help some
    workloads. E.g. on RHEL:
    http://markmc.fedorapeople.org/virtio-netperf/2009-04-15/
    on fedora removing the timer has no major effect either way:
    http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/g-h-tput-04-no-tx-timer.html ]
  [ markmc: had patches moving the "flush tx queue on ring full" into the
    I/O thread.
    http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/g-h-tput-02-flush-in-io-thread.html
    the graph seems to show no effect on performance. ]
  [ mst: it is interesting that to start the timer, we use qemu_get_clock
    which does a system call. ]
  [ mst: test how well this works. We should get one kick per N packets. ]
  [ mst: idea: instead of enabling interrupts after draining the queue, try
    waiting another timer tick ... ]
  [ mst: test whether the queue gets full. It will if the timer is too
    large. If yes we might ask the guest to force notification so we drain
    the queue ASAP. ]
  [ markmc: I actually don't think we hit ring-full often ]
  [ mst: it would be easy to kill the timer in the host and never disable
    interrupts, making all notification decisions in the guest. However
    timers are more costly there. ]
  [ avi: short timers are very expensive in the guest: need an exit to set
    the timer, another to fire, yet another to EOI ]
Packets are polled from the virtio ring, walking the descriptor linked list.
  [ mst: optimize for completing in order? ]
Packet addresses are converted to a guest iovec, using
cpu_physical_memory_map
  [ mst: cpu_physical_memory_map could be optimized to only handle what we
    actually use this for: a single page in RAM ]
With tap, we pass this to the vlan and eventually call writev on the tap
device
  [ mst: if there's a single vlan, as is common, we could optimize the vlan
    scan and pass packets to the final destination directly ]
  [ rusty: the write system call could be optimized out by implementing a
    virtio server in the kernel ]
  [ markmc: vlans just should not be used in the common case ]
An interesting thing to note here is that we don't try to limit the number
of packets outstanding on the tap device, so there's never a "full queue".
  [ mst: with UDP this likely leads to overruns and packet drops.
    what about TCP? ]
  [ markmc: probably just UDP: we hit the TCP window size first. ]
  [ mst: 2.6.30-rcX kernels let us limit the number of packets outstanding
    on tap. Use this? ]
  [ markmc: see my patches (subj: Add generic packet buffering API) on the
    qemu mailing list which will allow us to handle -EAGAIN from tap without
    having to unpop the buffer from the virtio ring or poll()ing the fd for
    each packet. ]
  [ avi: tap could implement an option to send multiple packets with a
    single write ]
Finally, we deliver a queue interrupt if the guest asked for it.

RX:
RX is done from the IO thread. We get a notification that more buffers have
been posted and wake that thread.
  [ mst: test how common this is. should be once per many packets ]
  [ mst: would it be better to extend tap to consume packets on the same CPU
    which got them? ]
When a packet arrives at the network interface, we read it in, and then copy
it over into the virtio buffer.
  [ dlaor: reading directly into the virtio buffer would be a low-hanging
    fruit ]
  [ markmc: anthony had patches to do this a long time ago but they were
    fairly ugly. Should be easier to do when we remove VLANs from the common
    case - i.e. only copy if a VLAN is used ]
  [ rusty: the read system call could be optimized out by implementing a
    virtio server in the kernel ]
While we copy, we implement a workaround for dhclient there.
  [ mst: for zero copy we will need a flag to disable it. or just make the
    header not zero copy? ]
  [ mst: we can implement a kind of interrupt coalescing scheme, where we
    don't send an RX interrupt until we start getting low on RX buffers, or
    until the tap device recv queue is empty ]
  [ avi: tap could implement an option to recv multiple packets with a
    single read ]

Host kernel networking stack with tap and bridge:

This is not specific to virtualization so just some notes:
- There are packet copies from/to userspace in both TX and RX paths.
  [ mst: TX might be addressable with aio and data destructors.
    RX is known to be a hard problem ]
- If the real TX queue packets are sent on is full and is stopped, this fact
  does not propagate to tap and to the user. This will result in more
  packets being sent and lost.
  [ mst: as mentioned above, one way to address this is to limit the number
    of packets outstanding on tap. Note that this might not fully solve the
    problem as the queue could get used by other applications. Is there some
    flow control mechanism in the bridge we could use? ]
- markmc: another thing we need to do is to disable bridge-nf-call-iptables
  by default at the distro level. It defeats the tap send buffer accounting
  and probably hurts performance.
- bridging in the host is unnecessary if we have a dedicated network device.
  it might be interesting to support binding to raw sockets instead of tap.
  [ mst: what is the overhead of bridging? isn't it negligible? ]
  [ mst: for this, it might be interesting to support aio in raw sockets,
    which will make it possible to pre-post RX buffers in raw sockets ]

---

Short term plans: I plan to start out with trying out the following ideas:

save a copy in qemu on the RX side in case of a single nic in the vlan
implement a virtio-host kernel module

*detail on the virtio-host-net kernel module project*

virtio-host-net is a simple character device which gets memory layout
information from qemu, and uses this to convert between virtio descriptors
and skbs. The skbs are then passed to/from a raw socket (or we could bind
virtio-host to a physical device like a raw socket does - TBD).

Interrupts will be reported to eventfd descriptors, and the device will poll
eventfd descriptors to get kicks from the guest.
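For reference, the eventfd(2) signalling model this relies on is very small;
here is a toy standalone illustration of the fd semantics, not the proposed
module itself.

#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int kick = eventfd(0, 0);            /* guest -> host "kick" channel */
    uint64_t one = 1, kicks;

    if (kick < 0)
        return 1;

    /* qemu side: signal that the guest posted new buffers */
    if (write(kick, &one, sizeof(one)) != sizeof(one))
        return 1;

    /* device side: read() returns the number of kicks accumulated since the
       last read and resets the counter (normally driven from poll) */
    if (read(kick, &kicks, sizeof(kicks)) == sizeof(kicks))
        printf("got %llu kick(s)\n", (unsigned long long)kicks);

    close(kick);
    return 0;
}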
-- MST

Michael S. Tsirkin wrote:
> As I'm new to qemu/kvm, to figure out how networking performance can be improved, I
> went over the code and took some notes. As I did this, I tried to record ideas
> from recent discussions and ideas that came up on improving performance. Thus
> this list.
>
> This includes a partial overview of networking code in a virtual environment, with
> focus on performance: I'm only interested in sending and receiving packets,
> ignoring configuration etc.
>
> I have likely missed a ton of clever ideas and older discussions, and probably
> misunderstood some code. Please pipe up with corrections, additions, etc. And
> please don't take offence if I didn't attribute an idea correctly - most of
> them are marked mst but I don't claim they are original. Just let me know.
>
> And there are a couple of trivial questions on the code - I'll
> add answers here as they become available.
>
> I put up a copy at http://www.linux-kvm.org/page/Networking_Performance as
> well, and intend to dump updates there from time to time.

Hi Michael,

Not sure if you have seen this, but I've already started to work on
the code for in-kernel devices and have a (currently non-virtio based)
proof-of-concept network device which you can use for comparative data. You
can find details here:

http://lkml.org/lkml/2009/4/21/408

<snip>

(Will look at your list later, to see if I can add anything)

> ---
>
> Short term plans: I plan to start out with trying out the following ideas:
>
> save a copy in qemu on RX side in case of a single nic in vlan
> implement virtio-host kernel module
>
> *detail on virtio-host-net kernel module project*
>
> virtio-host-net is a simple character device which gets memory layout information
> from qemu, and uses this to convert between virtio descriptors and skbs.
> The skbs are then passed to/from raw socket (or we could bind virtio-host
> to physical device like raw socket does TBD).
>
> Interrupts will be reported to eventfd descriptors, and device will poll
> eventfd descriptors to get kicks from guest.

I currently have a virtio transport for vbus implemented, but it still
needs a virtio-net device-model backend written. If you are interested,
we can work on this together to implement your idea. It's on my "todo"
list for vbus anyway, but I am currently distracted with the
irqfd/iosignalfd projects which are prereqs for vbus to be considered
for merge.

Basically vbus is a framework for declaring in-kernel devices (not kvm
specific, per se) with a full security/containment model, a hot-pluggable
configuration engine, and a dynamically loadable device-model. The framework
takes care of the details of signal-path and memory routing for you so that
something like a virtio-net model can be implemented once and work in a
variety of environments such as kvm, lguest, etc.

Interested?
-Greg
On Thu, Jun 04, 2009 at 01:16:05PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > As I'm new to qemu/kvm, to figure out how networking performance can be improved, I
> > went over the code and took some notes. As I did this, I tried to record ideas
> > from recent discussions and ideas that came up on improving performance. Thus
> > this list.
> >
> > This includes a partial overview of networking code in a virtual environment, with
> > focus on performance: I'm only interested in sending and receiving packets,
> > ignoring configuration etc.
> >
> > I have likely missed a ton of clever ideas and older discussions, and probably
> > misunderstood some code. Please pipe up with corrections, additions, etc. And
> > please don't take offence if I didn't attribute an idea correctly - most of
> > them are marked mst but I don't claim they are original. Just let me know.
> >
> > And there are a couple of trivial questions on the code - I'll
> > add answers here as they become available.
> >
> > I put up a copy at http://www.linux-kvm.org/page/Networking_Performance as
> > well, and intend to dump updates there from time to time.
>
> Hi Michael,
>   Not sure if you have seen this, but I've already started to work on
> the code for in-kernel devices and have a (currently non-virtio based)
> proof-of-concept network device which you can use for comparative data. You
> can find details here:
>
> http://lkml.org/lkml/2009/4/21/408
>
> <snip>

Thanks

> (Will look at your list later, to see if I can add anything)
> > ---
> >
> > Short term plans: I plan to start out with trying out the following ideas:
> >
> > save a copy in qemu on RX side in case of a single nic in vlan
> > implement virtio-host kernel module
> >
> > *detail on virtio-host-net kernel module project*
> >
> > virtio-host-net is a simple character device which gets memory layout information
> > from qemu, and uses this to convert between virtio descriptors and skbs.
> > The skbs are then passed to/from raw socket (or we could bind virtio-host
> > to physical device like raw socket does TBD).
> >
> > Interrupts will be reported to eventfd descriptors, and device will poll
> > eventfd descriptors to get kicks from guest.
>
> I currently have a virtio transport for vbus implemented, but it still
> needs a virtio-net device-model backend written.

You mean a virtio-ring implementation? I intended to basically start by
reusing the code from Documentation/lguest/lguest.c. Isn't this all there
is to it?

> If you are interested,
> we can work on this together to implement your idea. It's on my "todo"
> list for vbus anyway, but I am currently distracted with the
> irqfd/iosignalfd projects which are prereqs for vbus to be considered
> for merge.
>
> Basically vbus is a framework for declaring in-kernel devices (not kvm
> specific, per se) with a full security/containment model, a
> hot-pluggable configuration engine, and a dynamically loadable
> device-model. The framework takes care of the details of signal-path
> and memory routing for you so that something like a virtio-net model can
> be implemented once and work in a variety of environments such as kvm,
> lguest, etc.
>
> Interested?
> -Greg

It seems that a character device with a couple of ioctls would be simpler
for an initial prototype.
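Purely as a strawman, such an interface could look something like the header
sketch below; every name, structure, and ioctl number here is invented for
illustration and is not an existing ABI.

/* hypothetical virtio-host-net control interface - illustration only */
#include <linux/ioctl.h>
#include <stdint.h>

struct virtio_host_mem_region {          /* one slot of the guest memory layout */
        uint64_t guest_phys_addr;
        uint64_t memory_size;
        uint64_t userspace_addr;         /* where qemu mapped it */
};

#define VIRTIO_HOST_SET_MEM_REGION _IOW('v', 0x00, struct virtio_host_mem_region)
#define VIRTIO_HOST_SET_RING_ADDR  _IOW('v', 0x01, uint64_t) /* guest physical address of the vring */
#define VIRTIO_HOST_SET_KICK_FD    _IOW('v', 0x02, int)      /* eventfd qemu signals on guest kicks */
#define VIRTIO_HOST_SET_IRQ_FD     _IOW('v', 0x03, int)      /* eventfd the module signals to inject an interrupt */
#define VIRTIO_HOST_SET_BACKEND_FD _IOW('v', 0x04, int)      /* raw socket (or tap) fd to attach to */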
-- MST

On Fri, 5 Jun 2009 02:13:20 am Michael S. Tsirkin wrote:
> I put up a copy at http://www.linux-kvm.org/page/Networking_Performance as
> well, and intend to dump updates there from time to time.

Hi Michael,

   Sorry for the delay. I'm weaning myself off my virtio work, but
virtio_net performance is an issue which still needs lots of love.

BTW a non-wiki on the wiki? You should probably rename it to
"MST_Networking_Performance" or allow editing :)

> - skbs in flight are kept in send queue linked list,
>   so that we can flush them when device is removed
>   [ mst: optimization idea: virtqueue already tracks
>     posted buffers. Add flush/purge operation and use that instead? ]

Interesting idea, but not really an optimization. (flush_buf() which does a
get_buf() but for unused buffers).

> - skb is reformatted to scattergather format
>   [ mst: idea to try: this does a copy for skb head,
>     which might be costly especially for small/linear packets.
>     Try to avoid this? Might need to tweak virtio interface. ]

There's no copy here that I can see?

> - network driver adds the packet buffer on TX ring
> - network driver does a kick which causes a VM exit
>   [ mst: any way to mitigate # of VM exits here?
>     Possibly could be done on host side as well. ]
>   [ markmc: All of our efforts there have been on the host side, I think
>     that's preferable to trying to do anything on the guest side. ]

The current theoretical hole is that the host suppresses notifications using
the VIRTIO_AVAIL_F_NO_NOTIFY flag, but we can get a number of notifications
in before it gets to that suppression.

You can use a counter to improve this: you only notify when they're equal,
and inc when you notify. That way you suppress further notifications even if
the other side takes ages to wake up.

In practice, this shouldn't be played with until we have full aio (or equiv
in kernel) for the other side: host xmit tends to be too fast at the moment
and we get a notification per packet anyway.
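One possible reading of that counter scheme, as a standalone toy (invented
names; not the eventual virtio implementation):

#include <stdio.h>

static unsigned kicks_sent;   /* incremented by the guest on each kick   */
static unsigned kicks_seen;   /* published by the host when it wakes up  */

static void guest_add_buf_and_maybe_kick(int buf)
{
    /* ...the buffer is placed on the ring here... */
    if (kicks_sent == kicks_seen) {      /* host has consumed our last kick */
        kicks_sent++;
        printf("buf %d added, kick (VM exit)\n", buf);
    } else {
        printf("buf %d added, kick suppressed\n", buf);
    }
}

static void host_wakes_up(void)
{
    kicks_seen = kicks_sent;             /* ack before draining the ring */
}

int main(void)
{
    guest_add_buf_and_maybe_kick(1);     /* kicks */
    guest_add_buf_and_maybe_kick(2);     /* suppressed: host hasn't woken yet */
    guest_add_buf_and_maybe_kick(3);     /* suppressed */
    host_wakes_up();
    guest_add_buf_and_maybe_kick(4);     /* kicks again */
    return 0;
}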
> - Full queue:
>   we keep a single extra skb around:
>   if we fail to transmit, we queue it
>   [ mst: idea to try: what does it do to
>     performance if we queue more packets? ]

Bad idea!! We already have two queues, this is a third. We should either
stop the queue before it gets full, or fix TX_BUSY handling. I've been
arguing on netdev for the latter (see thread "[PATCH 2/4] virtio_net: return
NETDEV_TX_BUSY instead of queueing an extra skb.").

> [ markmc: the queue might soon be going away:
>   200905292346.04815.rusty at rustcorp.com.au

Ah, yep, that one.

>   http://archive.netbsd.se/?ml=linux-netdev&a=2009-05&m=10788575 ]
>
> - We get each buffer from host as it is completed and free it
> - TX interrupts are only enabled when queue is stopped,
>   and when it is originally created (we disable them on completion)
>   [ mst: idea: second part is probably unintentional.
>     todo: we probably should disable interrupts when device is created. ]

Yep, minor wart.

> - We poll for buffer completions:
>   1. Before each TX
>   2. On a timer tasklet (unless 3 is supported)
>   3. When host sends us interrupt telling us that the queue is empty
>   [ mst: idea to try: instead of empty, enable send interrupts on xmit
>     when buffer is almost full (e.g. at least half empty): we are running
>     out of buffers, it's important to free them ASAP. Can be done from
>     host or from guest. ]
>   [ Rusty proposing that we don't need (2) or (3) if the skbs are orphaned
>     before start_xmit(). See subj "net: skb_orphan on
>     dev_hard_start_xmit". ]
>   [ rusty also seems to be suggesting that disabling
>     VIRTIO_F_NOTIFY_ON_EMPTY on the host should help the case where the
>     host out-paces the guest ]

Yes, that's more fruitful.

> - Each skb has a 128 byte buffer at head and a single page for
>   data. Only full pages are passed to virtio buffers.
>   [ mst: for large packets, managing the 128 head buffers is wasted
>     effort. Try allocating skbs on rcv path when needed. ]
>   [ mst: to clarify the previous suggestion: I am talking about
>     merging here. We currently allocate skbs and pages for them. If a
>     packet spans multiple pages, we discard the extra skbs. Instead, let's
>     allocate pages but not skbs. Allocate and fill skbs on receive path. ]

Yep. There's another issue here, which is alignment: packets which get
placed into pages are misaligned (that 14 byte ethernet header). We should
add a feature to allow the host to say "I've skipped this many bytes at the
front".

> - Buffers are replenished after packet is received,
>   when number of buffers becomes low (below 1/2 max).
>   This serves to reduce the number of kicks (VM exits) for RX.
>   [ mst: code might become simpler if we add buffers
>     immediately, but don't kick until later ]
>   [ markmc: possibly. batching this buffer allocation might be
>     introducing more unpredictability to benchmarks too - i.e. there isn't
>     a fixed per-packet overhead, some packets randomly have a higher
>     overhead ]
>   on failure to allocate in atomic context we simply stop
>   and try again on next recv packet.
>   [ mst: there's a fixme that this fails if we completely run out of
>     buffers, should be handled by timer. could be a thread as well
>     (allocate with GFP_KERNEL).
>     idea: might be good for performance anyway. ]

Yeah, this "batched packet add" is completely unscientific. The host will be
ignoring notifications anyway, so it shouldn't win anything AFAICT. Ditch it
and benchmark.

>   After adding buffers, we do a kick.
>   [ mst: test whether this optimization works: recv kicks should be rare ]
>   Outstanding buffers are kept on recv linked list.
>   [ mst: optimization idea: virtqueue already tracks
>     posted buffers. Add flush operation and use that instead. ]

Don't understand this comment?

> - recv is done with napi: on recv interrupt, disable interrupts
>   poll until queue is empty, enable when it's empty
>   [ mst: test how well does this work. should get 1 interrupt per
>     N packets. what is N? ]

It works if the guest is outpacing the host, but in practice I had trouble
getting above about 2:1.

I've attached a spreadsheet showing the results of various tests using
lguest. You can see the last one "lguest:net-delay-for-more-output.patch"
where I actually inserted a silly 50 usec delay before sending the receive
interrupt: 47k irqs for 1M packets is great, too bad about the latency :)

> [ mst: idea: implement interrupt coalescing? ]

lguest does this in the host, with mixed results. Here's the commentary from
my lguest:reduce_triggers-on-recv.patch (which is queued for linux-next as I
believe it's the right thing even though the win is in the noise).

    lguest: try to batch interrupts on network receive

    Rather than triggering an interrupt every time, we only trigger an
    interrupt when there are no more incoming packets (or the recv queue
    is full).

    However, the overhead of doing the select to figure this out is
    measurable: 1M pings goes from 98 to 104 seconds, and 1G Guest->Host
    TCP goes from 3.69 to 3.94 seconds. It's close to the noise though.

    I tested various timeouts, including reducing it as the number of
    pending packets increased, timing a 1 gigabyte TCP send from
    Guest -> Host and Host -> Guest (GSO disabled, to increase packet
    rate).

    // time tcpblast -o -s 65536 -c 16k 192.168.2.1:9999 > /dev/null

    Timeout     Guest->Host  Pkts/irq   Host->Guest  Pkts/irq
    Before         11.3s        1.0        6.3s         1.0
    0              11.7s        1.0        6.6s        23.5
    1              17.1s        8.8        8.6s        26.0
    1/pending      13.4s        1.9        6.6s        23.8
    2/pending      13.6s        2.8        6.6s        24.1
    5/pending      14.1s        5.0        6.6s        24.4
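The "only interrupt when nothing more is pending" check described in that
commentary boils down to something like the following sketch; the helper
names are invented here, and the real code lives in the lguest launcher.

#include <stdbool.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

static bool more_data_soon(int fd, long timeout_usec)
{
    fd_set rfds;
    struct timeval tv = { 0, timeout_usec };

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &tv) > 0;
}

/* hypothetical hook: called after copying one packet into a guest RX buffer */
static void maybe_irq(int fd, bool rx_ring_full, void (*raise_irq)(void))
{
    if (rx_ring_full || !more_data_soon(fd, 0))
        raise_irq();                     /* batch: one interrupt per burst */
}

static void fake_irq(void) { puts("guest RX interrupt"); }

int main(void)
{
    int p[2];
    char c;

    if (pipe(p))                         /* the pipe stands in for the tap fd */
        return 1;
    if (write(p[1], "x", 1) != 1)        /* one more "packet" still pending */
        return 1;
    maybe_irq(p[0], false, fake_irq);    /* suppressed: more data readable */
    if (read(p[0], &c, 1) != 1)          /* drain it */
        return 1;
    maybe_irq(p[0], false, fake_irq);    /* nothing pending now: fires */
    return 0;
}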
> [ mst: some architectures (with expensive unaligned DMA) override
>   NET_IP_ALIGN. since we don't really do DMA, we probably should use
>   alignment of 2 always ]

That's unclear: what if the host is doing DMA?

> [ mst: question: there's a FIXME to avoid modulus in the math.
>   since num is a power of 2, isn't this just & (num - 1)? ]

Exactly.

> Polling buffer:
>   we look at vq index and use that to find the next completed buffer
>   the pointer to data (skb) is retrieved and returned to user
>   [ mst: clearing data is only needed for debugging.
>     try removing this write - cache will be cleaner? ]

It's our only way of detecting issues with hosts. We have reports of
BAD_RING being triggered (unf. not reproducible).

> TX:
> We poll for TX packets in 2 ways
> - On timer event (see below)
> - When we get a kick from guest
> At this point, we disable further notifications,
> and start a timer. Notifications are reenabled after this.
> This is designed to reduce the number of VM exits due to TX.
> [ markmc: tried removing the timer.
>   It seems to really help some workloads. E.g. on RHEL:
>   http://markmc.fedorapeople.org/virtio-netperf/2009-04-15/
>   on fedora removing timer has no major effect either way:
>   http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/g-h-tput-04-no-tx-timer.html ]

lguest went fully multithreaded, dropped the timer hack. Much nicer, and
faster. (See second point on graph). Timers are a hack because we're not
async, so fixing the real problem avoids that optimization guessing game
entirely.

> Packets are polled from virtio ring, walking descriptor linked list.
> [ mst: optimize for completing in order? ]
> Packet addresses are converted to guest iovec, using
> cpu_physical_memory_map
> [ mst: cpu_physical_memory_map could be optimized
>   to only handle what we actually use this for:
>   single page in RAM ]

Anthony had a patch for this IIRC.

> Interrupts will be reported to eventfd descriptors, and device will poll
> eventfd descriptors to get kicks from guest.

This is definitely a win. AFAICT you can inject interrupts into the guest
from a separate thread today in KVM, too, so there's no core reason why
devices can't be completely async with this one change.

Cheers,
Rusty.

(Attachment: results3.gnumeric, the spreadsheet of lguest results referred to
above:
http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090610/85b00332/attachment-0001.gnumeric )