Here are some notes about performance that I prepared a while ago.
> TX is "packets from the guest", RX is "packets for the
guest".
>
> For discussion purposes, here's how the TX path works (this is the
> fast case - if there are resource shortages, ring fills, etc. things
> are more complex):
>
> domU: xnf is passed a packet chain (typically only a single packet). It:
> - flattens the message to a single mblk which is contained in a
> single page (this might be a no-op),
> - allocates a grant reference,
> - grants the backend access to the page containing the packet,
> - gets a slot in the tx ring,
> - updates the tx ring,
> - hypercall notifies the backend that a packet is ready.
>
> The TX ring is cleaned lazily, usually when getting a slot from
> the ring fails. Cleaning the ring results in freeing any buffers
> that were used for transmit.
>
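> As a concrete (and simplified) sketch, the per-packet work in the
> frontend looks roughly like this. The ring macros and the
> netif_tx_request fields come from the Xen public headers; xnfp, oeid
> and helpers like xnf_buffer_mfn(), xnf_claim_tx_id() and the notify
> call are illustrative rather than the actual xnf code, and error
> handling is omitted:
>
>   /* Flatten so the whole packet lives in one page (often a no-op). */
>   if (mp->b_cont != NULL)
>           (void) pullupmsg(mp, -1);
>
>   /* Grant the backend (peer domain oeid) read-only access to the page. */
>   gref = gnttab_grant_foreign_access(oeid, xnf_buffer_mfn(mp), 1);
>
>   /* Take the next free slot in the TX ring and describe the packet. */
>   txreq = RING_GET_REQUEST(&xnfp->xnf_tx_ring,
>       xnfp->xnf_tx_ring.req_prod_pvt);
>   txreq->id = xnf_claim_tx_id(xnfp, mp, gref);   /* for later reaping */
>   txreq->gref = gref;
>   txreq->offset = (uintptr_t)mp->b_rptr & PAGEOFFSET;
>   txreq->size = MBLKL(mp);
>   txreq->flags = 0;
>   xnfp->xnf_tx_ring.req_prod_pvt++;
>
>   /* Publish the request and notify the backend if it asked for that. */
>   RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&xnfp->xnf_tx_ring, notify);
>   if (notify)
>           ec_notify_via_evtchn(xnfp->xnf_evtchn);
>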
> dom0: xnb receives an interrupt to say that the xnf sent one or more
> packets. It:
> - for each consumed slot in the ring:
> - add the grant reference of the page containing the packet to a
> list.
> - hypercall to map all of the pages for which we have grant
> references.
> - for each consumed slot in the ring:
> - allocate an mblk for the packet.
> - copy data from the granted page to the mblk.
> - store mblk in a list.
> - hypercall to unmap all of the granted pages.
> - pass the packet chain down to the NIC (typically a VNIC).
>
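> In rough code, the map/copy/unmap part of that is as follows. The
> GNTTABOP structures and flags are from the Xen public headers; xnbp,
> map_va[] and the request walking are illustrative, with error
> handling omitted:
>
>   /* One map operation per consumed TX request. */
>   for (i = 0; i < npkts; i++) {
>           txreq = RING_GET_REQUEST(&xnbp->xnb_tx_ring, cons + i);
>           mop[i].host_addr = map_va[i];       /* VA reserved in dom0 */
>           mop[i].dom = xnbp->xnb_peer;
>           mop[i].ref = txreq->gref;
>           mop[i].flags = GNTMAP_host_map | GNTMAP_readonly;
>   }
>
>   /* A single hypercall maps all of the granted pages into dom0. */
>   (void) HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, mop, npkts);
>
>   /* Copy each packet out of its mapped page into a fresh mblk. */
>   for (i = 0; i < npkts; i++) {
>           txreq = RING_GET_REQUEST(&xnbp->xnb_tx_ring, cons + i);
>           mp = allocb(txreq->size, BPRI_MED);
>           bcopy((caddr_t)map_va[i] + txreq->offset, mp->b_wptr,
>               txreq->size);
>           mp->b_wptr += txreq->size;
>           /* ... chain mp for the NIC, fill in the matching unmap op. */
>           uop[i].host_addr = map_va[i];
>           uop[i].dev_bus_addr = 0;
>           uop[i].handle = mop[i].handle;
>   }
>
>   /* A single hypercall unmaps them all again. */
>   (void) HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, uop, npkts);
>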
> Simpler improvements:
> - Add support for the scatter-gather extension to our
> frontend/backend driver pair. This would mean that we don't need
> to flatten mblk chains that belong to a single packet in the
> frontend driver. I have a quick prototype of this based on some
> work that Russ did (the Windows driver tends to use long packet
> chains, so it's wanted in our backend). There's a sketch of the
> ring usage after this list.
> - Look at using the 'hypervisor copy' hypercall to move data from
> guest pages into mblks in the backend driver. This would remove
> the need to map the granted pages into dom0 (which is
> expensive). Prototyping this should be straightforward and it may
> provide a big win, but without trying we don't know. Certainly it
> would push the dom0 CPU time down (by moving the work into the
> hypervisor). A sketch of what that copy loop might look like
> follows this list.
> - Use the guest-provided buffers directly (esballoc) rather than
> copying the data into more buffers. I had an implementation of
> this and it suffered in three ways:
> - The buffer management was poor, causing a lot of lock contention
> over the ring (the tx completion freed the buffer and this
> contended with the tx lock used to reap packets from the
> ring). This could be fixed with a little time.
> - There are a limited number of ring entries (256) and they cannot
> be reused until the associated buffer is freed. If the dom0
> stack or a driver holds on to transmit buffers for a long time,
> we see ring exhaustion. The Neptune driver was particularly bad
> for this.
> - Guests grant read-only mappings for these pages. Unfortunately
> the Solaris IP stack expects to be able to modify packets which
> causes page faults. There are a couple of workarounds available:
> - Modify Solaris guests to grant read/write mappings and
> indicate this. I have this implemented and it works, but it's
> somewhat undesirable (and doesn't help with Linux or Windows
> guests).
> - Indicate to the MAC layer that these packets are 'read only'
> and have it copy them if they are for the local stack.
> - Implement an address space manager for the pages used for
> these packets and handle faults as they occur - somewhat
> blue-sky this one :-)
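>
> For the scatter-gather item above, the wire format already allows a
> packet to span several TX requests chained with NETTXF_more_data.
> Roughly what the frontend would do instead of flattening (granting
> each fragment as in the earlier frontend sketch; push/notify
> unchanged, and each fragment still has to fit within a single page):
>
>   total = msgsize(mp);
>   for (bp = mp; bp != NULL; bp = bp->b_cont) {
>           txreq = RING_GET_REQUEST(&xnfp->xnf_tx_ring,
>               xnfp->xnf_tx_ring.req_prod_pvt++);
>           txreq->gref = xnf_grant_fragment(xnfp, bp);
>           txreq->offset = (uintptr_t)bp->b_rptr & PAGEOFFSET;
>           /* By convention the first slot carries the whole packet
>              size, the later slots just their own fragment size. */
>           txreq->size = (bp == mp) ? total : MBLKL(bp);
>           txreq->flags = (bp->b_cont != NULL) ? NETTXF_more_data : 0;
>           txreq->id = xnf_claim_tx_id(xnfp, bp, txreq->gref);
>   }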
>
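> To make the 'hypervisor copy' idea concrete, the backend loop above
> would become something like this, with no mapping or unmapping at
> all. gnttab_copy_t and the GNTCOPY flags are from the Xen public
> headers; xnb_va_to_gmfn() is an illustrative helper, the other names
> are reused from the earlier backend sketch, and a real implementation
> would have to split any copy whose destination straddles a page
> boundary (cf. footnote [4]):
>
>   gnttab_copy_t cop[256];    /* one per TX ring slot */
>
>   for (i = 0; i < npkts; i++) {
>           txreq = RING_GET_REQUEST(&xnbp->xnb_tx_ring, cons + i);
>           mp[i] = allocb(txreq->size, BPRI_MED);
>
>           cop[i].source.u.ref = txreq->gref;   /* guest page, by gref */
>           cop[i].source.domid = xnbp->xnb_peer;
>           cop[i].source.offset = txreq->offset;
>           cop[i].dest.u.gmfn = xnb_va_to_gmfn(mp[i]->b_wptr);
>           cop[i].dest.domid = DOMID_SELF;
>           cop[i].dest.offset = (uintptr_t)mp[i]->b_wptr & PAGEOFFSET;
>           cop[i].len = txreq->size;
>           cop[i].flags = GNTCOPY_source_gref;
>
>           mp[i]->b_wptr += txreq->size;
>   }
>
>   /* A single hypercall copies every packet into dom0 memory. */
>   (void) HYPERVISOR_grant_table_op(GNTTABOP_copy, cop, npkts);
>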
> More complex improvements:
> - Avoid mapping the guest pages into dom0 completely if the packet
> is not destined for dom0. If the guest is sending a packet to a
> third party, dom0 doesn't need to map in the packet at all - it
> can pass the MA[1] to the DMA engine of the NIC without ever
> acquiring a VA. Issues:
> - We need the destination MAC address of the packet to be included
> in the TX ring so that we can route the packet (e.g. decide if
> it's for dom0, another domU or external). There's no room for it
> in the current ring structures, see "netchannel2" comments
> further on.
> - The MAC layer and any interested drivers would need to learn
> about packets for which there is currently no VA. This will
> require *big* changes.
> - Cache mappings of the granted pages from the guest domain. It's
> not clear how much benefit this would have for the transmit path -
> we'd need to see how often the same pages are reused as transmit
> buffers by the guest.
>
> Here's the RX path (again, simpler case):
>
> domU: When the interface is created, domU:
> - for each entry in the RX ring:
> - allocate an MTU sized buffer,
> - find the PA and MFN[2] of the buffer,
> - allocate a grant reference for the buffer,
> - update the ring with the details of the buffer (gref and id)
> - signal the backend that RX buffers are available
>
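> A sketch of that posting loop (netif_rx_request and the ring macros
> are from the Xen public headers; the xnf_buf_t bookkeeping, oeid and
> ring_size are illustrative):
>
>   for (i = 0; i < ring_size; i++) {
>           xnf_buf_t *buf = xnf_buf_alloc(xnfp);   /* MTU sized, 1 page */
>
>           /* Grant the backend access so it can copy into the buffer. */
>           buf->gref = gnttab_grant_foreign_access(oeid, buf->mfn, 0);
>
>           rxreq = RING_GET_REQUEST(&xnfp->xnf_rx_ring,
>               xnfp->xnf_rx_ring.req_prod_pvt++);
>           rxreq->id = buf->id;      /* finds 'buf' again on completion */
>           rxreq->gref = buf->gref;
>   }
>
>   RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&xnfp->xnf_rx_ring, notify);
>   if (notify)
>           ec_notify_via_evtchn(xnfp->xnf_evtchn);
>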
> dom0: When a packet arrives[3]:
> - driver calls mac_rx() having prepared a packet,
> - MAC layer classifies the packet (unless that already comes for
> free from the hardware ring it arrived on),
> - MAC layer passes packet chain (usually just one packet) to xnb
> RX function
> - xnb RX function:
> - for each packet in the chain (b_next):
> - get a slot in the RX ring
> - for each mblk in the packet (b_cont):
> - for each page in the mblk[4]:
> - fill in a hypervisor copy request for this chunk
> - hypercall to perform the copies
> - mark the RX ring entry completed
> - notify the frontend of new packets (if required[5]).
> - free the packet chain.
>
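> Building the copy requests is the fiddly part, because of the
> per-page rule in footnote [4]. For one packet going into one RX ring
> slot it is roughly this (cop[], rxreq and xnb_va_to_gmfn() are
> illustrative, as in the earlier copy sketch):
>
>   n = 0;
>   dst_off = 0;
>   for (bp = mp; bp != NULL; bp = bp->b_cont) {
>           uchar_t *rp = bp->b_rptr;
>           size_t left = MBLKL(bp);
>
>           while (left > 0) {
>                   /* Never let a single copy cross a page boundary. */
>                   size_t chunk = MIN(left,
>                       PAGESIZE - ((uintptr_t)rp & PAGEOFFSET));
>
>                   cop[n].source.u.gmfn = xnb_va_to_gmfn(rp);
>                   cop[n].source.domid = DOMID_SELF;
>                   cop[n].source.offset = (uintptr_t)rp & PAGEOFFSET;
>                   cop[n].dest.u.ref = rxreq->gref;  /* guest RX buffer */
>                   cop[n].dest.domid = xnbp->xnb_peer;
>                   cop[n].dest.offset = dst_off;
>                   cop[n].len = chunk;
>                   cop[n].flags = GNTCOPY_dest_gref;
>
>                   dst_off += chunk;
>                   rp += chunk;
>                   left -= chunk;
>                   n++;
>           }
>   }
>
>   /* In practice the ops for every packet in the chain are batched
>      into a single GNTTABOP_copy hypercall. */
>   (void) HYPERVISOR_grant_table_op(GNTTABOP_copy, cop, n);
>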
> domU: When a packet arrives (notified by the backend):
> - for each dirty entry in the RX ring:
> - allocate an mblk for the data
> - copy the data from the RX buffer to the mblk
> - add the mblk to the packet chain
> - mark the ring entry free (e.g. re-post the buffer)
> - notify the backend that the ring has free entries (if required).
> - pass the packet chain to mac_rx().
>
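> In code, the per-interrupt work in the frontend is roughly as below.
> The netif_rx_response fields and ring macros are from the Xen public
> headers; xnf_buf_find(), xnf_buf_repost() and xnf_mh are
> illustrative:
>
>   mblk_t *head = NULL, **tail = &head;
>
>   while (RING_HAS_UNCONSUMED_RESPONSES(&xnfp->xnf_rx_ring)) {
>           rxresp = RING_GET_RESPONSE(&xnfp->xnf_rx_ring,
>               xnfp->xnf_rx_ring.rsp_cons++);
>           buf = xnf_buf_find(xnfp, rxresp->id);    /* posted earlier */
>
>           /* Today: a fresh mblk and a copy ('status' is the length). */
>           mp = allocb(rxresp->status, BPRI_MED);
>           bcopy(buf->va + rxresp->offset, mp->b_wptr, rxresp->status);
>           mp->b_wptr += rxresp->status;
>
>           *tail = mp;
>           tail = &mp->b_next;
>
>           xnf_buf_repost(xnfp, buf);   /* re-grant, new ring request */
>   }
>
>   /* Notify the backend of free ring entries if required, then ... */
>   mac_rx(xnfp->xnf_mh, NULL, head);
>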
> Simpler improvements:
> - Don't allocate a new mblk and copy the data in the domU interrupt
> path, rather wrap an mblk around the buffer (sketched after this
> list) and re-post a new one. This looks like it would be a good
> win - definitely worth building something to see how it behaves.
> Obviously the buffer management gets a little more complicated,
> but it may be worth it. The downside is that it reduces the likely
> benefit of having the backend cache mappings for the pre-posted RX
> buffers, as we are much less likely to recycle the same buffers
> over and over again (which is what happens today).
> - Update the frontend driver to use the Crossbow polling
> implementation, significantly reducing the interrupt load on the
> guest. Max started on this but it has languished since he left
> us.
>
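> The wrap-instead-of-copy change above would replace the allocb() and
> bcopy() pair in the previous sketch with something like this
> (desballoc() and frtn_t are standard DDI; the buffer bookkeeping and
> xnf_buf_recycle() are illustrative):
>
>   /* Wrap the posted RX buffer directly in an mblk - no copy. */
>   buf->free_rtn.free_func = xnf_buf_recycle;  /* runs at freemsg() time */
>   buf->free_rtn.free_arg = (caddr_t)buf;
>   mp = desballoc(buf->va + rxresp->offset, rxresp->status,
>       BPRI_MED, &buf->free_rtn);
>   mp->b_wptr += rxresp->status;
>
>   /* Post a *new* buffer into the ring slot; the wrapped one goes back
>      to the pool via xnf_buf_recycle() once the stack frees the mblk. */
>   xnf_buf_post(xnfp, xnf_buf_alloc(xnfp));
>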
> More complex improvements:
> - Given that the guest pre-posts the buffers that it will use for
> received data, push these buffers down into the MAC layer,
> allowing the driver to directly place packets into guest
> buffers. This presumes that we can get an RX ring in the driver
> assigned for the MAC address of the guest.
>
> General things (TX and RX):
> - Implementing scatter gather should improve some cases, but it's
> not that big a win. It allows us to implement jumbo-frames, which
> will show improvements in benchmarks. It also leads to...
> - Implementing LSO/LRO between dom0 and domU could have big
> benefits, as it will reduce the number of interrupts and the
> number of hypercalls.
> - All of the backend xnb instances currently operate independently -
> they share no state. If there are a large number of active guests
> it will probably be worth looking at a scheme where we shift to a
> worker thread per CPU and have that thread responsible for
> multiple xnb instances. This would allow us to reduce the
> hypercall count even more.
> - netchannel2 is a new inter-domain protocol implementation intended
> to address some of the shortcomings in the current protocol. It
> includes:
> - multiple pages of TX/RX descriptors which can either be just
> bigger rings or independent rings,
> - multiple event channels (which means multiple interrupts),
> - improved ring structure (space for MAC addresses, ...).
> With it there is a proposal for a soft IOMMU implementation to
> improve the use of grant mappings.
>
> We've done nothing with netchannel2 so far. In Linux it's
> currently a prototype with changes to an Intel driver to use it
> with VMDQ.
>
> Footnotes:
> [1] Machine address. In Xen it's no longer the case that all memory is
> mapped into the dom0 kernel - you may not even have a physical
> mapping for the memory.
> [2] Machine frame number, analogous to PFN.
> [3] This assumes packets from an external source. Locally generated
> packets destined for a guest jump into the flow a couple of items
> down the list.
> [4] Each chunk passed to the hypervisor copy routine must only contain
> a single page, as we don't know that the pages are machine
> contiguous (and it's pretty expensive to find out).
> [5] The frontend controls whether or not notification takes place using
> a watermark in the ring.
dme.
--
David Edmondson, Sun Microsystems, http://dme.org