On Tue, Feb 28, 2017 at 01:47:19PM +0800, Yuanhan Liu wrote:
> Hi,
> 
> For virtio-net, we use 2 descs for representing a (small) pkt. One for
> virtio-net header and another one for the pkt data. And it has two issues:
> 
> - the desc buffer for storing pkt data is halved
> 
>   Though we later introduced 2 more options to overcome this: ANY_LAYOUT
>   and indirect desc. The indirect desc has another issue: it introduces
>   an extra cache line visit.

So if we don't care about this part, we could maybe just add
a descriptor flag that puts the whole header in the descriptor.

> - virtio-net header could be scattered
> 
>   Assume the ANY_LAYOUT case, where the header is prepended before
>   each mbuf (or skb in kernel). In DPDK, a burst receive in vhost pmd
>   means 32 different cache line visits for the virtio header.
> 
>   For the legacy layout and indirect desc, the cache issue could be
>   diminished somewhat: we could arrange the virtio headers in the same
>   memory block and let the header desc point to the right one.
> 
>   But it's still not good enough: the virtio-net headers aren't accessed
>   in batch: they have to be accessed one by one (by reading the desc).
>   That said, it's still not that good for cache utilization.
> 
> 
> So I'm proposing a packed header:
> 
> - put all virtio-net headers in one memory block.
> 
>   A burst of 32 pkts needs to access only (32 * 12) / 64 = 6 cache lines.
>   Before, it could be 32 cache lines.
> 
> - introduce a header desc to reference the above memory block.
> 
>   desc->addr = starting addr of net headers mem block
>   desc->len  = size of all virtio net headers (burst size * header size)
> 
> Thus, in a burst size of 32, we only need 33 descs: one for headers and
> the others for storing the corresponding pkt data. More importantly, we
> could use the "len" field for computing the batch size. We could then load
> the virtio net headers at once; we could also prefetch all the descs at once.
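
To make sure I read the layout right, here is a rough sketch of the
arithmetic in C. The struct is just the existing descriptor shape; the
helper names, the 64-byte cache line and the assumption that the header
block is cache-line aligned are mine, not from your prototype or the spec.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed cache line size */
#define NET_HDR_SIZE    12u  /* header size used in the arithmetic above */

/* Same shape as an existing ring descriptor (addr/len/flags/next). */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical helper: number of packets covered by one header desc,
 * derived from desc->len = burst size * header size. */
static uint32_t packed_hdr_batch_size(const struct desc *hdr_desc,
                                      uint32_t hdr_size)
{
    assert(hdr_size != 0);
    return hdr_desc->len / hdr_size;
}

/* Cache lines touched when reading `len` contiguous bytes, assuming the
 * block starts on a cache-line boundary. */
static uint32_t cache_lines_spanned(uint32_t len)
{
    return (len + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}

/* One desc for the packed headers plus one data desc per packet. */
static uint32_t descs_needed(uint32_t batch)
{
    return batch + 1;
}
```

With len = 32 * 12 this gives a batch of 32, 6 cache lines of headers and
33 descs, matching the numbers above.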
> 
> Note it could also be adapted to virtio 0.95 and 1.0. I also made a simple
> prototype with DPDK (yet again, it's Tx path only), and I saw an impressive
> boost (about 30%) in a micro benchmark.
> 
> I think such a proposal may also help other devices, if they
> also have a small header for each piece of data.
> 
> Thoughts?
> 
> 	--yliu

That's great. An alternative might be to add an array of headers parallel
to the array of descriptors and indexed by head. A bit in the descriptor
would then be enough to mark such a header as valid.
It's also an alternative way to pass in batches for virtio 1.1.
This has the advantage that it helps non-batched workloads as well
if enough packets end up in the ring, but maybe this
predicts worse on the CPU. Worth benchmarking?
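
Roughly what I have in mind, as a sketch only: the flag value, the ring
size and all names below are hypothetical and not part of any virtio spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE        256u        /* assumed ring size */
#define DESC_F_HDR_VALID (1u << 3)   /* hypothetical flag bit */

/* Opaque 12-byte virtio-net header, as in the arithmetic above. */
struct net_hdr {
    uint8_t bytes[12];
};

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct ring {
    struct desc    desc[RING_SIZE];
    struct net_hdr hdr[RING_SIZE];   /* parallel to desc[], indexed by head */
};

/* Return the header that belongs to the chain starting at `head`, or NULL
 * if the descriptor does not mark its parallel header slot as valid. */
static const struct net_hdr *ring_hdr_for(const struct ring *r, uint16_t head)
{
    assert(head < RING_SIZE);
    if (!(r->desc[head].flags & DESC_F_HDR_VALID))
        return NULL;
    return &r->hdr[head];
}
```

Headers for consecutive heads sit next to each other here, so they share
cache lines whether or not the driver posted them as one batch.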

I hope the above thoughts are helpful, but code walks: if you can show
real gains I'd be inclined to say let's go with it. You don't necessarily
need to implement and benchmark all possible ideas others can come up with :)
(though that's just me, not speaking for anyone else;
 we'll have to put it on the TC ballot of course)
--  
MST