On Tue, Feb 28, 2017 at 01:47:19PM +0800, Yuanhan Liu wrote:
> Hi,
>
> For virtio-net, we use 2 descs for representing a (small) pkt. One for
> virtio-net header and another one for the pkt data. And it has two issues:
>
> - the desc buffer for storing pkt data is halved
>
> Though we later introduced 2 more options to overcome this: ANY_LAYOUT
> and indirect desc. The indirect desc has another issue: it introduces
> an extra cache line visit.
So if we don't care about this part, we could maybe just add
a descriptor flag that puts the whole header in the descriptor.
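
To make the idea concrete, here is a rough sketch, not a spec proposal. It
assumes the 16-byte split-ring descriptor and a 12-byte net header; the flag
name DESC_F_INLINE_HDR, its bit value and both helpers are made up for
illustration. The header simply overlays the addr and len fields, so the
packet data would still need a descriptor of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Split-ring descriptor layout (endianness annotations omitted). */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* 12-byte virtio-net header, including num_buffers. */
struct net_hdr {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;
};

/* Hypothetical flag bit: not defined by any virtio spec. */
#define DESC_F_INLINE_HDR (1u << 7)

/* The header has to fit in the bytes that precede the flags field. */
static_assert(sizeof(struct net_hdr) <= offsetof(struct vring_desc, flags),
              "net header must fit in addr + len");

/* Store the header inside the descriptor, overwriting addr and len. */
static void desc_set_inline_hdr(struct vring_desc *d, const struct net_hdr *h)
{
    memcpy(d, h, sizeof(*h));
    d->flags |= DESC_F_INLINE_HDR;
}

/* Copy the header back out; returns -1 if the flag is not set. */
static int desc_get_inline_hdr(const struct vring_desc *d, struct net_hdr *out)
{
    if (!(d->flags & DESC_F_INLINE_HDR))
        return -1;
    memcpy(out, d, sizeof(*out));
    return 0;
}
```

The reader touches no memory beyond the descriptor itself, which is the
point of the flag.
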
> - virtio-net header could be scattered
>
> Assume the ANY_LAYOUT case, where the header is prepended before
> each mbuf (or skb in kernel). In DPDK, a burst receive in vhost pmd
> means 32 different cache visits for the virtio header.
>
> For the legacy layout and indirect desc, the cache issue could somehow
> be diminished a bit: we could arrange the virtio headers in the same
> memory block and let the header desc point to the right one.
>
> But it's still not good enough: the virtio-net headers aren't accessed
> in batch: they have to be accessed one by one (by reading the desc).
> That said, it's still not that good for cache utilization.
>
>
> And I'm proposing packed header:
>
> - put all virtio-net headers in a memory block.
>
> A burst size of 32 pkts needs to access only (32 * 12) / 64 = 6 cache lines.
> While before, it could be 32 cache lines.
>
> - introduce a header desc to reference the above memory block.
>
> desc->addr = starting addr of net headers mem block
> desc->len = size of all virtio net headers (burst size * header size)
>
> Thus, in a burst size of 32, we only need 33 descs: one for headers and
> the others for storing the corresponding pkt data. More importantly, we
> could use the "len" field for computing the batch size. We then could load
> the virtio net headers at once; we could also prefetch all the descs at once.
>
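
For reference, the arithmetic above can be sketched as below. This is only
an illustration: it assumes a 12-byte header, 64-byte cache lines and a
header block that starts on a cache line boundary, and the helper names
(batch_size, descs_needed, cache_lines) are made up here rather than taken
from the prototype.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed cache line size */
#define NET_HDR_SIZE    12u  /* header size used in the example above */

/* Split-ring descriptor layout (endianness annotations omitted). */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

static_assert(sizeof(struct vring_desc) == 16, "descriptor is 16 bytes");

/* Batch size is recovered from the header desc: len / header size. */
static uint32_t batch_size(const struct vring_desc *hdr_desc)
{
    return hdr_desc->len / NET_HDR_SIZE;
}

/* One desc for the packed headers plus one data desc per packet. */
static uint32_t descs_needed(const struct vring_desc *hdr_desc)
{
    return batch_size(hdr_desc) + 1;
}

/* Cache lines covered by a contiguous, line-aligned block of nbytes. */
static uint32_t cache_lines(uint32_t nbytes)
{
    return (nbytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}
```

With desc->len = 32 * 12 = 384 this gives a batch of 32, 33 descs and 6
cache lines for the headers, versus up to 32 lines when each header sits
in front of its own buffer.
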
> Note it could also be adapted to virtio 0.95 and 1.0. I also made a simple
> prototype with DPDK (yet again, it's Tx path only), and I saw an impressive
> boost (about 30%) in a micro benchmark.
>
> I think such a proposal should help other devices, too, if they
> also have a small header for each piece of data.
>
> Thoughts?
>
> --yliu
That's great. An alternative might be to add an array of headers parallel
to the array of descriptors and indexed by head. A bit in the descriptor
would then be enough to mark such a header as valid.
It's also an alternative way to pass in batches for virtio 1.1.
This has the advantage that it helps non-batched workloads as well
if enough packets end up in the ring, but maybe the CPU
predicts this less well. Worth benchmarking?
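
A rough sketch of what I mean, under assumptions: the flag name
DESC_F_HDR_VALID, its bit value, the ring size and the helper names are all
invented for illustration, and the ring is simplified to the two arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256u
/* Hypothetical flag bit: not defined by any virtio spec. */
#define DESC_F_HDR_VALID (1u << 7)

/* Split-ring descriptor layout (endianness annotations omitted). */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* 12-byte virtio-net header, including num_buffers. */
struct net_hdr {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;
};

struct ring {
    struct vring_desc desc[RING_SIZE];
    struct net_hdr    hdr[RING_SIZE];  /* hdr[i] belongs to chain head i */
};

/* Driver side: the data desc goes at 'head', the header in the parallel slot. */
static void post_packet(struct ring *r, uint16_t head, uint64_t data_addr,
                        uint32_t data_len, const struct net_hdr *h)
{
    r->hdr[head] = *h;
    r->desc[head].addr = data_addr;
    r->desc[head].len = data_len;
    r->desc[head].flags = DESC_F_HDR_VALID;
    r->desc[head].next = 0;
}

/* Device side: return the header for 'head', or NULL if none is marked valid. */
static const struct net_hdr *hdr_for_head(const struct ring *r, uint16_t head)
{
    if (head >= RING_SIZE || !(r->desc[head].flags & DESC_F_HDR_VALID))
        return NULL;
    return &r->hdr[head];
}
```

Headers for consecutive heads sit next to each other in memory, so the
cache benefit shows up whenever the ring holds several packets, batched
or not. The cost is the per-descriptor flag check.
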
I hope the above thoughts are helpful, but
code walks: if you can show real gains I'd be inclined
to say let's go with it. You don't necessarily need to implement and
benchmark all possible ideas others can come up with :)
(though that's just me, not speaking for anyone else;
we'll have to put it on the TC ballot of course)
--
MST