On Wed, Feb 08, 2017 at 06:41:40PM +0100, Paolo Bonzini wrote:> > > On 08/02/2017 04:20, Michael S. Tsirkin wrote: > > * Scatter/gather support > > > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0: > > > > /* This marks a buffer as continuing via the next field. */ > > #define VRING_DESC_F_NEXT 1 > > > > Unlike virtio 1.0, all descriptors must have distinct ID values. > > > > Also unlike virtio 1.0, use of this flag will be an optional feature > > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it. > > I would still prefer that we had _either_ single-direct or > multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT. I can propose > my idea for this in a separate message.All it costs us spec-wise is a single bit :) The cost of indirect is an extra cache miss. We couldn't decide what's better for everyone in 1.0 days and I doubt we'll be able to now, but yes, benchmarking is needed to make sire it's required. Very easy to remove or not to use/support in drivers/devices though.> > * Batching descriptors: > > > > virtio 1.0 allows passing a batch of descriptors in both directions, by > > incrementing the used/avail index by values > 1. We can support this by > > chaining a list of descriptors through a bit the flags field. > > To allow use together with s/g, a different bit will be used. > > > > #define VRING_DESC_F_BATCH_NEXT 0x0010 > > > > Batching works for both driver and device descriptors. > > I'm still not sure how this would be useful.So this is used at least by virtio-net mergeable buffers to combine many buffers into a single packet. Similarly, on transmit linux sometimes supplies packets in batches (XMIT_MORE flag) if the other side processes them it seems nice to tell it: there's more to come soon, if you see this it is wise to poll now. That's why I kind of felt it's better as a standard bit.> It cannot be mandatory to > set the bit, I think, because you don't know when the host/guest is > going to read descriptors. So both host and guest always have to look > ahead one element in any case.Right but the point is what to do if you find nothing there? If you saw VRING_DESC_F_BATCH_NEXT it's a hint that you should poll, there's more to come soon.> > * Non power-of-2 ring sizes > > > > As the ring simply wraps around, there's no reason to > > require ring size to be power of two. > > It can be made a separate feature though. > > Power of 2 ring sizes are required in order to ignore the high bits of > the indices. With non-power-of-2 sizes you are forced to keep the > indices less than the ring size.Right. So if (unlikely(idx++ > size)) idx = 0; OTOH ring size that's twice larger than necessary because of power of two requirements wastes cache.> Alternatively you can do this: > > > * Event index would be in the range 0 to 2 * Queue Size > > (to detect wrap arounds) and wrap to 0 after that. > > > > The assumption is that each side maintains an internal > > descriptor counter 0 to 2 * Queue Size that wraps to 0. > > In that case, interrupt triggers when counter reaches > > the given value. > > but it seems more complicated than just forcing power-of-2 and ignoring > the high bits. > > Thanks, > > PaoloAbsolutely power of 2 lets you save a branch. At this stage I'm just recording all the ideas and then as a next step we can micro-benchmark prototypes and compare. -- MST
On 08/02/2017 20:59, Michael S. Tsirkin wrote:> We couldn't decide what's better for everyone in 1.0 days and I doubt > we'll be able to now, but yes, benchmarking is needed to make > sire it's required. Very easy to remove or not to use/support in > drivers/devices though.Fair enough, but of course then we must specify that devices MUST support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers SHOULD support both (or use neither). The drivers' part adds to the implementation burden, which is why I wanted to remove it. Alternatively we can say that indirect is mandatory for both devices and drivers (and save a feature bit), while VRING_DESC_F_NEXT is optional.>>> * Non power-of-2 ring sizes >>> >>> As the ring simply wraps around, there's no reason to >>> require ring size to be power of two. >>> It can be made a separate feature though. >> >> Power of 2 ring sizes are required in order to ignore the high bits of >> the indices. With non-power-of-2 sizes you are forced to keep the >> indices less than the ring size. > > Right. So > > if (unlikely(idx++ > size)) > idx = 0; > > OTOH ring size that's twice larger than necessary > because of power of two requirements wastes cache.I don't know. Power of 2 ring size is pretty standard, I'd rather avoid the complication and the gratuitous difference with 1.0. If batching is mostly advisory (with exceptions such as mrg-rxbuf) I don't have any problem with it. Paolo
On Thu, 9 Feb 2017 16:48:53 +0100 Paolo Bonzini <pbonzini at redhat.com> wrote:> On 08/02/2017 20:59, Michael S. Tsirkin wrote: > > We couldn't decide what's better for everyone in 1.0 days and I doubt > > we'll be able to now, but yes, benchmarking is needed to make > > sire it's required. Very easy to remove or not to use/support in > > drivers/devices though. > > Fair enough, but of course then we must specify that devices MUST > support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers > SHOULD support both (or use neither). > > The drivers' part adds to the implementation burden, which is why I > wanted to remove it. Alternatively we can say that indirect is > mandatory for both devices and drivers (and save a feature bit), while > VRING_DESC_F_NEXT is optional.I think this (INDIRECT mandatory, NEXT optional) makes sense. But we really need some benchmarking.> > >>> * Non power-of-2 ring sizes > >>> > >>> As the ring simply wraps around, there's no reason to > >>> require ring size to be power of two. > >>> It can be made a separate feature though. > >> > >> Power of 2 ring sizes are required in order to ignore the high bits of > >> the indices. With non-power-of-2 sizes you are forced to keep the > >> indices less than the ring size. > > > > Right. So > > > > if (unlikely(idx++ > size)) > > idx = 0; > > > > OTOH ring size that's twice larger than necessary > > because of power of two requirements wastes cache. > > I don't know. Power of 2 ring size is pretty standard, I'd rather avoid > the complication and the gratuitous difference with 1.0.I agree. I don't think dropping the power of 2 requirement buys us so much that it makes up for the added complexity.
On Thu, Feb 09, 2017 at 04:48:53PM +0100, Paolo Bonzini wrote:> > > On 08/02/2017 20:59, Michael S. Tsirkin wrote: > > We couldn't decide what's better for everyone in 1.0 days and I doubt > > we'll be able to now, but yes, benchmarking is needed to make > > sire it's required. Very easy to remove or not to use/support in > > drivers/devices though. > > Fair enough, but of course then we must specify that devices MUST > support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers > SHOULD support both (or use neither). > > The drivers' part adds to the implementation burden, which is why I > wanted to remove it. Alternatively we can say that indirect is > mandatory for both devices and drivers (and save a feature bit), while > VRING_DESC_F_NEXT is optional. > > >>> * Non power-of-2 ring sizes > >>> > >>> As the ring simply wraps around, there's no reason to > >>> require ring size to be power of two. > >>> It can be made a separate feature though. > >> > >> Power of 2 ring sizes are required in order to ignore the high bits of > >> the indices. With non-power-of-2 sizes you are forced to keep the > >> indices less than the ring size. > > > > Right. So > > > > if (unlikely(idx++ > size)) > > idx = 0; > > > > OTOH ring size that's twice larger than necessary > > because of power of two requirements wastes cache. > > I don't know. Power of 2 ring size is pretty standard, I'd rather avoid > the complication and the gratuitous difference with 1.0.I thought originally there's a reason 1.0 rings had to be powers of two but now I don't see why. OK, we can make it a feature flag later if we want to.> If batching is mostly advisory (with exceptions such as mrg-rxbuf) I > don't have any problem with it. > > Paolo
On Wed, 22 Feb 2017 18:43:05 +0200 "Michael S. Tsirkin" <mst at redhat.com> wrote:> On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote: > > > >>> * Non power-of-2 ring sizes > > > >>> > > > >>> As the ring simply wraps around, there's no reason to > > > >>> require ring size to be power of two. > > > >>> It can be made a separate feature though. > > > >> > > > >> Power of 2 ring sizes are required in order to ignore the high bits of > > > >> the indices. With non-power-of-2 sizes you are forced to keep the > > > >> indices less than the ring size. > > > > > > > > Right. So > > > > > > > > if (unlikely(idx++ > size)) > > > > idx = 0; > > > > > > > > OTOH ring size that's twice larger than necessary > > > > because of power of two requirements wastes cache. > > > > > > I don't know. Power of 2 ring size is pretty standard, I'd rather avoid > > > the complication and the gratuitous difference with 1.0. > > > > I agree. I don't think dropping the power of 2 requirement buys us so > > much that it makes up for the added complexity. > > I recalled why I came up with this. The issue is cache associativity. > Recall that besides the ring we have event suppression > structures - if we are lucky and things run at the same speed > everything can work by polling keeping events disabled, then > event suppression structures are never written to, they are read-only. > > However if ring and event suppression share a cache line ring accesses > have a chance to push the event suppression out of cache, causing > misses on read. > > This can happen if they are at the same offset in the set. > E.g. with L1 cache 4Kbyte sets are common, so same offset > within a 4K page. > > We can fix this by making event suppression adjacent in memory, e.g.: > > > [interrupt suppress] > [descriptor ring] > [kick suppress] > > If this whole structure fits in a single set, ring accesses will > not push kick or interrupt suppress out of cache. > Specific layout can be left for drivers, but as set size is > a power of two this might require a non-power of two ring size. > > I conclude that this is an optimization that needs to be > benchmarked.This makes sense. But wouldn't the optimum layout not depend on the platform?> > I also note that the generic description does not have to force > powers of two *even if devices actually require it*. > I would be inclined to word the text in a way that makes > relaxing the restriction easier. > > For example, we can say "free running 16 bit index" and this forces a > power of two, but we can also say "free running index wrapping to 0 > after (N*queue-size - 1) with N chosen such that the value fits in 16 > bit" and this is exactly the same if queue size is a power of 2. > > So we can add text saying "ring size MUST be a power of two" > and later it will be easy to relax just by adding a feature bit.A later feature bit sounds good.
On Tue, Mar 07, 2017 at 04:53:53PM +0100, Cornelia Huck wrote:> On Wed, 22 Feb 2017 18:43:05 +0200 > "Michael S. Tsirkin" <mst at redhat.com> wrote: > > > On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote: > > > > >>> * Non power-of-2 ring sizes > > > > >>> > > > > >>> As the ring simply wraps around, there's no reason to > > > > >>> require ring size to be power of two. > > > > >>> It can be made a separate feature though. > > > > >> > > > > >> Power of 2 ring sizes are required in order to ignore the high bits of > > > > >> the indices. With non-power-of-2 sizes you are forced to keep the > > > > >> indices less than the ring size. > > > > > > > > > > Right. So > > > > > > > > > > if (unlikely(idx++ > size)) > > > > > idx = 0; > > > > > > > > > > OTOH ring size that's twice larger than necessary > > > > > because of power of two requirements wastes cache. > > > > > > > > I don't know. Power of 2 ring size is pretty standard, I'd rather avoid > > > > the complication and the gratuitous difference with 1.0. > > > > > > I agree. I don't think dropping the power of 2 requirement buys us so > > > much that it makes up for the added complexity. > > > > I recalled why I came up with this. The issue is cache associativity. > > Recall that besides the ring we have event suppression > > structures - if we are lucky and things run at the same speed > > everything can work by polling keeping events disabled, then > > event suppression structures are never written to, they are read-only. > > > > However if ring and event suppression share a cache line ring accesses > > have a chance to push the event suppression out of cache, causing > > misses on read. > > > > This can happen if they are at the same offset in the set. > > E.g. with L1 cache 4Kbyte sets are common, so same offset > > within a 4K page. > > > > We can fix this by making event suppression adjacent in memory, e.g.: > > > > > > [interrupt suppress] > > [descriptor ring] > > [kick suppress] > > > > If this whole structure fits in a single set, ring accesses will > > not push kick or interrupt suppress out of cache. > > Specific layout can be left for drivers, but as set size is > > a power of two this might require a non-power of two ring size. > > > > I conclude that this is an optimization that needs to be > > benchmarked. > > This makes sense. But wouldn't the optimum layout not depend on the > platform?There's generally a tradeoff between performance and portability. Whether it's worth it would need to be tested. Further, it might be better to have platform-specific optimization tied to a given platform rather than a feature bit.> > > > I also note that the generic description does not have to force > > powers of two *even if devices actually require it*. > > I would be inclined to word the text in a way that makes > > relaxing the restriction easier. > > > > For example, we can say "free running 16 bit index" and this forces a > > power of two, but we can also say "free running index wrapping to 0 > > after (N*queue-size - 1) with N chosen such that the value fits in 16 > > bit" and this is exactly the same if queue size is a power of 2. > > > > So we can add text saying "ring size MUST be a power of two" > > and later it will be easy to relax just by adding a feature bit. > > A later feature bit sounds good.No need to delay benchmarking if someone has the time though :) -- MST