This is an update from v1 version. Changes: - update event suppression mechanism - separate options for indirect and direct s/g - lots of new features --- Performance analysis of this is in my kvm forum 2016 presentation. The idea is to have a r/w descriptor in a ring structure, replacing the used and available ring, index and descriptor buffer. * Descriptor ring: Guest adds descriptors with unique index values and DESC_HW set in flags. Host overwrites used descriptors with correct len, index, and DESC_HW clear. Flags are always set/cleared last. #define DESC_HW 0x0080 struct desc { __le64 addr; __le32 len; __le16 index; __le16 flags; }; When DESC_HW is set, descriptor belongs to device. When it is clear, it belongs to the driver. We can use 1 bit to set direction /* This marks a buffer as write-only (otherwise read-only). */ #define VRING_DESC_F_WRITE 2 * Scatter/gather support We can use 1 bit to chain s/g entries in a request, same as virtio 1.0: /* This marks a buffer as continuing via the next field. */ #define VRING_DESC_F_NEXT 1 Unlike virtio 1.0, all descriptors must have distinct ID values. Also unlike virtio 1.0, use of this flag will be an optional feature (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it. * Indirect buffers Can be marked like in virtio 1.0: /* This means the buffer contains a table of buffer descriptors. */ #define VRING_DESC_F_INDIRECT 4 Unlike virtio 1.0, this is a table, not a list: struct indirect_descriptor_table { /* The actual descriptors (16 bytes each) */ struct virtq_desc desc[len / 16]; }; The first descriptor is located at start of the indirect descriptor table, additional indirect descriptors come immediately afterwards. DESC_F_WRITE is the only valid flag for descriptors in the indirect table. Others should be set to 0 and are ignored. id is also set to 0 and should be ignored. virtio 1.0 seems to allow a s/g entry followed by an indirect descriptor. This does not seem useful, so we do not allow that anymore. This support would be an optional feature, same as in virtio 1.0 * Batching descriptors: virtio 1.0 allows passing a batch of descriptors in both directions, by incrementing the used/avail index by values > 1. We can support this by chaining a list of descriptors through a bit the flags field. To allow use together with s/g, a different bit will be used. #define VRING_DESC_F_BATCH_NEXT 0x0010 Batching works for both driver and device descriptors. * Processing descriptors in and out of order Device processing all descriptors in order can simply flip the DESC_HW bit as it is done with descriptors. Device can write descriptors out in order as they are used, overwriting descriptors that are there. Device must not use a descriptor until DESC_HW is set. It is only required to look at the first descriptor submitted. Driver must not overwrite a descriptor until DESC_HW is clear. It is only required to look at the first descriptor submitted. * Device specific descriptor flags We have a lot of unused space in the descriptor. This can be put to good use by reserving some flag bits for device use. For example, network device can set a bit to request that header in the descriptor is suppressed (in case it's all 0s anyway). This reduces cache utilization. Note: this feature can be supported in virtio 1.0 as well, as we have unused bits in both descriptor and used ring there. * Descriptor length in device descriptors virtio 1.0 places strict requirements on descriptor length. For example it must be 0 in used ring of TX VQ of a network device since nothing is written. In practice guests do not seem to use this, so we can simplify devices a bit by removing this requirement - if length is unused it should be ignored by driver. Some devices use identically-sized buffers in all descriptors. Ignoring length for driver descriptors there could be an option too. * Writing at an offset Some devices might want to write into some descriptors at an offset, the offset would be in config space, and a descriptor flag could indicate this: #define VRING_DESC_F_OFFSET 0x0020 How exactly to use the offset would be device specific, for example it can be used to align beginning of packet in the 1st buffer for mergeable buffers to cache line boundary while also aligning rest of buffers. * Non power-of-2 ring sizes As the ring simply wraps around, there's no reason to require ring size to be power of two. It can be made a separate feature though. * Interrupt/event suppression virtio 1.0 has two mechanisms for suppression but only one can be used at a time. we pack them together in a structure - one for interrupts, one for notifications: struct event { __le16 idx; __le16 flags; } Both fields would be optional, with a feature bit: VIRTIO_F_EVENT_IDX VIRTIO_F_EVENT_FLAGS * Flags can be used like in virtio 1.0, by storing a special value there: #define VRING_F_EVENT_ENABLE 0x0 #define VRING_F_EVENT_DISABLE 0x1 * Event index would be in the range 0 to 2 * Queue Size (to detect wrap arounds) and wrap to 0 after that. The assumption is that each side maintains an internal descriptor counter 0 to 2 * Queue Size that wraps to 0. In that case, interrupt triggers when counter reaches the given value. * If both features are negotiated, a special flags value can be used to switch to event idx: #define VRING_F_EVENT_IDX 0x2 * Prototype A partial prototype can be found under tools/virtio/ringtest/ring.c Test run: [mst at tuck ringtest]$ time ./ring real 0m0.399s user 0m0.791s sys 0m0.000s [mst at tuck ringtest]$ time ./virtio_ring_0_9 real 0m0.503s user 0m0.999s sys 0m0.000s It is planned to update it to this interface. Future changes and enhancements can (and should) be tested against this prototype. * Feature sets In particular are we going overboard with feature bits? It becomes hard to support all combinations in drivers and devices. Maybe we should document reasonable feature sets to be supported for each device. * Known issues/ideas This layout is optimized for host/guest communication, in a sense even more aggressively than virtio 1.0. It might be suboptimal for PCI hardware implementations. However, one notes that current virtio pci drivers are unlikely to work with PCI hardware implementations anyway (e.g. due to use of SMP barriers for ordering). Suggestions for improving this are welcome but need to be tested to make sure our main use case does not regress. Of course some improvements might be made optional, but if we add too many of these driver becomes unmanageable. --- Note: should this proposal be accepted and approved, one or more claims disclosed to the TC admin and listed on the Virtio TC IPR page https://www.oasis-open.org/committees/virtio/ipr.php might become Essential Claims. -- MST
On 02/08/2017 04:20 AM, Michael S. Tsirkin wrote: [...]> * Prototype > > A partial prototype can be found under > tools/virtio/ringtest/ring.c > > Test run: > [mst at tuck ringtest]$ time ./ring > real 0m0.399s > user 0m0.791s > sys 0m0.000s > [mst at tuck ringtest]$ time ./virtio_ring_0_9 > real 0m0.503s > user 0m0.999s > sys 0m0.000sI see similar improvements on s390, so I think this would be a very nice improvement. [...]> Note: should this proposal be accepted and approved, one or more > claims disclosed to the TC admin and listed on the Virtio TC > IPR page https://www.oasis-open.org/committees/virtio/ipr.php > might become Essential Claims. >I have trouble parsing that legal stuff. Do I read that right, that these claims can be implemented as part of a virtio implementation without any worries (e.g. non open source HW implementation or non open source hypervisor)?
On 08/02/2017 04:20, Michael S. Tsirkin wrote:> * Scatter/gather support > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0: > > /* This marks a buffer as continuing via the next field. */ > #define VRING_DESC_F_NEXT 1 > > Unlike virtio 1.0, all descriptors must have distinct ID values. > > Also unlike virtio 1.0, use of this flag will be an optional feature > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.I would still prefer that we had _either_ single-direct or multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT. I can propose my idea for this in a separate message.> * Batching descriptors: > > virtio 1.0 allows passing a batch of descriptors in both directions, by > incrementing the used/avail index by values > 1. We can support this by > chaining a list of descriptors through a bit the flags field. > To allow use together with s/g, a different bit will be used. > > #define VRING_DESC_F_BATCH_NEXT 0x0010 > > Batching works for both driver and device descriptors.I'm still not sure how this would be useful. It cannot be mandatory to set the bit, I think, because you don't know when the host/guest is going to read descriptors. So both host and guest always have to look ahead one element in any case.> * Non power-of-2 ring sizes > > As the ring simply wraps around, there's no reason to > require ring size to be power of two. > It can be made a separate feature though.Power of 2 ring sizes are required in order to ignore the high bits of the indices. With non-power-of-2 sizes you are forced to keep the indices less than the ring size. Alternatively you can do this:> * Event index would be in the range 0 to 2 * Queue Size > (to detect wrap arounds) and wrap to 0 after that. > > The assumption is that each side maintains an internal > descriptor counter 0 to 2 * Queue Size that wraps to 0. > In that case, interrupt triggers when counter reaches > the given value.but it seems more complicated than just forcing power-of-2 and ignoring the high bits. Thanks, Paolo
On Wed, Feb 08, 2017 at 06:41:40PM +0100, Paolo Bonzini wrote:> > > On 08/02/2017 04:20, Michael S. Tsirkin wrote: > > * Scatter/gather support > > > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0: > > > > /* This marks a buffer as continuing via the next field. */ > > #define VRING_DESC_F_NEXT 1 > > > > Unlike virtio 1.0, all descriptors must have distinct ID values. > > > > Also unlike virtio 1.0, use of this flag will be an optional feature > > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it. > > I would still prefer that we had _either_ single-direct or > multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT. I can propose > my idea for this in a separate message.All it costs us spec-wise is a single bit :) The cost of indirect is an extra cache miss. We couldn't decide what's better for everyone in 1.0 days and I doubt we'll be able to now, but yes, benchmarking is needed to make sire it's required. Very easy to remove or not to use/support in drivers/devices though.> > * Batching descriptors: > > > > virtio 1.0 allows passing a batch of descriptors in both directions, by > > incrementing the used/avail index by values > 1. We can support this by > > chaining a list of descriptors through a bit the flags field. > > To allow use together with s/g, a different bit will be used. > > > > #define VRING_DESC_F_BATCH_NEXT 0x0010 > > > > Batching works for both driver and device descriptors. > > I'm still not sure how this would be useful.So this is used at least by virtio-net mergeable buffers to combine many buffers into a single packet. Similarly, on transmit linux sometimes supplies packets in batches (XMIT_MORE flag) if the other side processes them it seems nice to tell it: there's more to come soon, if you see this it is wise to poll now. That's why I kind of felt it's better as a standard bit.> It cannot be mandatory to > set the bit, I think, because you don't know when the host/guest is > going to read descriptors. So both host and guest always have to look > ahead one element in any case.Right but the point is what to do if you find nothing there? If you saw VRING_DESC_F_BATCH_NEXT it's a hint that you should poll, there's more to come soon.> > * Non power-of-2 ring sizes > > > > As the ring simply wraps around, there's no reason to > > require ring size to be power of two. > > It can be made a separate feature though. > > Power of 2 ring sizes are required in order to ignore the high bits of > the indices. With non-power-of-2 sizes you are forced to keep the > indices less than the ring size.Right. So if (unlikely(idx++ > size)) idx = 0; OTOH ring size that's twice larger than necessary because of power of two requirements wastes cache.> Alternatively you can do this: > > > * Event index would be in the range 0 to 2 * Queue Size > > (to detect wrap arounds) and wrap to 0 after that. > > > > The assumption is that each side maintains an internal > > descriptor counter 0 to 2 * Queue Size that wraps to 0. > > In that case, interrupt triggers when counter reaches > > the given value. > > but it seems more complicated than just forcing power-of-2 and ignoring > the high bits. > > Thanks, > > PaoloAbsolutely power of 2 lets you save a branch. At this stage I'm just recording all the ideas and then as a next step we can micro-benchmark prototypes and compare. -- MST
On Wed, Feb 08, 2017 at 02:37:57PM +0100, Christian Borntraeger wrote:> On 02/08/2017 04:20 AM, Michael S. Tsirkin wrote: > [...] > > * Prototype > > > > A partial prototype can be found under > > tools/virtio/ringtest/ring.c > > > > Test run: > > [mst at tuck ringtest]$ time ./ring > > real 0m0.399s > > user 0m0.791s > > sys 0m0.000s > > [mst at tuck ringtest]$ time ./virtio_ring_0_9 > > real 0m0.503s > > user 0m0.999s > > sys 0m0.000s > > I see similar improvements on s390, so I think this would be a very nice > improvement. > > [...] > > Note: should this proposal be accepted and approved, one or more > > claims disclosed to the TC admin and listed on the Virtio TC > > IPR page https://www.oasis-open.org/committees/virtio/ipr.php > > might become Essential Claims. > > > > I have trouble parsing that legal stuff. Do I read that right, that > these claims can be implemented as part of a virtio implementation without > any worries (e.g. non open source HW implementation or non open source > hypervisor)?By that legal stuff do you mean the IPR or the Note? Not representing Red Hat here, and definitely not legal advice, below is just my personal understanding of the IPR requirements. Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy: https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode As far as I can tell, the hardware and software question is covered by that policy, since it states: Covered Product - includes only those specific portions of a product (hardware, software or combinations thereof) Also as far as I can tell IPR does not mention source at all, so I think virtio IPR would apply to open and closed source software equally. The Note is included to satisfy the disclosure requirements. Does this answer the question? -- MST
> -----Original Message----- > From: virtio-dev at lists.oasis-open.org [mailto:virtio-dev at lists.oasis-open.org] > On Behalf Of Michael S. Tsirkin > Sent: Wednesday, February 08, 2017 5:20 AM > To: virtio-dev at lists.oasis-open.org > Cc: virtualization at lists.linux-foundation.org > Subject: [virtio-dev] packed ring layout proposal v2 > > This is an update from v1 version. > Changes: > - update event suppression mechanism > - separate options for indirect and direct s/g > - lots of new features > > --- > > Performance analysis of this is in my kvm forum 2016 presentation. > The idea is to have a r/w descriptor in a ring structure, > replacing the used and available ring, index and descriptor > buffer. > > * Descriptor ring: > > Guest adds descriptors with unique index values and DESC_HW set in flags. > Host overwrites used descriptors with correct len, index, and DESC_HW > clear. Flags are always set/cleared last. > > #define DESC_HW 0x0080 > > struct desc { > __le64 addr; > __le32 len; > __le16 index; > __le16 flags; > }; > > When DESC_HW is set, descriptor belongs to device. When it is clear, > it belongs to the driver. > > We can use 1 bit to set direction > /* This marks a buffer as write-only (otherwise read-only). */ > #define VRING_DESC_F_WRITE 2 >A valid bit per descriptor does not let the device do an efficient prefetch. An alternative is to use a producer index(PI). Using the PI posted by the driver, and the Consumer Index(CI) maintained by the device, the device knows how much work it has outstanding, so it can do the prefetch accordingly. There are few options for the device to acquire the PI. Most efficient will be to write the PI in the doorbell together with the queue number. I would like to raise the need for a Completion Queue(CQ). Multiple Work Queues(hold the work descriptors, WQ in short) can be connected to a single CQ. So when the device completes the work on the descriptor, it writes a Completion Queue Entry (CQE) to the CQ. CQEs are continuous in memory so prefetching by the driver is efficient, although the device might complete work descriptors out of order. The interrupt handler is connected to the CQ, so an allocation of a single CQ per core, with a single interrupt handler is possible although this core might be using multiple WQs. One application for multiple WQs with a single CQ is Quality of Service(QoS). A user can open a WQ per QoS value(pcp value for example), and the device will schedule the work accordingly.> * Scatter/gather support > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0: > > /* This marks a buffer as continuing via the next field. */ > #define VRING_DESC_F_NEXT 1 > > Unlike virtio 1.0, all descriptors must have distinct ID values. > > Also unlike virtio 1.0, use of this flag will be an optional feature > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it. > > * Indirect buffers > > Can be marked like in virtio 1.0: > > /* This means the buffer contains a table of buffer descriptors. */ > #define VRING_DESC_F_INDIRECT 4 > > Unlike virtio 1.0, this is a table, not a list: > struct indirect_descriptor_table { > /* The actual descriptors (16 bytes each) */ > struct virtq_desc desc[len / 16]; > }; > > The first descriptor is located at start of the indirect descriptor > table, additional indirect descriptors come immediately afterwards. > DESC_F_WRITE is the only valid flag for descriptors in the indirect > table. Others should be set to 0 and are ignored. id is also set to 0 > and should be ignored. > > virtio 1.0 seems to allow a s/g entry followed by > an indirect descriptor. This does not seem useful, > so we do not allow that anymore. > > This support would be an optional feature, same as in virtio 1.0 > > * Batching descriptors: > > virtio 1.0 allows passing a batch of descriptors in both directions, by > incrementing the used/avail index by values > 1. We can support this by > chaining a list of descriptors through a bit the flags field. > To allow use together with s/g, a different bit will be used. > > #define VRING_DESC_F_BATCH_NEXT 0x0010 > > Batching works for both driver and device descriptors. > > > > * Processing descriptors in and out of order > > Device processing all descriptors in order can simply flip > the DESC_HW bit as it is done with descriptors. > > Device can write descriptors out in order as they are used, overwriting > descriptors that are there. > > Device must not use a descriptor until DESC_HW is set. > It is only required to look at the first descriptor > submitted. > > Driver must not overwrite a descriptor until DESC_HW is clear. > It is only required to look at the first descriptor > submitted. > > * Device specific descriptor flags > We have a lot of unused space in the descriptor. This can be put to > good use by reserving some flag bits for device use. > For example, network device can set a bit to request > that header in the descriptor is suppressed > (in case it's all 0s anyway). This reduces cache utilization. > > Note: this feature can be supported in virtio 1.0 as well, > as we have unused bits in both descriptor and used ring there. > > * Descriptor length in device descriptors > > virtio 1.0 places strict requirements on descriptor length. For example > it must be 0 in used ring of TX VQ of a network device since nothing is > written. In practice guests do not seem to use this, so we can simplify > devices a bit by removing this requirement - if length is unused it > should be ignored by driver. > > Some devices use identically-sized buffers in all descriptors. > Ignoring length for driver descriptors there could be an option too. > > * Writing at an offset > > Some devices might want to write into some descriptors > at an offset, the offset would be in config space, > and a descriptor flag could indicate this: > > #define VRING_DESC_F_OFFSET 0x0020 > > How exactly to use the offset would be device specific, > for example it can be used to align beginning of packet > in the 1st buffer for mergeable buffers to cache line boundary > while also aligning rest of buffers. > > * Non power-of-2 ring sizes > > As the ring simply wraps around, there's no reason to > require ring size to be power of two. > It can be made a separate feature though. > > > * Interrupt/event suppression > > virtio 1.0 has two mechanisms for suppression but only > one can be used at a time. we pack them together > in a structure - one for interrupts, one for notifications: > > struct event { > __le16 idx; > __le16 flags; > } > > Both fields would be optional, with a feature bit: > VIRTIO_F_EVENT_IDX > VIRTIO_F_EVENT_FLAGS > > * Flags can be used like in virtio 1.0, by storing a special > value there: > > #define VRING_F_EVENT_ENABLE 0x0 > > #define VRING_F_EVENT_DISABLE 0x1 > > * Event index would be in the range 0 to 2 * Queue Size > (to detect wrap arounds) and wrap to 0 after that. > > The assumption is that each side maintains an internal > descriptor counter 0 to 2 * Queue Size that wraps to 0. > In that case, interrupt triggers when counter reaches > the given value. > > * If both features are negotiated, a special flags value > can be used to switch to event idx: > > #define VRING_F_EVENT_IDX 0x2 > > > * Prototype > > A partial prototype can be found under > tools/virtio/ringtest/ring.c > > Test run: > [mst at tuck ringtest]$ time ./ring > real 0m0.399s > user 0m0.791s > sys 0m0.000s > [mst at tuck ringtest]$ time ./virtio_ring_0_9 > real 0m0.503s > user 0m0.999s > sys 0m0.000s > > It is planned to update it to this interface. Future changes and > enhancements can (and should) be tested against this prototype. > > * Feature sets > In particular are we going overboard with feature bits? It becomes hard > to support all combinations in drivers and devices. Maybe we should > document reasonable feature sets to be supported for each device. > > * Known issues/ideas > > This layout is optimized for host/guest communication, > in a sense even more aggressively than virtio 1.0. > It might be suboptimal for PCI hardware implementations. > However, one notes that current virtio pci drivers are > unlikely to work with PCI hardware implementations anyway > (e.g. due to use of SMP barriers for ordering). > > Suggestions for improving this are welcome but need to be tested to make > sure our main use case does not regress. Of course some improvements > might be made optional, but if we add too many of these driver becomes > unmanageable. > > --- > > Note: should this proposal be accepted and approved, one or more > claims disclosed to the TC admin and listed on the Virtio TC > IPR page > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fww > w.oasis- > open.org%2Fcommittees%2Fvirtio%2Fipr.php&data=02%7C01%7Cliorn%40m > ellanox.com%7Cf41239019c1441e73b0308d4c7b0a483%7Ca652971c7d2e4d9 > ba6a4d149256f461b%7C0%7C0%7C636353008872143792&sdata=L946V5o0P > k8th%2B2IkHgvALmhnIEWD9hcMZvMxLetavc%3D&reserved=0 > might become Essential Claims. > > -- > MST > > --------------------------------------------------------------------- > To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org > For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org
On Sun, Jul 16, 2017 at 06:00:45AM +0000, Lior Narkis wrote:> > -----Original Message----- > > From: virtio-dev at lists.oasis-open.org [mailto:virtio-dev at lists.oasis-open.org] > > On Behalf Of Michael S. Tsirkin > > Sent: Wednesday, February 08, 2017 5:20 AM > > To: virtio-dev at lists.oasis-open.org > > Cc: virtualization at lists.linux-foundation.org > > Subject: [virtio-dev] packed ring layout proposal v2 > > > > This is an update from v1 version. > > Changes: > > - update event suppression mechanism > > - separate options for indirect and direct s/g > > - lots of new features > > > > --- > > > > Performance analysis of this is in my kvm forum 2016 presentation. > > The idea is to have a r/w descriptor in a ring structure, > > replacing the used and available ring, index and descriptor > > buffer. > > > > * Descriptor ring: > > > > Guest adds descriptors with unique index values and DESC_HW set in flags. > > Host overwrites used descriptors with correct len, index, and DESC_HW > > clear. Flags are always set/cleared last. > > > > #define DESC_HW 0x0080 > > > > struct desc { > > __le64 addr; > > __le32 len; > > __le16 index; > > __le16 flags; > > }; > > > > When DESC_HW is set, descriptor belongs to device. When it is clear, > > it belongs to the driver. > > > > We can use 1 bit to set direction > > /* This marks a buffer as write-only (otherwise read-only). */ > > #define VRING_DESC_F_WRITE 2 > > > > A valid bit per descriptor does not let the device do an efficient prefetch. > An alternative is to use a producer index(PI). > Using the PI posted by the driver, and the Consumer Index(CI) maintained by the device, the device knows how much work it has outstanding, so it can do the prefetch accordingly. > There are few options for the device to acquire the PI. > Most efficient will be to write the PI in the doorbell together with the queue number.Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout". Or just the PI if we don't need the queue number.> I would like to raise the need for a Completion Queue(CQ). > Multiple Work Queues(hold the work descriptors, WQ in short) can be connected to a single CQ. > So when the device completes the work on the descriptor, it writes a Completion Queue Entry (CQE) to the CQ.This seems similar to the design we currently have with a separate used ring. It seems to underperform writing into the available ring at least on a microbenchmark. The reason seems to be that more cache lines need to get invalidated and re-fetched.> CQEs are continuous in memory so prefetching by the driver is efficient, although the device might complete work descriptors out of order.That's not different from proposal you are replying to - descriptors can be used and written out in any order, they are contigious so driver can prefetch. I'd like to add that attempts to add prefetch e.g. for the virtio used ring never showed any conclusive gains - some workloads would get minor a speedup, others lose a bit of performance. I do not think anyone looked deeply into why that was the case. It would be easy for you to add this code to current virtio drivers and/or devices and try it out.> The interrupt handler is connected to the CQ, so an allocation of a single CQ per core, with a single interrupt handler is possible although this core might be using multiple WQs.Sending a single interrupt from multiple rings might save some cycles. event index/interrupt disable are currently in RAM so access is very cheap for the guest. If you are going to share, just disable all interrupts when you start processing. I was wondering how do people want to do this in hardware in fact. Are you going to read event index after each descriptor? It might make sense to move event index/flags into device memory. And once you do this, writing these out might become slower, and then some kind of sharing of the event index might help. No one suggested anything like this so far - maybe it's less of an issue than I think, e.g. because of interrupt coalescing (which virtio also does not have, but could be added if there is interest) or maybe people just mostly care about polling performance.> One application for multiple WQs with a single CQ is Quality of Service(QoS). > A user can open a WQ per QoS value(pcp value for example), and the device will schedule the work accordingly.A ring per QOS level might make sense e.g. in a network device. I don't see why it's helpful to write out completed entries into a separate ring for that though. As we don't have QOS support in virtio net at all right now, would that be best discussed once we have a working prototype and everyone can see what the problem is?> > * Scatter/gather support > > > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0: > > > > /* This marks a buffer as continuing via the next field. */ > > #define VRING_DESC_F_NEXT 1 > > > > Unlike virtio 1.0, all descriptors must have distinct ID values. > > > > Also unlike virtio 1.0, use of this flag will be an optional feature > > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it. > > > > * Indirect buffers > > > > Can be marked like in virtio 1.0: > > > > /* This means the buffer contains a table of buffer descriptors. */ > > #define VRING_DESC_F_INDIRECT 4 > > > > Unlike virtio 1.0, this is a table, not a list: > > struct indirect_descriptor_table { > > /* The actual descriptors (16 bytes each) */ > > struct virtq_desc desc[len / 16]; > > }; > > > > The first descriptor is located at start of the indirect descriptor > > table, additional indirect descriptors come immediately afterwards. > > DESC_F_WRITE is the only valid flag for descriptors in the indirect > > table. Others should be set to 0 and are ignored. id is also set to 0 > > and should be ignored. > > > > virtio 1.0 seems to allow a s/g entry followed by > > an indirect descriptor. This does not seem useful, > > so we do not allow that anymore. > > > > This support would be an optional feature, same as in virtio 1.0 > > > > * Batching descriptors: > > > > virtio 1.0 allows passing a batch of descriptors in both directions, by > > incrementing the used/avail index by values > 1. We can support this by > > chaining a list of descriptors through a bit the flags field. > > To allow use together with s/g, a different bit will be used. > > > > #define VRING_DESC_F_BATCH_NEXT 0x0010 > > > > Batching works for both driver and device descriptors. > > > > > > > > * Processing descriptors in and out of order > > > > Device processing all descriptors in order can simply flip > > the DESC_HW bit as it is done with descriptors. > > > > Device can write descriptors out in order as they are used, overwriting > > descriptors that are there. > > > > Device must not use a descriptor until DESC_HW is set. > > It is only required to look at the first descriptor > > submitted. > > > > Driver must not overwrite a descriptor until DESC_HW is clear. > > It is only required to look at the first descriptor > > submitted. > > > > * Device specific descriptor flags > > We have a lot of unused space in the descriptor. This can be put to > > good use by reserving some flag bits for device use. > > For example, network device can set a bit to request > > that header in the descriptor is suppressed > > (in case it's all 0s anyway). This reduces cache utilization. > > > > Note: this feature can be supported in virtio 1.0 as well, > > as we have unused bits in both descriptor and used ring there. > > > > * Descriptor length in device descriptors > > > > virtio 1.0 places strict requirements on descriptor length. For example > > it must be 0 in used ring of TX VQ of a network device since nothing is > > written. In practice guests do not seem to use this, so we can simplify > > devices a bit by removing this requirement - if length is unused it > > should be ignored by driver. > > > > Some devices use identically-sized buffers in all descriptors. > > Ignoring length for driver descriptors there could be an option too. > > > > * Writing at an offset > > > > Some devices might want to write into some descriptors > > at an offset, the offset would be in config space, > > and a descriptor flag could indicate this: > > > > #define VRING_DESC_F_OFFSET 0x0020 > > > > How exactly to use the offset would be device specific, > > for example it can be used to align beginning of packet > > in the 1st buffer for mergeable buffers to cache line boundary > > while also aligning rest of buffers. > > > > * Non power-of-2 ring sizes > > > > As the ring simply wraps around, there's no reason to > > require ring size to be power of two. > > It can be made a separate feature though. > > > > > > * Interrupt/event suppression > > > > virtio 1.0 has two mechanisms for suppression but only > > one can be used at a time. we pack them together > > in a structure - one for interrupts, one for notifications: > > > > struct event { > > __le16 idx; > > __le16 flags; > > } > > > > Both fields would be optional, with a feature bit: > > VIRTIO_F_EVENT_IDX > > VIRTIO_F_EVENT_FLAGS > > > > * Flags can be used like in virtio 1.0, by storing a special > > value there: > > > > #define VRING_F_EVENT_ENABLE 0x0 > > > > #define VRING_F_EVENT_DISABLE 0x1 > > > > * Event index would be in the range 0 to 2 * Queue Size > > (to detect wrap arounds) and wrap to 0 after that. > > > > The assumption is that each side maintains an internal > > descriptor counter 0 to 2 * Queue Size that wraps to 0. > > In that case, interrupt triggers when counter reaches > > the given value. > > > > * If both features are negotiated, a special flags value > > can be used to switch to event idx: > > > > #define VRING_F_EVENT_IDX 0x2 > > > > > > * Prototype > > > > A partial prototype can be found under > > tools/virtio/ringtest/ring.c > > > > Test run: > > [mst at tuck ringtest]$ time ./ring > > real 0m0.399s > > user 0m0.791s > > sys 0m0.000s > > [mst at tuck ringtest]$ time ./virtio_ring_0_9 > > real 0m0.503s > > user 0m0.999s > > sys 0m0.000s > > > > It is planned to update it to this interface. Future changes and > > enhancements can (and should) be tested against this prototype. > > > > * Feature sets > > In particular are we going overboard with feature bits? It becomes hard > > to support all combinations in drivers and devices. Maybe we should > > document reasonable feature sets to be supported for each device. > > > > * Known issues/ideas > > > > This layout is optimized for host/guest communication, > > in a sense even more aggressively than virtio 1.0. > > It might be suboptimal for PCI hardware implementations. > > However, one notes that current virtio pci drivers are > > unlikely to work with PCI hardware implementations anyway > > (e.g. due to use of SMP barriers for ordering). > > > > Suggestions for improving this are welcome but need to be tested to make > > sure our main use case does not regress. Of course some improvements > > might be made optional, but if we add too many of these driver becomes > > unmanageable. > > > > --- > > > > Note: should this proposal be accepted and approved, one or more > > claims disclosed to the TC admin and listed on the Virtio TC > > IPR page > > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fww > > w.oasis- > > open.org%2Fcommittees%2Fvirtio%2Fipr.php&data=02%7C01%7Cliorn%40m > > ellanox.com%7Cf41239019c1441e73b0308d4c7b0a483%7Ca652971c7d2e4d9 > > ba6a4d149256f461b%7C0%7C0%7C636353008872143792&sdata=L946V5o0P > > k8th%2B2IkHgvALmhnIEWD9hcMZvMxLetavc%3D&reserved=0 > > might become Essential Claims. > > > > -- > > MST > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org > > For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org
On 2017?09?10? 13:06, Michael S. Tsirkin wrote:> This is an update from v2 version. > Changes: > - update event suppression mechanism > - add wrap counter: DESC_WRAP flag in addition to > DESC_DRIVER flag used for validity so device does not have to > write out all used descriptors.Do we have benchmark result to show the advantage of DESC_DRIVER over e.g avail/used index?> - more options especially helpful for hardware implementations > - in-order processing option due to popular demand > - list of TODO items to consider as a follow-up, only two are > open questions we need to descide now, marked as blocker, > others are future enhancement ideas. > > --- > > Performance analysis of this is in my kvm forum 2016 presentation. > The idea is to have a r/w descriptor in a ring structure, > replacing the used and available ring, index and descriptor > buffer. > > Note: the following mode of operation will actually work > without changes when descriptor rings do not overlap, with driver > writing out available entries in a read-only driver descriptor ring, > device writing used entries in a write-only device descriptor ring. > > TODO: does this have any value for some? E.g. as a security feature? > > > * Descriptor ring: > > Driver writes descriptors with unique index values and DESC_DRIVER set in flags.You probably mean DESC_HW here?> Descriptors are written in a ring order: from start to end of ring, > wrapping around to the beginning. > Device writes used descriptors with correct len, index, and DESC_HW clear. > Again descriptors are written in ring order. This might not be the same > order of driver descriptors, and not all descriptors have to be written out. > > Driver and device are expected to maintain (internally) a wrap-around > bit, starting at 0 and changing value each time they start writing out > descriptors at the beginning of the ring. This bit is passed as > DESC_WRAP bit in the flags field.I'm not sure this is really needed, I think it could be done through checking vq.num?> > Flags are always set/cleared last. > > Note that driver can write descriptors out in any order, but device > will not execute descriptor X+1 until descriptor X has been > read as valid. > > Driver operation: > > Driver makes descriptors available to device by writing out descriptors > in the descriptor ring. Once ring is full, driver waits for device to > use some descriptors before making more available. > > Descriptors can be used by device in any order, but must be read from > ring in-order, and must be read completely before starting use. Thus, > once a descriptor is used, driver can over-write both this descriptor > and any descriptors which preceded it in the ring. > > Driver can detect use of descriptor either by device specific means > (e.g. detect a buffer data change by device) or in a generic way > by detecting that a used buffer has been written out by device. > > > Driver writes out available scatter/gather descriptor entries in guest > descriptor format: > > > #define DESC_WRAP 0x0040 > #define DESC_DRIVER 0x0080 > > struct desc { > __le64 addr; > __le32 len; > __le16 index; > __le16 flags; > }; > > Fields: > > addr - physical address of a s/g entry > len - length of an entry > index - unique index. The low $\lceil log(N) \rceil - 1$ > bits of the index is a driver-supplied value which can have any value > (under driver control). The high bits are reserved and should be > set to 0.Drivers usually have their own information ring, so I'm not sure exposing such flexibility is really needed. For completion the only hardware meaningful information is the index of the descriptor. And DESC_WRAP could be checked implicitly through index in desc < index of this descriptor. (Though I'm still not quite sure DESC_WRAP is needed).> > flags - descriptor flags. > > Descriptors written by driver have DESC_DRIVER set. > > Writing out this field makes the descriptor available for the device to use, > so all other fields must be valid when it is written. > > DESC_WRAP - device uses this field to detect descriptor change by driver.This looks a little bit confused, device in fact can check this through DESC_HW (or DESC_DRIVER) too?> > Driver can use 1 bit to set direction > /* This marks a descriptor as write-only (otherwise read-only). */ > #define VRING_DESC_F_WRITE 2 > > > Device operation (using descriptors): > > Device is looking for descriptors in ring order. After detecting that > the flags value has changed with DESC_DRIVER set and DESC_WRAP matching > the wrap-around counter, it can start using the descriptors. > Descriptors can be used in any order, but must be read from ring > in-order. In other words device must not read descriptor X after it > started using descriptor X+1. Further, all buffer descriptors must be > read completely before device starts using the buffer. > > This because once a descriptor is used, driver can over-write both this > descriptor and any preceeding descriptors in ring. > > To help driver detect use of descriptors and to pass extra meta-data > to driver, device writes out used entries in device descriptor format: > > > #define DESC_WRAP 0x0040 > #define DESC_DRIVER 0x0080 > > struct desc { > __le64 reserved; > __le32 len; > __le16 index; > __le16 flags; > }; > > Fields: > > reserved - can be any value, ignored by driver > len - length written by device. only valid if VRING_DESC_F_WRITE is set > len bytes of data from beginning of buffer are assumed to have been updated > index - unique index copied from the driver descriptor that has been used. > flags - descriptor flags. > > Descriptors written by device have DESC_DRIVER clear. > > Writing out this field notifies the driver that it can re-use the > descriptor id. It is also a signal that driver can over-write the > relevant descriptor (with the supplied id), and any other > > DESC_WRAP - driver uses this field to detect descriptor change by device. > > If device has used a buffer containing a write descriptor, it sets this bit: > #define VRING_DESC_F_WRITE 2 > > * Driver scatter/gather support > > Driver can use 1 bit to chain s/g entries in a request, similar to virtio 1.0: > > /* This marks a buffer as continuing in the next ring entry. */ > #define VRING_DESC_F_NEXT 1 > > When driver descriptors are chained in this way, multiple > descriptors are treated as a part of a single transaction > containing an optional write buffer followed by an optional read buffer. > All descriptors in the chain must have the same ID. > > Unlike virtio 1.0, use of this flag will be an optional feature > so both devices and drivers can opt out of it. > If they do, they can either negotiate indirect descriptors or use > single-descriptor entries exclusively for buffers. > > Device might detect a partial descriptor chain (VRING_DESC_F_NEXT > set but next descriptor not valid). In that case it must not > use any parts of the chain - it will later be completed by driver, > but device is allowed to store the valid parts of the chain as > driver is not allowed to change them anymore.Does it mean e.g device need to busy wait for complete chain if it found an incomplete one? Looks suboptimal.> > Two options are available: > > Device can write out the same number of descriptors for the chain, > setting VRING_DESC_F_NEXT for all but the last descriptor. > Driver will ignore all used descriptors with VRING_DESC_F_NEXT bit set. > > Device only writes out a single descriptor for the whole chain. > However, to keep device and driver in sync, it then skips a number of > descriptors corresponding to the length of the chain before writing out > the next used descriptor. > After detecting a used descriptor driver must find out the length of the > chain that it built in order to know where to look for the next > device descriptor. > > * Indirect buffers > > Indirect buffer descriptors is an optional feature. > These are always written by driver, not the device. > Indirect buffers have a special flag bit set - like in virtio 1.0: > > /* This means the buffer contains a table of buffer descriptors. */ > #define VRING_DESC_F_INDIRECT 4 > > VRING_DESC_F_WRITE and VRING_DESC_F_NEXT are always clear. > > len specifies the length of the indirect descriptor buffer in bytes > and must be a multiple of 16. > > Unlike virtio 1.0, the buffer pointed to is a table, not a list: > struct indirect_descriptor_table { > /* The actual descriptors (16 bytes each) */ > struct indirect_desc desc[len / 16]; > }; > > The first descriptor is located at start of the indirect descriptor > table, additional indirect descriptors come immediately afterwards. > > struct indirect_desc { > __le64 addr; > __le32 len; > __le16 reserved; > __le16 flags; > }; > > > DESC_F_WRITE is the only valid flag for descriptors in the indirect > table. Others should be set to 0 and are ignored. reserved field is > also set to 0 and should be ignored. > > TODO (blocker): virtio 1.0 allows a s/g entry followed by > an indirect descriptor. Is this useful? > > This support would be an optional feature, same as in virtio 1.0 > > * Batching descriptors: > > virtio 1.0 allows passing a batch of descriptors in both directions, by > incrementing the used/avail index by values > 1. > At the moment only batching of used descriptors is used. > > We can support this by chaining a list of device descriptors through > VRING_DESC_F_MORE flag. Device sets this bit to signal > driver that this is part of a batch of used descriptors > which are all part of a single transaction.If this is a part of a single transaction, I don't see obvious different with DESC_F_NEXT?). I thought for batching, each descriptor is independent and should belong to several different transactions. (E.g for net, each descriptor could be an independent packet).> > Driver might detect a partial descriptor chain (VRING_DESC_F_MORE > set but next descriptor not valid). In that case it must not > use any parts of the chain - it will later be completed by device, > but driver is allowed to store the valid parts of the chain as > device is not allowed to change them anymore. > > Descriptor should not have both VRING_DESC_F_MORE and > VRING_DESC_F_NEXT set. > > * Using descriptors in order > > Some devices can guarantee that descriptors are used in > the order in which they were made available. > This allows driver optimizations and can be negotiated through > a feature bit. > > * Per ring flags > > It is useful to support features for some rings but not others. > E.g. it's reasonable to use single buffers for RX rings but > sg or indirect for TX rings of the network device. > Generic configuration space will be extended so features can > be negotiated per ring. > > * Selective use of descriptors > > As described above, descriptors with NEXT bit set are part of a > scatter/gather chain and so do not have to cause device to write a used > descriptor out. > > Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to > signal to device that it does not have to write out the used descriptor > as it is part of a batch of descriptors. Device has two options (similar > to VRING_DESC_F_NEXT): > > Device can write out the same number of descriptors for the batch, > setting VRING_DESC_F_MORE for all but the last descriptor. > Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set. > > Device only writes out a single descriptor for the whole batch. > However, to keep device and driver in sync, it then skips a number of > descriptors corresponding to the length of the batch before writing out > the next used descriptor. > After detecting a used descriptor driver must find out the length of the > batch that it built in order to know where to look for the next > device descriptor.A silly question, how can driver find out the length of the batch effectively? Looks like it can only scan the ring until one that has DESC_HW cleared?> > > TODO (blocker): skipping descriptors for selective and > scatter/gather seem to be only requested with in-order right now. Let's > require in-order for this skipping? This will simplify the accounting > by driver. > > > * Interrupt/event suppression > > virtio 1.0 has two mechanisms for suppression but only > one can be used at a time. we pack them together > in a structure - one for interrupts, one for notifications: > > struct event { > __le16 idx; > __le16 flags; > } > > Both fields would be optional, with a feature bit: > VIRTIO_F_EVENT_IDX > VIRTIO_F_EVENT_FLAGS > > Flags can be used like in virtio 1.0, by storing a special > value there: > > #define VRING_F_EVENT_ENABLE 0x0 > > #define VRING_F_EVENT_DISABLE 0x1 > > Event index includes the index of the descriptor > which should trigger the event, and the wrap counter > in the high bit.Not specific to v3, but looks like with event index, we can't achieve interruptless or exitless consider idx may wrap.> > In that case, interrupt triggers when descriptor is written at a given > location in the ring (or skipped in case of NEXT/MORE). > > If both features are negotiated, a special flags value > can be used to switch to event idx: > > #define VRING_F_EVENT_IDX 0x2 > > * Available notification > > Driver currently writes out the queue number to device to > kick off ring processing. > > As queue number is currently 16 bit, we can extend that > to additionally include the offset within ring of the descriptor > which triggered the kick event in bits 16 to 30, > and the wrap counter in the high bit (31). > > Device is allowed to pre-fetch descriptors beyond the specified > offset but is not required to do so.With DESC_HW or other flag, prefetching may introduce extra overhead I think since it need to keep scan descriptor until DESC_HW is not set?> > > > * TODO: interrupt coalescing > > Does it make sense just for networking or for all devices? > If later should we make it a per ring or a global feature? > > > * TODO: event index/flags in device memory? > > Should we move the event index/flags to device memory? > Might be helpful for hardware configuration so they do not > need to do DMA reads to check whether interrupt is needed. > OTOH maybe interrupt coalescing is sufficient for this. > > > * TODO: Device specific descriptor flags > > We have a lot of unused space in the descriptor. This can be put to > good use by reserving some flag bits for device use. > For example, network device can set a bit to request > that header in the descriptor is suppressed > (in case it's all 0s anyway). This reduces cache utilization. > > Note: this feature can be supported in virtio 1.0 as well, > as we have unused bits in both descriptor and used ring there.I think we need try at least packing virtio-net header in the descriptor ring.> > * TODO: Descriptor length in device descriptors > > Some devices use identically-sized buffers in all descriptors. > Ignoring length for driver descriptors there could be an option too. > > * TODO: Writing at an offset > > Some devices might want to write into some descriptors > at an offset, the offset would be in reserved field in the descriptor, > possibly a descriptor flag could indicate this: > > #define VRING_DESC_F_OFFSET 0x0020 > > How exactly to use the offset would be device specific, > for example it can be used to align beginning of packet > in the 1st buffer for mergeable buffers to cache line boundary > while also aligning rest of buffers.May be even more e.g NET_SKB_PAD, then we could use build_skb() for Linux drivers.> > * TODO: Non power-of-2 ring sizes > > As the ring simply wraps around, there's no reason to > require ring size to be power of two. > It can be made a separate feature though. > > > TODO: limits on buffer alignment/size > > Seems to be useful for RX for networking. > Is there a need to negotiate above in a generic way > or is this a networking specific optimization? > > TODO: expose wrap counters to device for debugging > > TODO: expose last avail/used offsets to device/driver for debugging > > TODO: ability to reset individual ringsAny actual usage of this? Thanks> > --- > > Note: should this proposal be accepted and approved, one or more > claims disclosed to the TC admin and listed on the Virtio TC > IPR page https://www.oasis-open.org/committees/virtio/ipr.php > might become Essential Claims. > Note: the page above is unfortunately out of date and out of > my hands. I'm in the process of updating ipr disclosures > in github instead. Will make sure all is in place before > this proposal is put to vote. As usual this TC operates under the > Non-Assertion Mode of the OASIS IPR Policy, which protects > anyone implementing the virtio spec. >
On Sun, Sep 10, 2017 at 1:06 AM, Michael S. Tsirkin <mst at redhat.com> wrote:> This is an update from v2 version. > Changes: > - update event suppression mechanism > - add wrap counter: DESC_WRAP flag in addition to > DESC_DRIVER flag used for validity so device does not have to > write out all used descriptors. > - more options especially helpful for hardware implementations > - in-order processing option due to popular demand > - list of TODO items to consider as a follow-up, only two are > open questions we need to descide now, marked as blocker, > others are future enhancement ideas.Perhaps this would make a good topic for a BoF session at the upcoming netdev. A new ring structure can be useful elsewhere, too, such as for af_packet v4.> > --- > > Performance analysis of this is in my kvm forum 2016 presentation. > The idea is to have a r/w descriptor in a ring structure, > replacing the used and available ring, index and descriptor > buffer. > > Note: the following mode of operation will actually work > without changes when descriptor rings do not overlap, with driver > writing out available entries in a read-only driver descriptor ring, > device writing used entries in a write-only device descriptor ring.The ring is always read-write, as the consumer has to toggle the DESC_DRIVER flag, right? Which mode are you referring to?> TODO: does this have any value for some? E.g. as a security feature? > > > * Descriptor ring: > > Driver writes descriptors with unique index values and DESC_DRIVER set in flags. > Descriptors are written in a ring order: from start to end of ring, > wrapping around to the beginning. > Device writes used descriptors with correct len, index, and DESC_HW clear. > Again descriptors are written in ring order. This might not be the same > order of driver descriptors, and not all descriptors have to be written out.Virtio rings are used in both directions between guest device driver and host device, as well as in increasingly many situations outside vm i/o. I suggest using producer and consumer instead of driver and device when describing ring operations. When working on the virtio-net tx path, having to invert all the documentation in my head, because it is written from an rx point of view was a bit tricky ;-) I would then also convert DESC_DRIVER to DESC_VALID or so.> Driver and device are expected to maintain (internally) a wrap-around > bit, starting at 0 and changing value each time they start writing out > descriptors at the beginning of the ring. This bit is passed as > DESC_WRAP bit in the flags field.So, the flag effectively doubles the namespace of the id from 16 bit to 17 bit? Instead, how about using a larger identifier. Such as 32 bit. This also future proofs the design for cases where the ring may grow to exceed 65536 entries. Doing so is not a short term change, but it ould avoid the need for indirect descriptors and give greated room for out of order acknowledgement.> Flags are always set/cleared last. > > Note that driver can write descriptors out in any order, but device > will not execute descriptor X+1 until descriptor X has been > read as valid.Why this constraint on the ring?> Driver operation: > > Driver makes descriptors available to device by writing out descriptors > in the descriptor ring. Once ring is full, driver waits for device to > use some descriptors before making more available. > > Descriptors can be used by device in any order, but must be read from > ring in-order, and must be read completely before starting use. Thus, > once a descriptor is used, driver can over-write both this descriptor > and any descriptors which preceded it in the ring.Does this mean that completing a descriptor by the consumer implicitly completes all descriptors that precede it in the ring?> Driver can detect use of descriptor either by device specific means > (e.g. detect a buffer data change by device) or in a generic way > by detecting that a used buffer has been written out by device.I don't quite follow this.> Driver writes out available scatter/gather descriptor entries in guest > descriptor format: > > > #define DESC_WRAP 0x0040 > #define DESC_DRIVER 0x0080 > > struct desc { > __le64 addr; > __le32 len; > __le16 index; > __le16 flags; > }; > > Fields: > > addr - physical address of a s/g entry > len - length of an entryIs this ever larger than 16-bit? If not, then reducing to 16-bit allows growing index to 32-bit.> index - unique index. The low $\lceil log(N) \rceil - 1$ > bits of the index is a driver-supplied value which can have any value > (under driver control). The high bits are reserved and should be > set to 0. > > flags - descriptor flags. > > Descriptors written by driver have DESC_DRIVER set. > > Writing out this field makes the descriptor available for the device to use, > so all other fields must be valid when it is written. > > DESC_WRAP - device uses this field to detect descriptor change by driver. > > Driver can use 1 bit to set direction > /* This marks a descriptor as write-only (otherwise read-only). */ > #define VRING_DESC_F_WRITE 2This is a per-ring flag, as opposed to the per-descriptor flags DESC_*. Please make that explicit and state in which structure it is set.> > > Device operation (using descriptors): > > Device is looking for descriptors in ring order. After detecting that > the flags value has changed with DESC_DRIVER set and DESC_WRAP matching > the wrap-around counter, it can start using the descriptors. > Descriptors can be used in any order, but must be read from ring > in-order. In other words device must not read descriptor X after it > started using descriptor X+1.Why? This is the same question as above, really. This seems like a device constraint, not necessarily a constraint to impose on the ring format.> Further, all buffer descriptors must be > read completely before device starts using the buffer. > > This because once a descriptor is used, driver can over-write both this > descriptor and any preceeding descriptors in ring.This does explain the above constraint. I guess that I just do not understand the reason for this behavior.> To help driver detect use of descriptors and to pass extra meta-data > to driver, device writes out used entries in device descriptor format: > > > #define DESC_WRAP 0x0040 > #define DESC_DRIVER 0x0080 > > struct desc { > __le64 reserved; > __le32 len; > __le16 index; > __le16 flags; > }; > > Fields: > > reserved - can be any value, ignored by driver > len - length written by device. only valid if VRING_DESC_F_WRITE is set > len bytes of data from beginning of buffer are assumed to have been updated > index - unique index copied from the driver descriptor that has been used. > flags - descriptor flags. > > Descriptors written by device have DESC_DRIVER clear. > > Writing out this field notifies the driver that it can re-use the > descriptor id. It is also a signal that driver can over-write the > relevant descriptor (with the supplied id), and any other > > DESC_WRAP - driver uses this field to detect descriptor change by device. > > If device has used a buffer containing a write descriptor, it sets this bit: > #define VRING_DESC_F_WRITE 2 > > * Driver scatter/gather support > > Driver can use 1 bit to chain s/g entries in a request, similar to virtio 1.0: > > /* This marks a buffer as continuing in the next ring entry. */ > #define VRING_DESC_F_NEXT 1Isn't this a descriptor flag, so DESC_NEXT?> > When driver descriptors are chained in this way, multiple > descriptors are treated as a part of a single transaction > containing an optional write buffer followed by an optional read buffer.Can you elaborate on the last part about optional write and read buffer?> All descriptors in the chain must have the same ID.If so, then the explicit flag is not needed?> Unlike virtio 1.0, use of this flag will be an optional feature > so both devices and drivers can opt out of it. > If they do, they can either negotiate indirect descriptors or use > single-descriptor entries exclusively for buffers. > > Device might detect a partial descriptor chain (VRING_DESC_F_NEXT > set but next descriptor not valid).This can be forbidden, by requiring the producer to set the DESC_DRIVER bit on the first descriptor only after the entire chain has been written. Do chains have to consist of consecutive descriptors?> In that case it must not > use any parts of the chain - it will later be completed by driver, > but device is allowed to store the valid parts of the chain as > driver is not allowed to change them anymore. > > Two options are available: > > Device can write out the same number of descriptors for the chain, > setting VRING_DESC_F_NEXT for all but the last descriptor. > Driver will ignore all used descriptors with VRING_DESC_F_NEXT bit set. > > Device only writes out a single descriptor for the whole chain. > However, to keep device and driver in sync, it then skips a number of > descriptors corresponding to the length of the chain before writing out > the next used descriptor. > After detecting a used descriptor driver must find out the length of the > chain that it built in order to know where to look for the next > device descriptor. > > * Indirect buffers > > Indirect buffer descriptors is an optional feature. > These are always written by driver, not the device. > Indirect buffers have a special flag bit set - like in virtio 1.0: > > /* This means the buffer contains a table of buffer descriptors. */ > #define VRING_DESC_F_INDIRECT 4 > > VRING_DESC_F_WRITE and VRING_DESC_F_NEXT are always clear. > > len specifies the length of the indirect descriptor buffer in bytes > and must be a multiple of 16.Multiple of sizeof(struct indirect_desc). Also, struct indirect_desc is identical to struct desc. No need for a separate struct definition?> > Unlike virtio 1.0, the buffer pointed to is a table, not a list:This is just a linear array, right?> struct indirect_descriptor_table { > /* The actual descriptors (16 bytes each) */ > struct indirect_desc desc[len / 16]; > }; > > The first descriptor is located at start of the indirect descriptor > table, additional indirect descriptors come immediately afterwards. > > struct indirect_desc { > __le64 addr; > __le32 len; > __le16 reserved; > __le16 flags; > }; > > > DESC_F_WRITE is the only valid flag for descriptors in the indirect > table. Others should be set to 0 and are ignored. reserved field is > also set to 0 and should be ignored. > > TODO (blocker): virtio 1.0 allows a s/g entry followed by > an indirect descriptor. Is this useful?Sounds like unnecessary complexity.> This support would be an optional feature, same as in virtio 1.0 > > * Batching descriptors: > > virtio 1.0 allows passing a batch of descriptors in both directions, by > incrementing the used/avail index by values > 1. > At the moment only batching of used descriptors is used. > > We can support this by chaining a list of device descriptors through > VRING_DESC_F_MORE flag. Device sets this bit to signal > driver that this is part of a batch of used descriptors > which are all part of a single transaction. > > Driver might detect a partial descriptor chain (VRING_DESC_F_MORE > set but next descriptor not valid). In that case it must not > use any parts of the chain - it will later be completed by device, > but driver is allowed to store the valid parts of the chain as > device is not allowed to change them anymore. > > Descriptor should not have both VRING_DESC_F_MORE and > VRING_DESC_F_NEXT set. > > * Using descriptors in order > > Some devices can guarantee that descriptors are used in > the order in which they were made available. > This allows driver optimizations and can be negotiated through > a feature bit. > > * Per ring flags > > It is useful to support features for some rings but not others. > E.g. it's reasonable to use single buffers for RX rings but > sg or indirect for TX rings of the network device. > Generic configuration space will be extended so features can > be negotiated per ring. > > * Selective use of descriptors > > As described above, descriptors with NEXT bit set are part of a > scatter/gather chain and so do not have to cause device to write a used > descriptor out. > > Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to > signal to device that it does not have to write out the used descriptor > as it is part of a batch of descriptors. Device has two options (similar > to VRING_DESC_F_NEXT): > > Device can write out the same number of descriptors for the batch, > setting VRING_DESC_F_MORE for all but the last descriptor. > Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set. > > Device only writes out a single descriptor for the whole batch. > However, to keep device and driver in sync, it then skips a number of > descriptors corresponding to the length of the batch before writing out > the next used descriptor. > After detecting a used descriptor driver must find out the length of the > batch that it built in order to know where to look for the next > device descriptor. > > > TODO (blocker): skipping descriptors for selective and > scatter/gather seem to be only requested with in-order right now. Let's > require in-order for this skipping? This will simplify the accounting > by driver. > > > * Interrupt/event suppression > > virtio 1.0 has two mechanisms for suppression but only > one can be used at a time. we pack them together > in a structure - one for interrupts, one for notifications: > > struct event { > __le16 idx; > __le16 flags; > } > > Both fields would be optional, with a feature bit: > VIRTIO_F_EVENT_IDX > VIRTIO_F_EVENT_FLAGS > > Flags can be used like in virtio 1.0, by storing a special > value there: > > #define VRING_F_EVENT_ENABLE 0x0 > > #define VRING_F_EVENT_DISABLE 0x1 > > Event index includes the index of the descriptor > which should trigger the event, and the wrap counter > in the high bit. > > In that case, interrupt triggers when descriptor is written at a given > location in the ring (or skipped in case of NEXT/MORE). > > If both features are negotiated, a special flags value > can be used to switch to event idx: > > #define VRING_F_EVENT_IDX 0x2 > > * Available notification > > Driver currently writes out the queue number to device to > kick off ring processing. > > As queue number is currently 16 bit, we can extend that > to additionally include the offset within ring of the descriptor > which triggered the kick event in bits 16 to 30, > and the wrap counter in the high bit (31). > > Device is allowed to pre-fetch descriptors beyond the specified > offset but is not required to do so. > > > > * TODO: interrupt coalescing > > Does it make sense just for networking or for all devices? > If later should we make it a per ring or a global feature? > > > * TODO: event index/flags in device memory? > > Should we move the event index/flags to device memory? > Might be helpful for hardware configuration so they do not > need to do DMA reads to check whether interrupt is needed.Agreed. This also resembles physical devices more closely.> OTOH maybe interrupt coalescing is sufficient for this. > > > * TODO: Device specific descriptor flags > > We have a lot of unused space in the descriptor. This can be put to > good use by reserving some flag bits for device use. > For example, network device can set a bit to request > that header in the descriptor is suppressed > (in case it's all 0s anyway). This reduces cache utilization. > > Note: this feature can be supported in virtio 1.0 as well, > as we have unused bits in both descriptor and used ring there. > > * TODO: Descriptor length in device descriptors > > Some devices use identically-sized buffers in all descriptors. > Ignoring length for driver descriptors there could be an option too. > > * TODO: Writing at an offset > > Some devices might want to write into some descriptors > at an offset, the offset would be in reserved field in the descriptor, > possibly a descriptor flag could indicate this: > > #define VRING_DESC_F_OFFSET 0x0020 > > How exactly to use the offset would be device specific, > for example it can be used to align beginning of packet > in the 1st buffer for mergeable buffers to cache line boundary > while also aligning rest of buffers. > > * TODO: Non power-of-2 ring sizes > > As the ring simply wraps around, there's no reason to > require ring size to be power of two. > It can be made a separate feature though. > > > TODO: limits on buffer alignment/size > > Seems to be useful for RX for networking. > Is there a need to negotiate above in a generic way > or is this a networking specific optimization? > > TODO: expose wrap counters to device for debugging > > TODO: expose last avail/used offsets to device/driver for debugging > > TODO: ability to reset individual rings > > --- > > Note: should this proposal be accepted and approved, one or more > claims disclosed to the TC admin and listed on the Virtio TC > IPR page https://www.oasis-open.org/committees/virtio/ipr.php > might become Essential Claims. > Note: the page above is unfortunately out of date and out of > my hands. I'm in the process of updating ipr disclosures > in github instead. Will make sure all is in place before > this proposal is put to vote. As usual this TC operates under the > Non-Assertion Mode of the OASIS IPR Policy, which protects > anyone implementing the virtio spec. > > -- > MST > > --------------------------------------------------------------------- > To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org > For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org >
On Mon, Sep 11, 2017 at 3:47 AM, Jason Wang <jasowang at redhat.com> wrote:> > > On 2017?09?10? 13:06, Michael S. Tsirkin wrote: >> >> This is an update from v2 version. >> Changes: >> - update event suppression mechanism >> - add wrap counter: DESC_WRAP flag in addition to >> DESC_DRIVER flag used for validity so device does not have to >> write out all used descriptors. > > > Do we have benchmark result to show the advantage of DESC_DRIVER over e.g > avail/used index?The KVM forum presentation has some numbers. I'm not sure that synthetic benchmarks will provide much value, as we understand the trade-off quite well. The benefit of this model is improved best case performance, by having a single cacheline read instead of two for the indirect used/avail ring model. The drawback is worse worst case, as scanning the ring of descriptors introduces more cacheline misses than scanning the compressed used/avail ring. This model is easier to implement in hardware and the common case is likely close to the best case, so I think it makes sense.
> -----Original Message----- > From: virtualization-bounces at lists.linux-foundation.org > [mailto:virtualization-bounces at lists.linux-foundation.org] On Behalf Of > Michael S. Tsirkin > Sent: Sunday, September 10, 2017 8:06 AM > To: virtio-dev at lists.oasis-open.org > Cc: virtualization at lists.linux-foundation.org > Subject: packed ring layout proposal v3 > > This is an update from v2 version....> When driver descriptors are chained in this way, multiple descriptors are > treated as a part of a single transaction containing an optional write buffer > followed by an optional read buffer. > All descriptors in the chain must have the same ID. >... I think you should consider removing the "same ID" requirement. Assuming out of order execution, how is the driver supposed to re-assign unique IDs to the previously chained descriptor? Do you expected driver to copy original IDs somewhere else before the chaining and then restore them after the chain is executed? Thanks, Ilya
Hi Michael,> -----Original Message----- > From: virtio-dev at lists.oasis-open.org [mailto:virtio-dev at lists.oasis-open.org] > On Behalf Of Michael S. Tsirkin > Sent: Sunday, September 10, 2017 1:06 PM > To: virtio-dev at lists.oasis-open.org > Cc: virtualization at lists.linux-foundation.org > Subject: [virtio-dev] packed ring layout proposal v3 >[...]> * Descriptor ring: > > Driver writes descriptors with unique index values and DESC_DRIVER set in > flags. > Descriptors are written in a ring order: from start to end of ring, wrapping > around to the beginning. > Device writes used descriptors with correct len, index, and DESC_HW clear. > Again descriptors are written in ring order. This might not be the same order > of driver descriptors, and not all descriptors have to be written out. > > Driver and device are expected to maintain (internally) a wrap-around bit, > starting at 0 and changing value each time they start writing out descriptors > at the beginning of the ring. This bit is passed as DESC_WRAP bit in the flags > field.One simple question there, trying to understand the usage of DESC_WRAP flag. DESC_WRAP bit is a new flag since v2. It's used to address 'non power-of-2 ring sizes' mentioned in v2? Being confused by the statement of wrap-around bit here, it's an internal wrap-around counter represented by single bit (0/1)? DESC_WRAP can appear on any descriptor entry in the ring, why it highlights changing value at the beginning of the ring?> > Flags are always set/cleared last. > > Note that driver can write descriptors out in any order, but device will not > execute descriptor X+1 until descriptor X has been read as valid. > > Driver operation: >[...]> > DESC_WRAP - device uses this field to detect descriptor change by driver.Device uses this field to detect change of wrap-around boundary by driver? [...]> > Device operation (using descriptors): >[...]> > DESC_WRAP - driver uses this field to detect descriptor change by device.Driver uses this field to detect change of wrap-around boundary by device? By using this, driver doesn't need to maintain any internal wrap-around count, but being aware of wrap-around by DESC_WRAP flag. Thanks, Steve>[...]> > --- > > Note: should this proposal be accepted and approved, one or more > claims disclosed to the TC admin and listed on the Virtio TC > IPR page https://www.oasis-open.org/committees/virtio/ipr.php > might become Essential Claims. > Note: the page above is unfortunately out of date and out of > my hands. I'm in the process of updating ipr disclosures > in github instead. Will make sure all is in place before > this proposal is put to vote. As usual this TC operates under the > Non-Assertion Mode of the OASIS IPR Policy, which protects > anyone implementing the virtio spec. > > -- > MST > > --------------------------------------------------------------------- > To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org > For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org
Hi,> -----Original Message----- > From: virtio-dev at lists.oasis-open.org [mailto:virtio-dev at lists.oasis-open.org] On > Behalf Of Michael S. Tsirkin > Sent: Sunday, September 10, 2017 1:06 PM > To: virtio-dev at lists.oasis-open.org > Cc: virtualization at lists.linux-foundation.org > Subject: [virtio-dev] packed ring layout proposal v3 >[...]> * Batching descriptors: > > virtio 1.0 allows passing a batch of descriptors in both directions, by incrementing > the used/avail index by values > 1. > At the moment only batching of used descriptors is used. > > We can support this by chaining a list of device descriptors through > VRING_DESC_F_MORE flag. Device sets this bit to signal driver that this is part of > a batch of used descriptors which are all part of a single transaction.It supposes each s/g chain represents for a packet, while each descriptor among batching chain represents for a packet. There're a few thoughts of batching chain(by VRING_DESC_F_MORE) and s/g chain(by VRING_DESC_F_NEXT). - batching chain: It's up to device to coalesce the write-out of a batched used descriptors. As the batching size is variable, driver has to detect validity of each descriptor unless the number of incoming valid descriptor is predictable, being curious on the benefits of driver from VRING_DESC_F_MORE to take batching descriptors as a single transaction. On device perspective, it's great to write out one descriptor for the whole chain. However, it assumes no other useful fields in each descriptor of chain needs to write. TX buffer reclaiming can be the candidate, while RX side has to update 'len' at least. Even for this purpose, instead of writing out VRING_DESC_F_MORE on a few descriptors to suppress device writing back, it's cheaper to set flag (e.g. VRING_DESC_F_WB) on single descriptor of chain to hint the expected position for device to write back. - s/g chain: It makes sense to indicate the packet boundary. Considering in-order descriptor ring without VRING_DESC_F_INDIRECT, the next descriptor always belongs to the same s/g chain until end of packet indicators occur. So one alternative approach is only to set a flag (e.g. VRING_DESC_F_EOP) on the last descriptor of the chain.> > Driver might detect a partial descriptor chain (VRING_DESC_F_MORE set but next > descriptor not valid). In that case it must not use any parts of the chain - it will > later be completed by device, but driver is allowed to store the valid parts of the > chain as device is not allowed to change them anymore.As each descriptor represent for a whole packet(otherwise it's s/g chain), wondering why it must not use any parts of the chain.> > Descriptor should not have both VRING_DESC_F_MORE and > VRING_DESC_F_NEXT set. >[...]> > * Selective use of descriptors > > As described above, descriptors with NEXT bit set are part of a scatter/gather > chain and so do not have to cause device to write a used descriptor out. > > Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to signal to > device that it does not have to write out the used descriptor as it is part of a batch > of descriptors. Device has two options (similar to VRING_DESC_F_NEXT): > > Device can write out the same number of descriptors for the batch, setting > VRING_DESC_F_MORE for all but the last descriptor. > Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.It will write out last descriptor without VRING_DESC_F_MORE anyway, so the following statement seems not like another option.> > Device only writes out a single descriptor for the whole batch. > However, to keep device and driver in sync, it then skips a number of descriptors > corresponding to the length of the batch before writing out the next used > descriptor. > After detecting a used descriptor driver must find out the length of the batch that > it built in order to know where to look for the next device descriptor.It would be good to keep it simple on device side, and to have the driver control the expectation.> > > TODO (blocker): skipping descriptors for selective and scatter/gather seem to be > only requested with in-order right now. Let's require in-order for this skipping? > This will simplify the accounting by driver. > >Thanks, Steve
On Wed, Sep 20, 2017 at 09:11:57AM +0000, Liang, Cunming wrote:> Hi Michael, > > > -----Original Message----- > > From: virtio-dev at lists.oasis-open.org [mailto:virtio-dev at lists.oasis-open.org] > > On Behalf Of Michael S. Tsirkin > > Sent: Sunday, September 10, 2017 1:06 PM > > To: virtio-dev at lists.oasis-open.org > > Cc: virtualization at lists.linux-foundation.org > > Subject: [virtio-dev] packed ring layout proposal v3 > > > [...] > > * Descriptor ring: > > > > Driver writes descriptors with unique index values and DESC_DRIVER set in > > flags. > > Descriptors are written in a ring order: from start to end of ring, wrapping > > around to the beginning. > > Device writes used descriptors with correct len, index, and DESC_HW clear. > > Again descriptors are written in ring order. This might not be the same order > > of driver descriptors, and not all descriptors have to be written out. > > > > Driver and device are expected to maintain (internally) a wrap-around bit, > > starting at 0 and changing value each time they start writing out descriptors > > at the beginning of the ring. This bit is passed as DESC_WRAP bit in the flags > > field. > > One simple question there, trying to understand the usage of DESC_WRAP flag. > > DESC_WRAP bit is a new flag since v2. It's used to address 'non power-of-2 ring sizes' mentioned in v2? > > Being confused by the statement of wrap-around bit here, it's an internal wrap-around counter represented by single bit (0/1)? > DESC_WRAP can appear on any descriptor entry in the ring, why it highlights changing value at the beginning of the ring?No, this is necessary if not all descriptors are overwritten by device after they are used. Each time driver overwrites a descriptor, the value in DESC_WRAP changes which makes it possible for device to detect that there's a new descriptor.> > > > Flags are always set/cleared last. > > > > Note that driver can write descriptors out in any order, but device will not > > execute descriptor X+1 until descriptor X has been read as valid. > > > > Driver operation: > > > [...] > > > > DESC_WRAP - device uses this field to detect descriptor change by driver. > > Device uses this field to detect change of wrap-around boundary by driver? > > [...] > > > > Device operation (using descriptors): > > > [...] > > > > DESC_WRAP - driver uses this field to detect descriptor change by device. > > Driver uses this field to detect change of wrap-around boundary by device? > > By using this, driver doesn't need to maintain any internal wrap-around count, but being aware of wrap-around by DESC_WRAP flag. > > > Thanks, > SteveSo v2 simply said descriptor has a single bit: driver writes 1 there, device writes 0. This requires device to overwrite each descriptor and people asked for a way to communicate where some descriptors are not overwritten. This new bit helps device distinguish new and old descriptors written by driver.> > > [...] > > > > --- > > > > Note: should this proposal be accepted and approved, one or more > > claims disclosed to the TC admin and listed on the Virtio TC > > IPR page https://www.oasis-open.org/committees/virtio/ipr.php > > might become Essential Claims. > > Note: the page above is unfortunately out of date and out of > > my hands. I'm in the process of updating ipr disclosures > > in github instead. Will make sure all is in place before > > this proposal is put to vote. As usual this TC operates under the > > Non-Assertion Mode of the OASIS IPR Policy, which protects > > anyone implementing the virtio spec. > > > > -- > > MST > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org > > For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org
On Thu, Sep 21, 2017 at 01:36:37PM +0000, Liang, Cunming wrote:> Hi, > > > -----Original Message----- > > From: virtio-dev at lists.oasis-open.org [mailto:virtio-dev at lists.oasis-open.org] On > > Behalf Of Michael S. Tsirkin > > Sent: Sunday, September 10, 2017 1:06 PM > > To: virtio-dev at lists.oasis-open.org > > Cc: virtualization at lists.linux-foundation.org > > Subject: [virtio-dev] packed ring layout proposal v3 > > > [...] > > * Batching descriptors: > > > > virtio 1.0 allows passing a batch of descriptors in both directions, by incrementing > > the used/avail index by values > 1. > > At the moment only batching of used descriptors is used. > > > > We can support this by chaining a list of device descriptors through > > VRING_DESC_F_MORE flag. Device sets this bit to signal driver that this is part of > > a batch of used descriptors which are all part of a single transaction. > > It supposes each s/g chain represents for a packet, while each descriptor among batching chain represents for a packet. There're a few thoughts of batching chain(by VRING_DESC_F_MORE) and s/g chain(by VRING_DESC_F_NEXT). > > - batching chain: It's up to device to coalesce the write-out of a batched used descriptors. As the batching size is variable, driver has to detect validity of each descriptor unless the number of incoming valid descriptor is predictable, being curious on the benefits of driver from VRING_DESC_F_MORE to take batching descriptors as a single transaction. On device perspective, it's great to write out one descriptor for the whole chain. However, it assumes no other useful fields in each descriptor of chain needs to write. TX buffer reclaiming can be the candidate, while RX side has to update 'len' at least. Even for this purpose, instead of writing out VRING_DESC_F_MORE on a few descriptors to suppress device writing back, it's cheaper to set flag (e.g. VRING_DESC_F_WB) on single descriptor of chain to hint the expected position for device to write back.But driver does not really benefit from batching and does not know how many to batch, this depends on device. E.g. a software device does not need batching at all, a pci express device would want batches to be multiples of what fits in a pci express transaction, etc. We would have to provide that info from device to driver.> - s/g chain: It makes sense to indicate the packet boundary. Considering in-order descriptor ring without VRING_DESC_F_INDIRECT, the next descriptor always belongs to the same s/g chain until end of packet indicators occur. So one alternative approach is only to set a flag (e.g. VRING_DESC_F_EOP) on the last descriptor of the chain.EOP would be the reverse of NEXT then? I think it does not matter much, but NEXT matches what is there in virtio 1.0 right now. It also means that simple implementations with short buffers can have flags set to 0 which seems cleaner.> > > > Driver might detect a partial descriptor chain (VRING_DESC_F_MORE set but next > > descriptor not valid). In that case it must not use any parts of the chain - it will > > later be completed by device, but driver is allowed to store the valid parts of the > > chain as device is not allowed to change them anymore. > As each descriptor represent for a whole packet(otherwise it's s/g chain),For RX mergeable buffers, a packet is composed of multiple s/g chains.> wondering why it must not use any parts of the chain.This is to match what is there in virtio 1.0 right now: driver does not touch any used descriptors until the used index is updated.> > > > Descriptor should not have both VRING_DESC_F_MORE and > > VRING_DESC_F_NEXT set. > > > [...] > > > > * Selective use of descriptors > > > > As described above, descriptors with NEXT bit set are part of a scatter/gather > > chain and so do not have to cause device to write a used descriptor out. > > > > Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to signal to > > device that it does not have to write out the used descriptor as it is part of a batch > > of descriptors. Device has two options (similar to VRING_DESC_F_NEXT): > > > > Device can write out the same number of descriptors for the batch, setting > > VRING_DESC_F_MORE for all but the last descriptor. > > Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set. > It will write out last descriptor without VRING_DESC_F_MORE anyway, so the following statement seems not like another option.I don't understand this statement. All I said is that it's up to device whether to write out the descriptors with VRING_DESC_F_MORE, or to skip the write out.> > > > Device only writes out a single descriptor for the whole batch. > > However, to keep device and driver in sync, it then skips a number of descriptors > > corresponding to the length of the batch before writing out the next used > > descriptor. > > After detecting a used descriptor driver must find out the length of the batch that > > it built in order to know where to look for the next device descriptor. > It would be good to keep it simple on device side, and to have the driver control the expectation.I'm not sure what above means either. That is exactly what above proposal says: device simply writes out a single descriptor. Driver has to keep track and know where the next one will be written. Example Driver writes two pairs chained with MORE: 0 + 1, 2 + 3 Device writes: 0 and 3> > > > > > TODO (blocker): skipping descriptors for selective and scatter/gather seem to be > > only requested with in-order right now. Let's require in-order for this skipping? > > This will simplify the accounting by driver. > > > > > > Thanks, > Steve
> > -----Original Message----- > > From: virtualization-bounces at lists.linux-foundation.org > > [mailto:virtualization-bounces at lists.linux-foundation.org] On Behalf > > Of Michael S. Tsirkin > > > > This is an update from v2 version. >> ... > > When driver descriptors are chained in this way, multiple descriptors > > are treated as a part of a single transaction containing an optional > > write buffer followed by an optional read buffer. > > All descriptors in the chain must have the same ID. > >I apologize for the repost, I didn't realize I have to be a member of the virtio-dev mailing list. I'm concerned about the "same ID" requirement in chained descriptors. Assuming out of order execution, how is the driver supposed to re-assign unique IDs to the previously chained descriptor? Is the driver expected to copy original IDs somewhere else before the chaining and then restore the IDs after the chain is executed? Thanks, Ilya
On Sun, Oct 08, 2017 at 06:16:44AM +0000, Ilya Lesokhin wrote:> > > -----Original Message----- > > > From: virtualization-bounces at lists.linux-foundation.org > > > [mailto:virtualization-bounces at lists.linux-foundation.org] On Behalf > > > Of Michael S. Tsirkin > > > > > > This is an update from v2 version. > >> ... > > > When driver descriptors are chained in this way, multiple descriptors > > > are treated as a part of a single transaction containing an optional > > > write buffer followed by an optional read buffer. > > > All descriptors in the chain must have the same ID. > > > > > I apologize for the repost, I didn't realize I have to be a member of the > virtio-dev mailing list. > > I'm concerned about the "same ID" requirement in chained descriptors.It's there really just so we can remove the doubt about which descriptor's ID should be used. My testing does not show a performance win from this, so I'm fine with removing this requirement though I'd be curious to know why is it a problem.> Assuming out of order execution, how is the driver supposed to re-assign > unique IDs to the previously chained descriptor?For example, driver can have a simple allocator for the IDs.> Is the driver expected to copy original IDs somewhere else before the > chaining and then restore the IDs after the chain is executed? > > Thanks, > IlyaAs device overwrites the ID, driver will have to write it out each time, that's true. It's going to be a requirement even if descriptors on the chain do not need to have the same ID. -- MST