Virtio BoF minutes, KVM Forum 2017

Attendees: Amnon Ilan, Maxime Coqueline, Vlad Yasevich, Malcolm Crossley,
David Vrabel, Ilya Lesokhin, Cunming Lian, Jens Freimann

Topics: packed ring layout with respect to hardware implementations

References:
https://lists.oasis-open.org/archives/virtio-dev/201702/msg00010.html
https://lists.oasis-open.org/archives/virtio-dev/201709/msg00013.html

Malcolm Crossley, David Vrabel:
- keep in mind not to optimize only for networking with small frame
  sizes. Storage has much larger sizes
- is there really no cache line ping-pong, given that we are
  overwriting the same cache line? With 4 descriptors in one line,
  accessing two at the same time will cause cache coherency messages,
  no?
- interesting quirk: we flip a single bit, but Intel doesn't support
  writing single bytes, it will always be a full dword. Will that be a
  problem?
- interesting to look into NVMe protocols, they seem to solve some of
  the same problems hardware-wise
- VMware vmxnet3 has a separate data ring for when they have bigger
  amounts of data. Not to copy, but still interesting

Steve:
- is the _MORE flag from the packed ring layout proposal still in use?
  What is its meaning?

Ilya:
- you might have more completions than descriptors available
- partial descriptor chains are a problem for hardware because you
  might have to read a bunch of descriptors twice
- how would you deal with a big buffer that contains a large number of
  small packets with respect to completions?
- is one bit for completion enough? Right now it means the descriptor
  was actually used. How do we signal when it was completed?
- concerned about not being able to do scatter/gather with the ring
  layout. Network drivers heavily use indirect buffers.
- for a hardware implementation a completion ring is a very convenient
  form for some use cases, so we want an efficient implementation for
  them. If we had an inline descriptor then a completion ring is just
  a normal ring and we won't need another ring type.
- doesn't like the fact that we need to do a linear scan to find the
  length of a descriptor chain. It would be nice if we could have the
  length of the chain in the first descriptor (i.e. the number of
  chained descriptors, not the number of posted descriptors, which can
  be deduced from the id field)

Vlad:
- there were discussions about having a bigger descriptor. Then we
  would have more space to put things like a vnet header into the
  descriptor. It would also mean fewer conflicts when accessing the
  same cache line. (descriptors already grew to 16 bytes, do we need
  more?)
- was playing around with the idea of different ring types for
  different devices, e.g. scsi, net: start with generic information,
  followed by protocol-specific data. Ilya agrees. The length of a
  descriptor would be made flexible by adding a descriptor length
  field.

How to continue / TODOs:
- do benchmarking with bigger frame sizes on fast enough NICs
- turn prototype code into an RFC series (work in progress)
- more people interested in joining the monthly meetings

Open questions:
- Do we need an (optional) completion ring?
- Is there a situation where 4 descriptors in a cache line is a
  problem because we access the same cache line, causing cache
  ping-pong?
- Interrupt suppression requires the device to do a memory read after
  writing out descriptors. Will that be too costly? Let the driver
  write out the index?

regards
Jens

On Sun, Oct 29, 2017 at 01:52:25PM +0100, Jens Freimann wrote:
> Ilya:
> - you might have more completions than descriptors available
> - partial descriptor chains are a problem for hardware because you
>   might have to read a bunch of descriptors twice
> - how would you deal with a big buffer that contains a large number
>   of small packets with respect to completions?
> - is one bit for completion enough? Right now it means the descriptor
>   was actually used. How do we signal when it was completed?

I am not sure I understand the difference. Under virtio, the driver
makes a descriptor available, then the device reads/writes memory
depending on the descriptor type, then marks it as used.

What does completed mean?

> - concerned about not being able to do scatter/gather with the ring
>   layout. Network drivers heavily use indirect buffers.
> - for a hardware implementation a completion ring is a very
>   convenient form for some use cases, so we want an efficient
>   implementation for them. If we had an inline descriptor then a
>   completion ring is just a normal ring and we won't need another
>   ring type.
> - doesn't like the fact that we need to do a linear scan to find the
>   length of a descriptor chain. It would be nice if we could have the
>   length of the chain in the first descriptor (i.e. the number of
>   chained descriptors, not the number of posted descriptors, which
>   can be deduced from the id field)

Not responding to the rest of the points since I don't understand the
basic assumption above yet.

--
MST

On Wednesday, November 01, 2017 4:59 PM, Michael S. Tsirkin wrote:
> On Sun, Oct 29, 2017 at 01:52:25PM +0100, Jens Freimann wrote:
> > Ilya:
> > - you might have more completions than descriptors available
> > - partial descriptor chains are a problem for hardware because you
> >   might have to read a bunch of descriptors twice
> > - how would you deal with a big buffer that contains a large
> >   number of small packets with respect to completions?
> > - is one bit for completion enough? Right now it means the
> >   descriptor was actually used. How do we signal when it was
> >   completed?
>
> I am not sure I understand the difference. Under virtio, the driver
> makes a descriptor available, then the device reads/writes memory
> depending on the descriptor type, then marks it as used.
>
> What does completed mean?
>

During the BoF, someone raised the point that there is no indication
that the HW has read the descriptor. I think after some discussion we
agreed that it's not a useful indication.

My issues with the current completion or used notifications are as
follows:

1. There is no room for extra metadata such as a checksum or flow tag.
You could put that in the descriptor payload, but it's somewhat
inconvenient. You have to either use an additional descriptor for
metadata per chain, or put it in one of the buffers, forcing the
lifetime of the metadata and the data to be the same.

2. The current format assumes a 1-1 correspondence between descriptors
and completions. You did offer a skipping optimization for many
descriptors -> 1 completion, but it is somewhat inefficient. And you
didn't offer a solution for 1 descriptor -> multiple completions.
Mellanox has a feature called striding RQ where you post a large
buffer and the NIC fills it with multiple back-to-back packets with
padding. Each packet generates its own completion.

3. There is a usage model where you have multiple producer rings and a
single completion ring.
You could implement the completion ring using an additional virtio
ring, but the current model will require an extra indirection, as it
forces you to write into the buffers that the descriptors in the
completion ring point to, rather than writing the completion into the
ring itself. Additionally, the device is still required to write to
the original producer ring in addition to the completion ring.

I think the best and most flexible design is to have variable-size
descriptors that start with a dword header. The dword header will
include an ownership bit, an opcode and the descriptor length. The
opcode and the "length" dwords following the header will be device
specific. The meaning of the owner bit changes on each ring wrap-around,
so the device doesn't need to update it.

Each device (or device class) can choose whether completions are
reported directly inside the descriptors in that ring or in a separate
completion ring.

Completion rings can be implemented in an efficient manner with this
design. The driver will initialize a dedicated completion ring with
empty completion-sized descriptors, and the device will write the
completions directly into the ring.
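
To make the proposal above more concrete, here is a minimal C sketch of
a completion ring whose entries start with a dword header carrying an
owner bit, an opcode and a length. The bit positions, the 16-byte entry
size and every name used (cqe, cq_cursor, hdr_pack, cq_device_write,
cq_driver_poll, ...) are assumptions made for illustration only; none of
them come from the virtio spec or from an existing driver. Entries are
fixed-size to keep the example short, and flow control and memory
barriers are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout: bit 31 owner, bits 30..24 opcode,
 * bits 23..0 number of dwords following the header. */
#define HDR_OWNER_BIT    (1u << 31)
#define HDR_OPCODE_SHIFT 24
#define HDR_OPCODE_MASK  0x7Fu
#define HDR_LEN_MASK     0x00FFFFFFu

#define CQE_BODY_DWORDS 3u /* assumed: 16-byte completion entries */
#define CQ_ENTRIES      4u /* assumed ring size */

struct cqe {
    uint32_t hdr;
    uint32_t body[CQE_BODY_DWORDS]; /* device-specific payload */
};

/* Each side keeps a private cursor. 'owner' is the owner-bit value that
 * marks an entry as written during the current pass over the ring. */
struct cq_cursor {
    uint32_t idx;
    bool owner;
};

static uint32_t hdr_pack(bool owner, uint32_t opcode, uint32_t len_dwords)
{
    assert(opcode <= HDR_OPCODE_MASK);
    assert(len_dwords <= HDR_LEN_MASK);
    return (owner ? HDR_OWNER_BIT : 0u) |
           (opcode << HDR_OPCODE_SHIFT) | len_dwords;
}

static bool hdr_owner(uint32_t hdr) { return (hdr & HDR_OWNER_BIT) != 0; }
static uint32_t hdr_opcode(uint32_t hdr)
{
    return (hdr >> HDR_OPCODE_SHIFT) & HDR_OPCODE_MASK;
}
static uint32_t hdr_len(uint32_t hdr) { return hdr & HDR_LEN_MASK; }

/* Driver initializes the ring with empty entries (owner bit 0), so the
 * first pass uses owner bit 1 to mean "written by the device". */
static void cq_init(struct cqe *ring, struct cq_cursor *dev,
                    struct cq_cursor *drv)
{
    memset(ring, 0, CQ_ENTRIES * sizeof(*ring));
    dev->idx = drv->idx = 0;
    dev->owner = drv->owner = true;
}

/* On wrap-around the meaning of the owner bit flips, so stale entries
 * from the previous pass look "not yet written" without being cleared. */
static void cursor_advance(struct cq_cursor *c)
{
    if (++c->idx == CQ_ENTRIES) {
        c->idx = 0;
        c->owner = !c->owner;
    }
}

/* Device side: write the completion directly into the ring. */
static void cq_device_write(struct cqe *ring, struct cq_cursor *dev,
                            uint32_t opcode,
                            const uint32_t body[CQE_BODY_DWORDS])
{
    struct cqe *e = &ring[dev->idx];

    memcpy(e->body, body, sizeof(e->body));
    /* Real hardware or a real driver needs a write barrier here so the
     * body becomes visible before the header. */
    e->hdr = hdr_pack(dev->owner, opcode, CQE_BODY_DWORDS);
    cursor_advance(dev);
}

/* Driver side: returns true and copies the entry out if one is ready. */
static bool cq_driver_poll(const struct cqe *ring, struct cq_cursor *drv,
                           struct cqe *out)
{
    const struct cqe *e = &ring[drv->idx];

    if (hdr_owner(e->hdr) != drv->owner)
        return false; /* not written during this pass yet */
    *out = *e;
    cursor_advance(drv);
    return true;
}
```

With fixed-size entries, every header sits at the same offset on every
pass, so a stale payload dword can never be mistaken for a header. A
truly variable-size layout would need an extra rule to guarantee that,
which this sketch does not attempt.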