Stefano Garzarella
2023-May-03 12:13 UTC
[PATCH RFC net-next v2 0/4] virtio/vsock: support datagrams
On Sat, Apr 15, 2023 at 03:55:05PM +0000, Bobby Eshleman wrote:
>On Fri, Apr 28, 2023 at 12:43:09PM +0200, Stefano Garzarella wrote:
>> On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote:
>> > CC'ing virtio-dev at lists.oasis-open.org because this thread is starting
>> > to touch the spec.
>> >
>> > On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote:
>> > > Hi Bobby,
>> > >
>> > > On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote:
>> > > > CC'ing Cong.
>> > > >
>> > > > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
>> > > > > Hey all!
>> > > > >
>> > > > > This series introduces support for datagrams to virtio/vsock.
>> > >
>> > > Great! Thanks for restarting this work!
>> > >
>> >
>> > No problem!
>> >
>> > > > >
>> > > > > It is a spin-off (and smaller version) of this series from the summer:
>> > > > > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman at bytedance.com/
>> > > > >
>> > > > > Please note that this is an RFC and should not be merged until
>> > > > > associated changes are made to the virtio specification, which will
>> > > > > follow after discussion from this series.
>> > > > >
>> > > > > This series first supports datagrams in a basic form for virtio, and
>> > > > > then optimizes the sendpath for all transports.
>> > > > >
>> > > > > The result is a very fast datagram communication protocol that
>> > > > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
>> > > > > of multi-threaded workload samples.
>> > > > >
>> > > > > For those that are curious, some summary data comparing UDP and VSOCK
>> > > > > DGRAM (N=5):
>> > > > >
>> > > > > vCPUS: 16
>> > > > > virtio-net queues: 16
>> > > > > payload size: 4KB
>> > > > > Setup: bare metal + vm (non-nested)
>> > > > >
>> > > > > UDP: 287.59 MB/s
>> > > > > VSOCK DGRAM: 509.2 MB/s
>> > > > >
>> > > > > Some notes about the implementation...
>> > > > >
>> > > > > This datagram implementation forces datagrams to self-throttle according
>> > > > > to the threshold set by sk_sndbuf. It behaves similarly to the credits
>> > > > > used by streams in its effect on throughput and memory consumption, but
>> > > > > it is not influenced by the receiving socket as credits are.
>> > >
>> > > So, sk_sndbuf influences the sender and sk_rcvbuf the receiver, right?
>> > >
>> >
>> > Correct.
>> >
>> > > We should check if VMCI behaves the same.
>> > >
>> > > > >
>> > > > > The device drops packets silently. There is room for improvement by
>> > > > > building into the device and driver some intelligence around how to
>> > > > > reduce frequency of kicking the virtqueue when packet loss is high. I
>> > > > > think there is a good discussion to be had on this.
>> > >
>> > > Can you elaborate a bit here?
>> > >
>> > > Do you mean some mechanism to report to the sender that a destination
>> > > (cid, port) is full so the packet will be dropped?
>> > >
>> >
>> > Correct. There is also the case of there being no receiver at all for
>> > this address since this case isn't rejected upon connect(). Ideally,
>> > such a socket (which will have 100% packet loss) will be throttled
>> > aggressively.
>> >
>> > Before we go down too far on this path, I also want to clarify that
>> > using UDP over vhost/virtio-net also has this property... this can be
>> > observed by using tcpdump to dump the UDP packets on the bridge network
>> > your VM is using.
>> > UDP packets sent to a garbage address can be seen on
>> > the host bridge (this is the nature of UDP, how is the host supposed to
>> > know the address eventually goes nowhere). I mention the above because I
>> > think it is possible for vsock to avoid this cost, given that it
>> > benefits from being point-to-point and g2h/h2g.
>> >
>> > If we're okay with vsock being on par, then the current series does
>> > that. I propose something below that can be added later and maybe
>> > negotiated as a feature bit too.
>>
>> I see, and I agree on that, let's do it step by step.
>> If we can do it in the first phase, great, but I think it's fine to add
>> this feature later.
>>
>> >
>> > > Can we adapt the credit mechanism?
>> > >
>> >
>> > I've thought about this a lot because the attraction of the approach for
>> > me would be that we could get the wait/buffer-limiting logic for free
>> > and without big changes to the protocol, but the problem is that the
>> > unreliable nature of datagrams means that the source's free-running
>> > tx_cnt will become out-of-sync with the destination's fwd_cnt upon
>> > packet loss.
>>
>> We need to understand where the packet can be lost.
>> If the packet always reaches the destination (vsock driver or device),
>> we can discard it, but also update the counters.
>>
>> >
>> > Imagine a source that initializes and starts sending packets before a
>> > destination socket even is created, the source's self-throttling will be
>> > dysfunctional because its tx_cnt will always far exceed the
>> > destination's fwd_cnt.
>>
>> Right, the other problem I see is that the sockets aren't connected, so
>> we have a 1-N relationship.
>>
>
>Oh yeah, good point.
>
>> >
>> > We could play tricks with the meaning of the CREDIT_UPDATE message and
>> > fwd_cnt/buf_alloc fields, but I don't think we want to go down that
>> > path.
>> >
>> > I think that the best and simplest approach introduces a congestion
>> > notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the
>> > destination sends this notification. At a given repeated time period T,
>> > the source can check if it has received any notifications in the last T.
>> > If so, it halves its buffer allocation. If not, it doubles its buffer
>> > allocation unless it is already at its max or original value.
>> >
>> > An "invalid" socket which never has any receiver will converge towards a
>> > rate limit of one packet per time T * log2(average pkt size). That is, a
>> > socket with 100% packet loss will only be able to send 16 bytes every
>> > 4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within
>> > 160ms given at least one packet sent per 5ms. I have no idea if that is
>> > a reasonable default T for vsock, I just pulled it out of a hat for the
>> > sake of the example.
>> >
>> > "Normal" sockets will be responsive to high loss and rebalance during
>> > low loss. The source is trying to guess and converge on the actual
>> > buffer state of the destination.
>> >
>> > This would reuse the already-existing throttling mechanisms that
>> > throttle based upon buffer allocation. The usage of sk_sndbuf would have
>> > to be re-worked. The application using sendmsg() will see EAGAIN when
>> > throttled, or just sleep if !MSG_DONTWAIT.
>>
>> I see, it looks interesting, but I think we need to share that
>> information between multiple sockets, since the same destination
>> (cid, port) can be reached by multiple sockets.
>>
>
>Good point, that is true.
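
For illustration, the period-T adjustment sketched in the quoted proposal
could look roughly like the plain-C sketch below. Everything here is made up
for the example (the struct, the field names, the floor/ceiling values and
the tick callback); it is not code from the posted series.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative per-destination sender state for the period-T scheme. */
    struct dgram_cc {
            uint32_t buf_alloc;      /* current self-imposed send budget */
            uint32_t buf_alloc_min;  /* floor, e.g. one max-sized packet */
            uint32_t buf_alloc_max;  /* original/default budget (e.g. sk_sndbuf) */
            bool     cn_seen;        /* congestion notification seen in the last T */
    };

    /* Called once per period T, e.g. from a timer. */
    static void dgram_cc_tick(struct dgram_cc *cc)
    {
            if (cc->cn_seen) {
                    /* Congestion seen: halve the budget, but keep a floor. */
                    cc->buf_alloc /= 2;
                    if (cc->buf_alloc < cc->buf_alloc_min)
                            cc->buf_alloc = cc->buf_alloc_min;
            } else {
                    /* No congestion: double, capped at the original value. */
                    if (cc->buf_alloc > cc->buf_alloc_max / 2)
                            cc->buf_alloc = cc->buf_alloc_max;
                    else
                            cc->buf_alloc *= 2;
            }

            cc->cn_seen = false;
    }

With that shape, a destination that drops every packet keeps cn_seen set in
essentially every period, so buf_alloc decays toward the floor, which matches
the convergence behaviour described in the proposal above.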
>
>> Another approach could be to have both congestion notification and
>> decongestion, but maybe it produces double traffic.
>>
>
>I think this could simplify things and could reduce noise. It is also
>probably sufficient for the source to simply halt upon congestion
>notification and resume upon decongestion notification, instead of
>scaling up and down like I suggested above. It also avoids the
>burstiness that would occur with a "congestion notification"-only
>approach where the source guesses when to resume and guesses wrong.
>
>The congestion notification may want to have an expiration period after
>which the sender can resume without receiving a decongestion
>notification? If it receives congestion again, then it can halt again.

Yep, I agree.

Anyway, the congestion/decongestion messages should just be a hint,
because the other peer has to keep state, and a malicious host/guest
could use it for DoS, so the peer could discard these packets if it has
no more space to save the state.

Thanks,
Stefano
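
For reference, the halt/resume-with-expiry variant discussed above could be
shaped roughly like the following plain-C sketch. All names and the expiry
value are illustrative only; nothing here is defined by the series or the
spec.

    #include <stdbool.h>
    #include <stdint.h>

    #define CN_EXPIRY_NS (100 * 1000 * 1000ULL)  /* illustrative expiry: 100 ms */

    /* Illustrative sender-side view of the congestion hint. */
    struct dgram_cn_state {
            bool     halted;     /* true after a congestion notification */
            uint64_t resume_ns;  /* when to resume if no decongestion arrives */
    };

    static void dgram_on_congestion(struct dgram_cn_state *st, uint64_t now_ns)
    {
            st->halted = true;
            st->resume_ns = now_ns + CN_EXPIRY_NS;
    }

    static void dgram_on_decongestion(struct dgram_cn_state *st)
    {
            st->halted = false;
    }

    /* Checked on each send; return EAGAIN or sleep while this is false. */
    static bool dgram_may_send(struct dgram_cn_state *st, uint64_t now_ns)
    {
            if (st->halted && now_ns >= st->resume_ns)
                    st->halted = false;  /* hint expired, resume optimistically */

            return !st->halted;
    }

Since the notifications are only hints, a peer that has no room to keep this
state can simply drop them, and the expiry path above ensures the sender
eventually resumes anyway.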