Steven Luong (sluong)
2018-Dec-13 23:24 UTC
kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
Folks,

We came across a shared-memory race condition between the VPP vhost driver and the kernel vhost. VPP is running a tap interface over the vhost backend. In this case VPP acts in the vhost driver role and the kernel vhost acts in the vhost device role.

In the kernel vhost's TX traffic direction, which is VPP's RX traffic direction, kernel vhost is the producer and VPP is the consumer. Kernel vhost places traffic in kernel vhost's TX vring; VPP removes traffic from VPP's RX vring. It is inevitable that the vring becomes full under heavy traffic, and the kernel vhost may then have to stop and wait for the guest (VPP) to empty the vring and refill it with descriptors. When that happens, kernel vhost clears the bit in the vring's used flags to request an interrupt/notification. Due to a shared-memory race condition, VPP may miss the clearing of the vring's used flags by kernel vhost and therefore not send kernel vhost an interrupt after VPP empties the vring and refills it with new descriptors. Unfortunately, this missed notification leaves the kernel vhost stuck: once the kernel vhost is waiting for an interrupt/notification from the guest, only an interrupt/notification from the guest can re-enable it to submit new packets to the vring.

This design seems vulnerable. Should the kernel vhost depend entirely on the notification from the guest? Quoting the text from

http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html

    /* The device uses this in used->flags to advise the driver: don't kick me
     * when you add a buffer.  It's unreliable, so it's simply an
     * optimization. */
    #define VIRTQ_USED_F_NO_NOTIFY  1

I interpret that the notification is simply an optimization, not a reliable notification mechanism, so the kernel vhost should not bet the farm on it.

We encounter the same race condition in kernel vhost's RX traffic direction, which also causes the kernel vhost to get stuck due to a missed interrupt/notification, although it is less frequent there.

Steven
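To ground the discussion, here is a minimal sketch of the driver-side path being described, assuming a legacy split-ring layout: the driver publishes avail->idx and then consults used->flags to decide whether to kick. The struct layout and the notify_device() helper are illustrative stand-ins, not VPP's actual code.

    #include <stdint.h>
    #include <stdio.h>

    #define VIRTQ_USED_F_NO_NOTIFY 1

    /* Illustrative split-ring headers shared between driver and device. */
    struct vring_avail_hdr { volatile uint16_t flags, idx; };
    struct vring_used_hdr  { volatile uint16_t flags, idx; };

    /* Hypothetical stand-in for kicking the device; with vhost this would be a
     * write to the kick eventfd. */
    static void notify_device(void) { puts("kick"); }

    /* Driver side: after refilling the ring with descriptors, publish avail->idx,
     * then decide whether to kick based on used->flags.  The race discussed in
     * this thread is about how this flags read is ordered against the device
     * clearing VIRTQ_USED_F_NO_NOTIFY. */
    static void driver_refill_and_maybe_kick(struct vring_avail_hdr *avail,
                                             struct vring_used_hdr *used,
                                             uint16_t new_avail_idx)
    {
        avail->idx = new_avail_idx;      /* make the new descriptors visible */
        __sync_synchronize();            /* full barrier: publish idx before reading flags */
        if (!(used->flags & VIRTQ_USED_F_NO_NOTIFY))
            notify_device();             /* device asked to be notified */
    }

    int main(void)
    {
        struct vring_avail_hdr avail = { 0, 0 };
        struct vring_used_hdr  used  = { 0, 0 };   /* flags == 0: notifications wanted */
        driver_refill_and_maybe_kick(&avail, &used, 8);
        return 0;
    }

Whether a kick is delivered when it matters depends on how this read of used->flags interleaves with the device clearing the flag, which is exactly the race discussed in the replies below.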
Michael S. Tsirkin
2018-Dec-17 22:55 UTC
kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
> Folks,
>
> We came across a memory race condition between VPP vhost driver and the kernel
> vhost. VPP is running a tap interface over vhost backend. In this case, VPP is
> acting as the vhost driver mode and the kernel vhost is acting as the vhost
> device mode.
>
> In the kernel vhost's TX traffic direction which is VPP's RX traffic direction,
> kernel vhost is the producer and VPP is the consumer.

Looking at vhost net kernel code, it seems to use the same terminology as
virtio net. Can you pls clarify which ring number do you use 0 or 1?

I will assume ring 0 below.

> Kernel vhost places
> traffic in kernel vhost's TX vring. VPP removes traffic in VPP's RX vring.

So VPP makes buffers available and vhost kernel uses them?

> It
> is inevitable that the vring may become full under heavy traffic and the kernel
> vhost may have to stop and wait for the guest (VPP) to empty the vring and to
> refill the vring with descriptors. When that case happens, kernel vhost clears
> the bit in the vring's used flag to request an interrupt/notification. Due to
> shared memory race condition, VPP may miss the clearing of the vring's used
> flag from kernel vhost and didn't send kernel vhost an interrupt after VPP
> empties and refills the vring with new descriptors.

So kernel vhost's wakeups are signalled through eventfd - I assume this is
what you mean by an interrupt?

To prevent the race that you describe, vhost has this code:

                /* OK, now we need to know about added descriptors. */
                if (!headcount) {
                        if (unlikely(busyloop_intr)) {
                                vhost_poll_queue(&vq->poll);
                        } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                                /* They have slipped one in as we were
                                 * doing that: check again. */
                                vhost_disable_notify(&net->dev, vq);
                                continue;
                        }
                        /* Nothing new?  Wait for eventfd to tell us

So ring should be re-checked after notifications are enabled. If ring is no
longer empty, vhost will proceed to handle the ring.

This code has been around for many years. Thus if VPP doesn't work with it
then I suspect it does something differently, e.g. is it possible that it is
missing a memory barrier after writing out the available buffers and before
checking the flags?

> Unfortunately, this missed
> notification causes the kernel vhost to be stuck because once the kernel vhost
> is waiting for an interrupt/notification from the guest, only an interrupt/
> notification from the guest can resume/re-enable the guest from submitting new
> packets to the vring. This design seems vulnerable.

The protocol right now relies on notifications never being lost.

I can imagine an alternative where device also re-checks the rings
periodically, but that would involve running timers host side which is only
possible if there's a free processor that can handle them. Further, it will
still lead to stalls and packet drops, which will be harder to debug.

That's just my $.02, pls feel free to take it with the virtio tc - maybe
others will feel differently.

> Should the kernel vhost
> totally depend on the notification from the guest? Quoting the text from
>
> http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
>
> /* The device uses this in used->flags to advise the driver: don't kick me
>  * when you add a buffer. It's unreliable, so it's simply an
>  * optimization. */
> #define VIRTQ_USED_F_NO_NOTIFY 1
>
> I interpret that the notification is simply an optimization, not a reliable
> notification mechanism.

What was meant I think is that suppression is unreliable.

> So the kernel vhost should not bet the farm on it.
>
> We encounter the same race condition in kernel vhost's RX traffic direction
> which causes the kernel vhost to be stuck due to missed interrupt/notification
> although it is less frequent.

For the reverse direction the spec actually has some pseudo-code and a
suggestion about handling that race:

    For optimal performance, a driver MAY disable used buffer notifications
    while processing the used ring, but beware the problem of missing
    notifications between emptying the ring and reenabling notifications.
    This is usually handled by re-checking for more used buffers after
    notifications are re-enabled:

    virtq_disable_used_buffer_notifications(vq);

    for (;;) {
            if (vq->last_seen_used != le16_to_cpu(virtq->used.idx)) {
                    virtq_enable_used_buffer_notifications(vq);
                    mb();

                    if (vq->last_seen_used != le16_to_cpu(virtq->used.idx))
                            break;

                    virtq_disable_used_buffer_notifications(vq);
            }

            struct virtq_used_elem *e = virtq.used->ring[vq->last_seen_used%vsz];
            process_buffer(e);
            vq->last_seen_used++;

> Steven
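To make the pattern in that vhost-net excerpt concrete, here is a self-contained sketch of the device-side "enable notifications, then re-check the avail ring" sequence. The struct, ring_empty(), wait_for_kick() and the other helpers are illustrative assumptions standing in for the real vhost machinery (vhost_enable_notify(), the eventfd wait, and so on), not the kernel code itself.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define VIRTQ_USED_F_NO_NOTIFY 1

    /* Illustrative device-side state; in vhost-net this lives in struct vhost_virtqueue. */
    struct device_ctx {
        volatile uint16_t *avail_idx;   /* driver-written avail->idx in shared memory */
        uint16_t last_avail_idx;        /* how far the device has consumed the ring */
        volatile uint16_t *used_flags;  /* device-written used->flags */
    };

    static bool ring_empty(struct device_ctx *d)
    {
        return d->last_avail_idx == *d->avail_idx;
    }

    static void enable_notify(struct device_ctx *d)
    {
        *d->used_flags &= ~VIRTQ_USED_F_NO_NOTIFY;  /* ask the driver to kick us */
        __sync_synchronize();                       /* publish the flag before re-reading avail->idx */
    }

    static void disable_notify(struct device_ctx *d)
    {
        *d->used_flags |= VIRTQ_USED_F_NO_NOTIFY;   /* suppress further kicks while we poll */
    }

    /* Hypothetical stand-ins: in vhost these are the eventfd wait and the actual
     * packet processing. */
    static void wait_for_kick(struct device_ctx *d)              { (void)d; puts("sleep until kick"); }
    static void process_available_buffers(struct device_ctx *d)  { d->last_avail_idx = *d->avail_idx; }

    /* Device-side loop mirroring the vhost-net excerpt: when the avail ring looks
     * empty, enable notifications and re-check.  If the driver slipped new buffers
     * in while we were enabling, disable notifications again and keep processing
     * instead of going to sleep. */
    void device_service_ring(struct device_ctx *d)
    {
        for (;;) {
            if (ring_empty(d)) {
                enable_notify(d);
                if (!ring_empty(d)) {      /* the re-check closes the race window */
                    disable_notify(d);
                    continue;
                }
                wait_for_kick(d);          /* nothing new: wait for the driver's kick */
                disable_notify(d);
                continue;
            }
            process_available_buffers(d);
        }
    }

The key property is that the device never sleeps on the strength of a single emptiness check: after clearing VIRTQ_USED_F_NO_NOTIFY it looks at avail->idx once more, so buffers published concurrently with the flag change are still picked up.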
Steven Luong (sluong)
2018-Dec-17 23:56 UTC
kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
On 12/17/18, 2:55 PM, "Michael S. Tsirkin" <mst at redhat.com> wrote:

    On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
    > Folks,
    >
    > [...]
    >
    > In the kernel vhost's TX traffic direction which is VPP's RX traffic direction,
    > kernel vhost is the producer and VPP is the consumer.

    Looking at vhost net kernel code, it seems to use the same terminology as
    virtio net. Can you pls clarify which ring number do you use 0 or 1?

<SVL> 0. </SVL>

    I will assume ring 0 below.

    > Kernel vhost places
    > traffic in kernel vhost's TX vring. VPP removes traffic in VPP's RX vring.

    So VPP makes buffers available and vhost kernel uses them?

<SVL> Correct. </SVL>

    > [...]

    So kernel vhost's wakeups are signalled through eventfd - I assume this is
    what you mean by an interrupt?

    To prevent the race that you describe, vhost has this code:

    [...]

    So ring should be re-checked after notifications are enabled. If ring is no
    longer empty, vhost will proceed to handle the ring.

    This code has been around for many years. Thus if VPP doesn't work with it
    then I suspect it does something differently, e.g. is it possible that it is
    missing a memory barrier after writing out the available buffers and before
    checking the flags?

<SVL>
Can the race condition be avoided totally? Here is the scenario:

t0: VPP refills the vring with new descriptors but has not yet set avail->idx
    to make them available to kernel vhost.
t1: kernel vhost checks the vring, still sees no space available, but has not
    yet executed the line of code that clears used->flags.
t2: VPP executes sfence and sets avail->idx to make the descriptors available
    to kernel vhost.
t3: VPP checks used->flags; it is still 1, so VPP skips sending the interrupt.
t4: kernel vhost clears used->flags to indicate that an interrupt is required,
    but VPP has just missed it.

The result is that kernel vhost gets stuck even though the vring is filled
with new descriptors.

Steven
</SVL>
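Relating the t0-t4 interleaving above to the re-check in the vhost-net excerpt quoted earlier, here is a sketch of the two halves of the handshake. The names and layout are illustrative, not VPP or vhost-net code, and a full barrier is used where the scenario mentions sfence, since the store of one shared field must be ordered before the load of the other side's field.

    #include <stdint.h>

    #define VIRTQ_USED_F_NO_NOTIFY 1

    /* Shared split-ring fields relevant to the race (illustrative layout). */
    struct shared_ring {
        volatile uint16_t avail_idx;   /* written by the driver (VPP) */
        volatile uint16_t used_flags;  /* written by the device (kernel vhost) */
    };

    /* Driver (VPP) side, corresponding to t0/t2/t3 in the scenario above.
     * Returns nonzero if the driver should kick the device. */
    int driver_publish_and_check(struct shared_ring *r, uint16_t new_idx)
    {
        r->avail_idx = new_idx;        /* t2: publish the refilled descriptors */
        __sync_synchronize();          /* full barrier: the used_flags read below must not
                                          be reordered before the avail_idx store (a store
                                          fence alone does not order store -> load) */
        return !(r->used_flags & VIRTQ_USED_F_NO_NOTIFY);   /* t3: kick only if asked */
    }

    /* Device (kernel vhost) side, corresponding to t1/t4.  Returns nonzero if it
     * is safe to sleep and wait for a kick. */
    int device_enable_notify_and_recheck(struct shared_ring *r, uint16_t last_avail)
    {
        r->used_flags &= ~VIRTQ_USED_F_NO_NOTIFY;   /* t4: request a notification */
        __sync_synchronize();                       /* flag store before re-reading avail_idx */
        /* Re-check after enabling notifications: if the driver already published
         * new descriptors at t2, this read sees them and the device keeps
         * processing instead of sleeping, so a kick skipped at t3 does not
         * strand the ring. */
        return r->avail_idx == last_avail;
    }

With a full barrier on each side, at least one of the two re-checks is guaranteed to observe the other side's store: either VPP sees the cleared flag at t3 and kicks, or the device's re-read of avail->idx after t4 sees the new descriptors. The interleaving above therefore stalls only if the device sleeps without re-checking after it clears the flag.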