Michael S. Tsirkin
2014-Feb-12  16:36 UTC
[PATCH net 2/3] vhost: fix ref cnt checking deadlock
vhost checked the counter within the refcnt before decrementing.  It
really wanted to know that there aren't too many references, as a way to
batch freeing resources a bit more efficiently.
This works well, but we now access the
ref counter twice, so there's a race:
all users might see a high count and decide
to defer freeing resources.
In the end no one initiates freeing resources
until the last reference is gone (which is on VM shutdown
so might happen after a looooong time).
Let's do what we should have done straight away:
add a kref API to return the kref value atomically,
and use that to avoid the deadlock.
Reported-by: Qin Chuanyu <qinchuanyu at huawei.com>
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 831eb4f..7eaf2de 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -140,9 +140,9 @@ vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
 	return ubufs;
 }
 
-static void vhost_net_ubuf_put(struct vhost_net_ubuf_ref *ubufs)
+static int vhost_net_ubuf_put(struct vhost_net_ubuf_ref *ubufs)
 {
-	kref_put(&ubufs->kref, vhost_net_zerocopy_done_signal);
+	return kref_sub_return(&ubufs->kref, 1, vhost_net_zerocopy_done_signal);
 }
 
 static void vhost_net_ubuf_put_and_wait(struct vhost_net_ubuf_ref *ubufs)
@@ -306,22 +306,21 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 {
 	struct vhost_net_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
-	int cnt = atomic_read(&ubufs->kref.refcount);
+	int cnt;
 
 	/* set len to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = success ?
 		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
-	vhost_net_ubuf_put(ubufs);
+	cnt = vhost_net_ubuf_put(ubufs);
 
 	/*
 	 * Trigger polling thread if guest stopped submitting new buffers:
-	 * in this case, the refcount after decrement will eventually reach 1
-	 * so here it is 2.
+	 * in this case, the refcount after decrement will eventually reach 1.
 	 * We also trigger polling periodically after each 16 packets
 	 * (the value 16 here is more or less arbitrary, it's tuned to trigger
 	 * less than 10% of times).
 	 */
-	if (cnt <= 2 || !(cnt % 16))
+	if (cnt <= 1 || !(cnt % 16))
 		vhost_poll_queue(&vq->poll);
 }
 
-- 
MST
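The race described in the commit message above can be replayed by hand in user space. The following is only a sketch under stated assumptions: it uses C11 atomics rather than the kernel's atomic_t, the function names (old_should_poll, new_should_poll, polls_with_separate_read, polls_with_sub_return) are hypothetical, and the two "completions" are interleaved manually instead of running on real CPUs. The starting value 3 stands for one base reference plus two outstanding buffers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Old rule: applied to the value read BEFORE the decrement. */
static bool old_should_poll(int cnt_before)
{
	return cnt_before <= 2 || !(cnt_before % 16);
}

/* New rule: applied to the value returned BY the decrement. */
static bool new_should_poll(int cnt_after)
{
	return cnt_after <= 1 || !(cnt_after % 16);
}

/*
 * Old scheme: read the counter, then decrement it in a separate step.
 * Completions A and B are interleaved so that both read before either
 * decrements, which is the window the commit message describes.
 */
static int polls_with_separate_read(int start)
{
	atomic_int refcount;
	int polls = 0;

	atomic_init(&refcount, start);

	int seen_a = atomic_load(&refcount);	/* A reads */
	int seen_b = atomic_load(&refcount);	/* B reads the same value */
	atomic_fetch_sub(&refcount, 1);		/* A drops its reference */
	atomic_fetch_sub(&refcount, 1);		/* B drops its reference */
	assert(atomic_load(&refcount) == start - 2);

	polls += old_should_poll(seen_a);
	polls += old_should_poll(seen_b);
	return polls;
}

/*
 * New scheme: one atomic operation both decrements and reports the
 * resulting value, so no two completions can observe the same count.
 */
static int polls_with_sub_return(int start)
{
	atomic_int refcount;
	int polls = 0;

	atomic_init(&refcount, start);

	/* atomic_fetch_sub() returns the old value, hence the "- 1". */
	int after_a = atomic_fetch_sub(&refcount, 1) - 1;
	int after_b = atomic_fetch_sub(&refcount, 1) - 1;
	assert(atomic_load(&refcount) == start - 2);

	polls += new_should_poll(after_a);
	polls += new_should_poll(after_b);
	return polls;
}
```

With a starting count of 3, the first function reports no poll at all (both completions saw 3), while the second reports exactly one, triggered by the completion that brought the count down to 1.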
This fixes a deadlock with vhost reported in the field,
as well as a theoretical race issue found by code review.
Patches 1+2 are needed for stable.

Thanks to Qin Chuanyu for reporting the issue!

Michael S. Tsirkin (3):
  kref: add kref_sub_return
  vhost: fix ref cnt checking deadlock
  vhost: fix a theoretical race in device cleanup

 include/linux/kref.h | 33 ++++++++++++++++++++++++++++++++-
 drivers/vhost/net.c  | 15 ++++++++++-----
 2 files changed, 42 insertions(+), 6 deletions(-)

-- 
MST
Michael S. Tsirkin
2014-Feb-12  16:38 UTC
[PATCH net 3/3] vhost: fix a theoretical race in device cleanup
vhost_zerocopy_callback accesses VQ right after it drops the last ubuf
reference.  In theory, this could race with device removal, which waits
on the ubuf kref, and crash on use after free.

Do all accesses within an RCU read-side critical section, and
synchronize on release.

Since callbacks are always invoked from bh, synchronize_rcu_bh seems
enough and will help release complete a bit faster.

Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7eaf2de..78a9d42 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -308,6 +308,8 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	struct vhost_virtqueue *vq = ubufs->vq;
 	int cnt;
 
+	rcu_read_lock_bh();
+
 	/* set len to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = success ?
 		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
@@ -322,6 +324,8 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	 */
 	if (cnt <= 1 || !(cnt % 16))
 		vhost_poll_queue(&vq->poll);
+
+	rcu_read_unlock_bh();
 }
 
 /* Expects to be always run from workqueue - which acts as
@@ -804,6 +808,8 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 		fput(tx_sock->file);
 	if (rx_sock)
 		fput(rx_sock->file);
+	/* Make sure no callbacks are outstanding */
+	synchronize_rcu_bh();
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
-- 
MST
It is sometimes useful to get the value of the reference count after
decrement.
For example, vhost wants to execute some periodic cleanup operations
once the number of references drops below a specific value, before it
reaches zero (for efficiency).
Add an API to do this atomically and efficiently using
atomic_sub_return.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
Greg, could you ack this API extension please?
I think it is cleanest to merge this through -net together
with the first user.
 include/linux/kref.h | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/linux/kref.h b/include/linux/kref.h
index 484604d..cb20550 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -61,7 +61,7 @@ static inline void kref_get(struct kref *kref)
  *
  * Subtract @count from the refcount, and if 0, call release().
  * Return 1 if the object was removed, otherwise return 0.  Beware, if this
- * function returns 0, you still can not count on the kref from remaining in
+ * function returns 0, you still can not count on the kref remaining in
  * memory.  Only use the return value if you want to see if the kref is now
  * gone, not present.
  */
@@ -78,6 +78,38 @@ static inline int kref_sub(struct kref *kref, unsigned int count,
 }
 
 /**
+ * kref_sub_return - subtract a number of refcounts for object.
+ * @kref: object.
+ * @count: Number of refcounts to subtract.
+ * @release: pointer to the function that will clean up the object when the
+ *	     last reference to the object is released.
+ *	     This pointer is required, and it is not acceptable to pass kfree
+ *	     in as this function.  If the caller does pass kfree to this
+ *	     function, you will be publicly mocked mercilessly by the kref
+ *	     maintainer, and anyone else who happens to notice it.  You have
+ *	     been warned.
+ *
+ * Subtract @count from the refcount, and if 0, call release().
+ * Return the new refcount.  Beware, if this function returns > N, you still
+ * can not count on there being at least N other references, and in
+ * particular, on the kref remaining in memory.
+ * Only use the return value if you want to see if there are at most,
+ * not at least, N other references to kref.
+ */
+static inline int kref_sub_return(struct kref *kref, unsigned int count,
+				  void (*release)(struct kref *kref))
+{
+	int r;
+
+	WARN_ON(release == NULL);
+
+	r = atomic_sub_return((int) count, &kref->refcount);
+	if (!r)
+		release(kref);
+	return r;
+}
+
+/**
  * kref_put - decrement refcount for object.
  * @kref: object.
  * @release: pointer to the function that will clean up the object when the
-- 
MST
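For readers without a kernel tree at hand, the shape of the proposed helper can be tried out in user space. The following is a sketch only: struct uref, uref_init, uref_release and uref_sub_return are hypothetical names, C11 atomics replace the kernel's atomic_t, and assert() replaces WARN_ON().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct kref. */
struct uref {
	atomic_int refcount;
};

/* Counts release() invocations so the behaviour can be observed. */
static int release_calls;

static void uref_release(struct uref *ref)
{
	(void)ref;
	release_calls++;
}

static void uref_init(struct uref *ref, int initial)
{
	atomic_init(&ref->refcount, initial);
}

/*
 * Same shape as the proposed kref_sub_return(): subtract @count, call
 * release() if the count reached zero, and return the new count.
 */
static int uref_sub_return(struct uref *ref, unsigned int count,
			   void (*release)(struct uref *ref))
{
	int r;

	assert(release != NULL);

	/* atomic_fetch_sub() yields the OLD value; derive the new one. */
	r = atomic_fetch_sub(&ref->refcount, (int)count) - (int)count;
	if (!r)
		release(ref);
	return r;
}
```

Starting from a count of 3, subtracting 1 returns 2 without calling release(), and subtracting the remaining 2 returns 0 and calls release() once. As the kernel-doc comment in the patch warns, a non-zero return value only bounds the number of references from above; another thread may drop the rest immediately afterwards.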
On Wed, Feb 12, 2014 at 06:38:21PM +0200, Michael S. Tsirkin wrote:
> It is sometimes useful to get the value of the reference count after
> decrement.
> For example, vhost wants to execute some periodic cleanup operations
> once number of references drops below a specific value, before it
> reaches zero (for efficiency).

You should never care about what the value of the kref is, if you are
using it correctly :)

So I really don't want to add this function, as I'm sure people will use
it incorrectly.  You should only care if the reference drops to 0, if
not, then your usage doesn't really fit into the "kref" model, and so,
just use an atomic variable.

I really want to know why it matters for "efficiency" that you know this
number.  How does that help anything, as the number could then go up
later on, and the work you did at a "lower" number is obsolete, right?

thanks,

greg k-h