Jason Wang
2015-Oct-29 08:45 UTC
[PATCH net-next rfc V2 0/2] basic busy polling support for vhost_net
Hi all:

This series adds basic busy polling to vhost net. The idea is simple: at the end of tx processing, busy poll for a while for newly added tx descriptors and for data on the rx socket receive queue. The maximum time (in us) that may be spent busy polling is specified through a module parameter.

Tests were done with:

- 50 us as the busy loop timeout
- Netperf 2.6
- Two machines connected back to back with mlx4
- Guest with 8 vcpus and 1 queue

The results show a large improvement in both tx (at most 158%) and rr (at most 53%), while rx is about the same as before. In most cases the cpu utilization is also improved:

Guest TX:
size/session/+thu%/+normalize%
   64/   1/  +17%/   +6%
   64/   4/   +9%/  +17%
   64/   8/  +34%/  +21%
  512/   1/  +48%/  +40%
  512/   4/  +31%/  +20%
  512/   8/  +39%/  +22%
 1024/   1/ +158%/  +99%
 1024/   4/  +20%/  +11%
 1024/   8/  +40%/  +18%
 2048/   1/ +108%/  +74%
 2048/   4/  +21%/   +7%
 2048/   8/  +32%/  +14%
 4096/   1/  +94%/  +77%
 4096/   4/   +7%/   -6%
 4096/   8/   +9%/   -4%
16384/   1/  +33%/   +9%
16384/   4/  +10%/   -6%
16384/   8/  +19%/   +2%
65535/   1/  +15%/   -6%
65535/   4/   +8%/   -9%
65535/   8/  +14%/    0%

Guest RX:
size/session/+thu%/+normalize%
   64/   1/   -3%/   -3%
   64/   4/   +4%/  +20%
   64/   8/   -1%/   -1%
  512/   1/  +20%/  +12%
  512/   4/   +1%/   +3%
  512/   8/    0%/   -5%
 1024/   1/   +9%/   -2%
 1024/   4/    0%/   +5%
 1024/   8/   +1%/    0%
 2048/   1/    0%/   +3%
 2048/   4/   -2%/   +3%
 2048/   8/   -1%/   -3%
 4096/   1/   -8%/   +3%
 4096/   4/    0%/   +2%
 4096/   8/    0%/   +5%
16384/   1/   +3%/    0%
16384/   4/   +2%/   +2%
16384/   8/    0%/  +13%
65535/   1/    0%/   +3%
65535/   4/   +2%/   -1%
65535/   8/   +1%/  +14%

TCP_RR:
size/session/+thu%/+normalize%
    1/   1/   +8%/   -6%
    1/  50/  +18%/  +15%
    1/ 100/  +22%/  +19%
    1/ 200/  +25%/  +23%
   64/   1/   +2%/  -19%
   64/  50/  +46%/  +39%
   64/ 100/  +47%/  +39%
   64/ 200/  +50%/  +44%
  512/   1/    0%/  -28%
  512/  50/  +50%/  +44%
  512/ 100/  +53%/  +47%
  512/ 200/  +51%/  +58%
 1024/   1/   +3%/  -14%
 1024/  50/  +45%/  +37%
 1024/ 100/  +53%/  +49%
 1024/ 200/  +48%/  +55%

Changes from V1:
- Add a comment for vhost_has_work() to explain why it could be lockless
- Add param description for busyloop_timeout
- Split out the busy polling logic into a new helper
- Check and exit the loop when there's a pending signal
- Disable preemption during busy looping to make sure local_clock() is used correctly.

Todo:
- Make the busyloop timeout configurable per VM through ioctl.

Please review.

Thanks

Jason Wang (2):
  vhost: introduce vhost_has_work()
  vhost_net: basic polling support

 drivers/vhost/net.c   | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.c |  7 +++++++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 58 insertions(+), 4 deletions(-)

--
1.8.3.1
Jason Wang
2015-Oct-29 08:45 UTC
[PATCH net-next rfc V2 1/2] vhost: introduce vhost_has_work()
This patch introduces a helper that gives a hint as to whether any work is queued in the work list.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/vhost.c | 7 +++++++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+	return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
 	vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..ea0327d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     unsigned long mask, struct vhost_dev *dev);
--
1.8.3.1
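For readers without the kernel tree at hand, the lockless check can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct list_node`, `struct demo_dev`, `demo_has_work()` and the list helpers are hypothetical stand-ins modelled on the kernel's `list_head` and `list_empty()`, not the vhost code itself. The point it shows is that the check is a single pointer comparison, so a racing reader can at worst see a stale answer; that is why the patch calls the result a hint.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* An empty list is a head that points at itself. */
static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Append node just before head, i.e. at the tail of the list. */
static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink node from whatever list it is on. */
static void list_del_node(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node;
	node->prev = node;
}

/* Hypothetical stand-in for struct vhost_dev: only the work list matters. */
struct demo_dev {
	struct list_node work_list;
};

/*
 * Lockless hint: one pointer comparison, no lock taken. A concurrent
 * writer may change the list right after we look, so callers must treat
 * the answer as advisory and re-check under the proper lock before
 * acting on any queued work.
 */
static bool demo_has_work(const struct demo_dev *dev)
{
	return dev->work_list.next != &dev->work_list;
}
```

In a busy loop a stale answer costs little: the poller either spins one more iteration or leaves the loop slightly early, and the real work list is still processed under its normal locking.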
Jason Wang
2015-Oct-29 08:45 UTC
[PATCH net-next rfc V2 2/2] vhost_net: basic polling support
This patch polls for newly added tx buffers for a while at the end of tx processing. The maximum time spent polling is limited through a module parameter. To avoid blocking rx, the loop ends if other work is queued on vhost, so in effect the socket receive queue is also polled.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/net.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..30e6d3d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -31,9 +31,13 @@
 #include "vhost.h"
 
 static int experimental_zcopytx = 1;
+static int busyloop_timeout = 50;
 module_param(experimental_zcopytx, int, 0444);
 MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 		                       " 1 -Enable; 0 - Disable");
+module_param(busyloop_timeout, int, 0444);
+MODULE_PARM_DESC(busyloop_timeout, "Maximum number of time (in us) "
+		                   "could be spend on busy polling");
 
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
@@ -287,6 +291,49 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+	return local_clock() >> 10;
+}
+
+static bool tx_can_busy_poll(struct vhost_dev *dev,
+			     unsigned long endtime)
+{
+	return likely(!need_resched()) &&
+	       likely(!time_after(busy_clock(), endtime)) &&
+	       likely(!signal_pending(current)) &&
+	       !vhost_has_work(dev) &&
+	       single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_virtqueue *vq,
+				    struct iovec iov[], unsigned int iov_size,
+				    unsigned int *out_num, unsigned int *in_num)
+{
+	unsigned long uninitialized_var(endtime);
+	int head;
+
+	if (busyloop_timeout) {
+		preempt_disable();
+		endtime = busy_clock() + busyloop_timeout;
+	}
+
+again:
+	head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+				 out_num, in_num, NULL, NULL);
+
+	if (head == vq->num && busyloop_timeout &&
+	    tx_can_busy_poll(vq->dev, endtime)) {
+		cpu_relax();
+		goto again;
+	}
+
+	if (busyloop_timeout)
+		preempt_enable();
+
+	return head;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +378,9 @@ static void handle_tx(struct vhost_net *net)
 			      % UIO_MAXIOV == nvq->done_idx))
 			break;
 
-		head = vhost_get_vq_desc(vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 NULL, NULL);
+		head = vhost_net_tx_get_vq_desc(vq, vq->iov,
+						ARRAY_SIZE(vq->iov),
+						&out, &in);
 		/* On error, stop handling until the next kick. */
 		if (unlikely(head < 0))
 			break;
--
1.8.3.1
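The deadline logic of the polling helper can be tried outside the kernel with a small userspace sketch. Everything here is an assumption made for illustration: `busy_clock_us()`, `clock_after()`, `busy_poll()` and the `ready_after_n()` callback are hypothetical names, `clock_gettime(CLOCK_MONOTONIC)` stands in for `local_clock()`, and the kernel-only exit conditions (`need_resched()`, `signal_pending()`, `single_task_running()`, `vhost_has_work()`) and the preemption handling are left out. Note that shifting nanoseconds right by 10 divides by 1024 rather than 1000, so a timeout of 50 units is 50 * 1024 ns = 51.2 us.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Approximate microseconds: ns >> 10 divides by 1024, not by 1000. */
static uint64_t busy_clock_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec) >> 10;
}

/* Wrap-safe "a is later than b", in the spirit of the kernel's time_after(). */
static bool clock_after(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0;
}

/* Callback that reports whether the polled-for event has arrived. */
typedef bool (*ready_fn)(void *ctx);

/* Demo callback: becomes ready once the counter in ctx has run down. */
static bool ready_after_n(void *ctx)
{
	int *remaining = ctx;

	if (*remaining <= 0)
		return true;
	(*remaining)--;
	return false;
}

/*
 * Check once, then keep re-checking until the event arrives or the
 * deadline passes. A timeout of 0 disables polling: one check, no spin.
 * Returns true if the event was seen, false on timeout.
 */
static bool busy_poll(ready_fn ready, void *ctx, uint64_t timeout_us)
{
	uint64_t endtime = busy_clock_us() + timeout_us;

	for (;;) {
		if (ready(ctx))
			return true;
		if (timeout_us == 0 || clock_after(busy_clock_us(), endtime))
			return false;
	}
}
```

As in the patch, the readiness check always runs at least once, so disabling the timeout falls back to the plain non-polling behaviour.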
Jason Wang
2015-Oct-30 11:58 UTC
[PATCH net-next rfc V2 0/2] basic busy polling support for vhost_net
On 10/29/2015 04:45 PM, Jason Wang wrote:
> Hi all:
>
> This series tries to add basic busy polling for vhost net. The idea is
> simple: at the end of tx processing, busy polling for new tx added
> descriptor and rx receive socket for a while. The maximum number of
> time (in us) could be spent on busy polling was specified through
> module parameter.
>
> Test were done through:
>
> - 50 us as busy loop timeout
> - Netperf 2.6
> - Two machines with back to back connected mlx4
> - Guest with 8 vcpus and 1 queue
>
> Result shows very huge improvement on both tx (at most 158%) and rr
> (at most 53%) while rx is as much as in the past. Most cases the cpu
> utilization is also improved:
>

Just noticed there's something wrong in the setup, so the numbers here are incorrect. Will re-run and post the correct numbers.

Sorry.
Jason Wang
2015-Nov-03 07:46 UTC
[PATCH net-next rfc V2 0/2] basic busy polling support for vhost_net
On 10/30/2015 07:58 PM, Jason Wang wrote:
>
> On 10/29/2015 04:45 PM, Jason Wang wrote:
>> Hi all:
>>
>> This series tries to add basic busy polling for vhost net. The idea is
>> simple: at the end of tx processing, busy polling for new tx added
>> descriptor and rx receive socket for a while. The maximum number of
>> time (in us) could be spent on busy polling was specified through
>> module parameter.
>>
>> Test were done through:
>>
>> - 50 us as busy loop timeout
>> - Netperf 2.6
>> - Two machines with back to back connected mlx4
>> - Guest with 8 vcpus and 1 queue
>>
>> Result shows very huge improvement on both tx (at most 158%) and rr
>> (at most 53%) while rx is as much as in the past. Most cases the cpu
>> utilization is also improved:
>>
> Just notice there's something wrong in the setup. So the numbers are
> incorrect here. Will re-run and post correct number here.
>
> Sorry.

Here are the updated test results:

1) 1 vcpu 1 queue:

TCP_RR
size/session/+thu%/+normalize%
    1/   1/    0%/  -25%
    1/  50/  +12%/    0%
    1/ 100/  +12%/   +1%
    1/ 200/   +9%/   -1%
   64/   1/   +3%/  -21%
   64/  50/   +8%/    0%
   64/ 100/   +7%/    0%
   64/ 200/   +9%/    0%
  256/   1/   +1%/  -25%
  256/  50/   +7%/   -2%
  256/ 100/   +6%/   -2%
  256/ 200/   +4%/   -2%
  512/   1/   +2%/  -19%
  512/  50/   +5%/   -2%
  512/ 100/   +3%/   -3%
  512/ 200/   +6%/   -2%
 1024/   1/   +2%/  -20%
 1024/  50/   +3%/   -3%
 1024/ 100/   +5%/   -3%
 1024/ 200/   +4%/   -2%

Guest RX
size/session/+thu%/+normalize%
   64/   1/   -4%/   -5%
   64/   4/   -3%/  -10%
   64/   8/   -3%/   -5%
  512/   1/  +15%/   +1%
  512/   4/   -5%/   -5%
  512/   8/   -2%/   -4%
 1024/   1/   -5%/  -16%
 1024/   4/   -2%/   -5%
 1024/   8/   -6%/   -6%
 2048/   1/  +10%/   +5%
 2048/   4/   -8%/   -4%
 2048/   8/   -1%/   -4%
 4096/   1/   -9%/  -11%
 4096/   4/   +1%/   -1%
 4096/   8/   +1%/    0%
16384/   1/  +20%/  +11%
16384/   4/    0%/   -3%
16384/   8/   +1%/    0%
65535/   1/  +36%/  +13%
65535/   4/  -10%/   -9%
65535/   8/   -3%/   -2%

Guest TX
size/session/+thu%/+normalize%
   64/   1/   -7%/  -16%
   64/   4/  -14%/  -23%
   64/   8/   -9%/  -20%
  512/   1/  -62%/  -56%
  512/   4/  -62%/  -56%
  512/   8/  -61%/  -53%
 1024/   1/  -66%/  -61%
 1024/   4/  -77%/  -73%
 1024/   8/  -73%/  -67%
 2048/   1/  -74%/  -75%
 2048/   4/  -77%/  -74%
 2048/   8/  -72%/  -68%
 4096/   1/  -65%/  -68%
 4096/   4/  -66%/  -63%
 4096/   8/  -62%/  -57%
16384/   1/  -25%/  -28%
16384/   4/  -28%/  -17%
16384/   8/  -24%/  -10%
65535/   1/  -17%/  -14%
65535/   4/  -22%/   -5%
65535/   8/  -25%/   -9%

- obvious improvement on TCP_RR (at most 12%)
- improvement on guest RX
- huge decrease on Guest TX (at most -75%). This is probably because the virtio-net driver suffers from bufferbloat by orphaning skbs before transmission: the faster vhost runs, the smaller the packets the guest produces. To reduce this effect, turning off gso in the guest gives the following result:

size/session/+thu%/+normalize%
   64/   1/   +3%/  -11%
   64/   4/   +4%/  -10%
   64/   8/   +4%/  -10%
  512/   1/   +2%/   +5%
  512/   4/    0%/   -1%
  512/   8/    0%/    0%
 1024/   1/  +11%/    0%
 1024/   4/    0%/   -1%
 1024/   8/   +3%/   +1%
 2048/   1/   +4%/   -1%
 2048/   4/   +8%/   +3%
 2048/   8/    0%/   -1%
 4096/   1/   +4%/   -1%
 4096/   4/   +1%/    0%
 4096/   8/   +2%/    0%
16384/   1/   +2%/   -2%
16384/   4/   +3%/   +1%
16384/   8/    0%/   -1%
65535/   1/   +9%/   +7%
65535/   4/    0%/   -3%
65535/   8/   -1%/   -1%

2) 8 vcpus 1 queue:

TCP_RR
size/session/+thu%/+normalize%
    1/   1/   +5%/  -14%
    1/  50/   +2%/   +1%
    1/ 100/    0%/   -1%
    1/ 200/    0%/    0%
   64/   1/    0%/  -25%
   64/  50/   +5%/   +5%
   64/ 100/    0%/    0%
   64/ 200/    0%/   -1%
  256/   1/    0%/  -30%
  256/  50/    0%/    0%
  256/ 100/   -2%/   -2%
  256/ 200/    0%/    0%
  512/   1/   +1%/  -23%
  512/  50/   +1%/   +1%
  512/ 100/   +1%/    0%
  512/ 200/   +1%/   +1%
 1024/   1/   +1%/  -23%
 1024/  50/   +5%/   +5%
 1024/ 100/    0%/   -1%
 1024/ 200/    0%/    0%

Guest RX
size/session/+thu%/+normalize%
   64/   1/   +1%/   +1%
   64/   4/   -2%/   +1%
   64/   8/   +6%/  +19%
  512/   1/   +5%/   -7%
  512/   4/   -4%/   -4%
  512/   8/    0%/    0%
 1024/   1/   +1%/   +2%
 1024/   4/   -2%/   -2%
 1024/   8/   -1%/   +7%
 2048/   1/   +8%/   -2%
 2048/   4/    0%/   +5%
 2048/   8/   -1%/  +13%
 4096/   1/   -1%/   +2%
 4096/   4/    0%/   +6%
 4096/   8/   -2%/  +15%
16384/   1/   -1%/    0%
16384/   4/   -2%/   -1%
16384/   8/   -2%/   +2%
65535/   1/   -2%/    0%
65535/   4/   -3%/   -3%
65535/   8/   -2%/   +2%

Guest TX
size/session/+thu%/+normalize%
   64/   1/   +6%/   +3%
   64/   4/  +11%/   +8%
   64/   8/    0%/    0%
  512/   1/  +19%/  +18%
  512/   4/   -4%/   +1%
  512/   8/   -1%/   -1%
 1024/   1/    0%/   +8%
 1024/   4/   -1%/   -1%
 1024/   8/    0%/   +1%
 2048/   1/   +1%/    0%
 2048/   4/   -1%/   -2%
 2048/   8/    0%/    0%
 4096/   1/  +12%/  +14%
 4096/   4/    0%/   -1%
 4096/   8/   -2%/   -1%
16384/   1/   +9%/   +6%
16384/   4/   +3%/   -1%
16384/   8/   +2%/   -1%
65535/   1/   +1%/   -2%
65535/   4/    0%/   -4%
65535/   8/    0%/   -2%

- latency improves a little
- small improvement on single-session rx
- no other obvious changes. This may be because 8 vcpus put enough load on a single vhost thread that busy polling is rarely triggered (except in light-load cases, e.g. 1-session TCP_RR).

3) 8 vcpus 8 queues:

TCP_RR
size/session/+thu%/+normalize%
    1/   1/   +6%/  -16%
    1/  50/  +14%/   +1%
    1/ 100/  +17%/   +3%
    1/ 200/  +16%/   +2%
   64/   1/   +2%/  -19%
   64/  50/  +10%/    0%
   64/ 100/  +17%/   +5%
   64/ 200/  +15%/   +3%
  256/   1/    0%/  -19%
  256/  50/   +5%/   -3%
  256/ 100/   +4%/   -3%
  256/ 200/   +2%/   -4%
  512/   1/   +4%/  -19%
  512/  50/   +7%/   -2%
  512/ 100/   +4%/   -4%
  512/ 200/   +3%/   -4%
 1024/   1/   +9%/  -19%
 1024/  50/   +6%/   -2%
 1024/ 100/   +5%/   -3%
 1024/ 200/   +5%/   -3%

Guest RX
size/session/+thu%/+normalize%
   64/   1/  +18%/  +13%
   64/   4/    0%/   -1%
   64/   8/   -4%/  -11%
  512/   1/   +3%/   -6%
  512/   4/   +1%/  -11%
  512/   8/   -1%/   -7%
 1024/   1/    0%/   -9%
 1024/   4/   +9%/  -16%
 1024/   8/   -1%/  -11%
 2048/   1/    0%/   -2%
 2048/   4/    0%/  -16%
 2048/   8/   -1%/   -2%
 4096/   1/   +3%/    0%
 4096/   4/   -1%/  -12%
 4096/   8/    0%/   -5%
16384/   1/   -2%/   -6%
16384/   4/    0%/   -6%
16384/   8/    0%/   -6%
65535/   1/    0%/    0%
65535/   4/    0%/   -9%
65535/   8/    0%/   +1%

Guest TX
size/session/+thu%/+normalize%
   64/   1/   +7%/   +3%
   64/   4/   +6%/    0%
   64/   8/  +10%/   +5%
  512/   1/    0%/  +14%
  512/   4/   +9%/   -1%
  512/   8/  +14%/   +4%
 1024/   1/  +44%/  +37%
 1024/   4/   +6%/   +2%
 1024/   8/  +19%/  +12%
 2048/   1/  -14%/  -16%
 2048/   4/  +11%/   +8%
 2048/   8/  +26%/  +28%
 4096/   1/  +21%/  +19%
 4096/   4/   +2%/  +10%
 4096/   8/  +14%/   +7%
16384/   1/  +12%/   +4%
16384/   4/   +7%/   +2%
16384/   8/   +2%/   +9%
65535/   1/   -3%/   -5%
65535/   4/   +9%/   +5%
65535/   8/    0%/   -8%

- TCP_RR is clearly improved (at most 17%)
- obvious improvement on Guest TX (at most 44%)