Jason Wang
2013-Nov-13 07:10 UTC
[PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
On 11/13/2013 06:21 AM, Michael Dalton wrote:
> Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
> allocators") changed the mergeable receive buffer size from PAGE_SIZE to
> MTU-size, introducing a single-stream regression for benchmarks with large
> average packet size. There is no single optimal buffer size for all workloads.
> For workloads with packet size <= MTU bytes, MTU + virtio-net header-sized
> buffers are preferred as larger buffers reduce the TCP window due to SKB
> truesize. However, single-stream workloads with large average packet sizes
> have higher throughput if larger (e.g., PAGE_SIZE) buffers are used.
>
> This commit auto-tunes the mergeable receiver buffer packet size by choosing
> the packet buffer size based on an EWMA of the recent packet sizes for the
> receive queue. Packet buffer sizes range from MTU_SIZE + virtio-net header
> len to PAGE_SIZE. This improves throughput for large packet workloads, as
> any workload with average packet size >= PAGE_SIZE will use PAGE_SIZE
> buffers.

Hi Michael:

There's one concern with EWMA. How well does it handle multiple streams,
each with a different packet size? E.g. there may be two flows, one with
256 bytes per packet and another with 64K. It looks like this can result in
us allocating a PAGE_SIZE buffer for the 256-byte flow (which is bad since
the payload/truesize ratio is low) or a 1500+ byte buffer for the 64K flow
(which is ok since we can do coalescing).

> These optimizations interact positively with recent commit
> ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
> which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
> optimizations benefit buffers of any size.
>
> Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
> between two QEMU VMs on a single physical machine. Each VM has two VCPUs
> with all offloads & vhost enabled. All VMs and vhost threads run in a
> single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
> in the system will not be scheduled on the benchmark CPUs. Trunk includes
> SKB rx frag coalescing.
>
> net-next trunk w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
> net-next trunk (MTU-size bufs): 13170.01Gb/s
> net-next trunk + auto-tune: 14555.94Gb/s

Do you have perf numbers just without this patch? We need to know how much
EWMA helps exactly.

> Signed-off-by: Michael Dalton <mwdalton at google.com>
> ---
>  drivers/net/virtio_net.c | 73 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 53 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0c93054..b1086e0 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -27,6 +27,7 @@
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
> +#include <linux/average.h>
>
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -37,10 +38,8 @@ module_param(gso, bool, 0444);
>
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> -#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
> -                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
> -                                L1_CACHE_BYTES))
>  #define GOOD_COPY_LEN 128
> +#define RECEIVE_AVG_WEIGHT 64

Maybe we can make this a module parameter.

>
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>
> @@ -79,6 +78,9 @@ struct receive_queue {
>         /* Chain pages by the private ptr. */
>         struct page *pages;
>
> +       /* Average packet length for mergeable receive buffers. */
> +       struct ewma mrg_avg_pkt_len;
> +
>         /* Page frag for GFP_ATOMIC packet buffer allocation. */
>         struct page_frag atomic_frag;
>
> @@ -302,14 +304,17 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
>         return skb;
>  }
>
> -static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb,
> +                            struct page *head_page)
>  {
>         struct skb_vnet_hdr *hdr = skb_vnet_hdr(head_skb);
>         struct sk_buff *curr_skb = head_skb;
> +       struct page *page = head_page;
>         char *buf;
> -       struct page *page;
> -       int num_buf, len, offset, truesize;
> +       int num_buf, len, offset;
> +       u32 est_buffer_len;
>
> +       len = head_skb->len;
>         num_buf = hdr->mhdr.num_buffers;
>         while (--num_buf) {
>                 int num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> @@ -320,7 +325,6 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
>                         head_skb->dev->stats.rx_length_errors++;
>                         return -EINVAL;
>                 }
> -               truesize = max_t(int, len, MERGE_BUFFER_LEN);
>                 if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
>                         struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
>                         if (unlikely(!nskb)) {
> @@ -338,20 +342,38 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
>                 if (curr_skb != head_skb) {
>                         head_skb->data_len += len;
>                         head_skb->len += len;
> -                       head_skb->truesize += truesize;
> +                       head_skb->truesize += len;
>                 }
>                 page = virt_to_head_page(buf);
>                 offset = buf - (char *)page_address(page);
>                 if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
>                         put_page(page);
>                         skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> -                                            len, truesize);
> +                                            len, len);
>                 } else {
>                         skb_add_rx_frag(curr_skb, num_skb_frags, page,
> -                                       offset, len, truesize);
> +                                       offset, len, len);
>                 }
>                 --rq->num;
>         }
> +       /* All frags before the last frag are fully used -- for those frags,
> +        * truesize = len. Use the size of the most recent buffer allocation
> +        * from the last frag's page to estimate the truesize of the last frag.
> +        * EWMA with a weight of 64 makes the size adjustments quite small in
> +        * the frags allocated on one page (even a order-3 one), and truesize
> +        * doesn't need to be 100% accurate.
> +        */
> +       if (page) {
> +               est_buffer_len = page_private(page);
> +               if (est_buffer_len > len) {
> +                       u32 truesize_delta = est_buffer_len - len;
> +
> +                       curr_skb->truesize += truesize_delta;
> +                       if (curr_skb != head_skb)
> +                               head_skb->truesize += truesize_delta;
> +               }

Is there a chance that est_buffer_len was smaller than or equal to len?

> +       }
> +       ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
>         return 0;
>  }
>
> @@ -382,16 +404,21 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>                 skb_trim(skb, len);
>         } else if (vi->mergeable_rx_bufs) {
>                 struct page *page = virt_to_head_page(buf);
> -               int truesize = max_t(int, len, MERGE_BUFFER_LEN);
> +               /* Use an initial truesize of 'len' bytes for page_to_skb --
> +                * receive_mergeable will fixup the truesize of the last page
> +                * frag if the packet is non-linear (> GOOD_COPY_LEN bytes).
> +                */
>                 skb = page_to_skb(rq, page,
>                                   (char *)buf - (char *)page_address(page),
> -                                 len, truesize);
> +                                 len, len);
>                 if (unlikely(!skb)) {
>                         dev->stats.rx_dropped++;
>                         put_page(page);
>                         return;
>                 }
> -               if (receive_mergeable(rq, skb)) {
> +               if (!skb_is_nonlinear(skb))
> +                       page = NULL;
> +               if (receive_mergeable(rq, skb, page)) {
>                         dev_kfree_skb(skb);
>                         return;
>                 }
> @@ -540,24 +567,29 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
>         struct virtnet_info *vi = rq->vq->vdev->priv;
> +       const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>         struct page_frag *alloc_frag;
>         char *buf;
> -       int err, len, hole;
> +       int err, hole;
> +       u32 buflen;
>
> +       buflen = hdr_len + clamp_t(u32, ewma_read(&rq->mrg_avg_pkt_len),
> +                                  GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> +       buflen = ALIGN(buflen, L1_CACHE_BYTES);
>         alloc_frag = (gfp & __GFP_WAIT) ? &vi->sleep_frag : &rq->atomic_frag;
> -       if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
> +       if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, gfp)))
>                 return -ENOMEM;
>         buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
>         get_page(alloc_frag->page);
> -       len = MERGE_BUFFER_LEN;
> -       alloc_frag->offset += len;
> +       alloc_frag->offset += buflen;
> +       set_page_private(alloc_frag->page, buflen);

Not sure this is accurate, since buflen may change and several frags may
share a single page. So the est_buffer_len we get in receive_mergeable()
may not be the correct value.

>         hole = alloc_frag->size - alloc_frag->offset;
> -       if (hole < MERGE_BUFFER_LEN) {
> -               len += hole;
> +       if (hole < buflen) {
> +               buflen += hole;
>                 alloc_frag->offset += hole;
>         }
>
> -       sg_init_one(rq->sg, buf, len);
> +       sg_init_one(rq->sg, buf, buflen);
>         err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
>         if (err < 0)
>                 put_page(virt_to_head_page(buf));
> @@ -1475,6 +1507,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>                                napi_weight);
>
>                 sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> +               ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>                 sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
>         }
>
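[For readers skimming the patch above: the heart of the auto-tuning is the few
lines in add_recvbuf_mergeable() that turn the EWMA estimate into a buffer
length. The following is only a rough userspace sketch of that computation --
the EWMA update rule approximates linux/average.h with factor 1 and weight 64,
and the 12-byte header length stands in for sizeof(struct
virtio_net_hdr_mrg_rxbuf); it is not the driver code itself.]

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE        4096u
#define L1_CACHE_BYTES   64u
#define GOOD_PACKET_LEN  (14u + 4u + 1500u)  /* ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN */
#define HDR_LEN          12u                 /* mergeable rx header, approximate */
#define WEIGHT           64u

static uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
        return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
        /* Model of rq->mrg_avg_pkt_len: start near an MTU-sized packet. */
        uint32_t avg = GOOD_PACKET_LEN;
        uint32_t pkts[] = { 1500, 1500, 65535, 65535, 65535, 256 };
        unsigned int i;

        for (i = 0; i < sizeof(pkts) / sizeof(pkts[0]); i++) {
                uint32_t buflen;

                /* ewma_add() analogue: move 1/WEIGHT of the way to the sample. */
                avg = (avg * (WEIGHT - 1) + pkts[i]) / WEIGHT;

                /* add_recvbuf_mergeable() analogue: clamp, then cache-align. */
                buflen = HDR_LEN + clamp_u32(avg, GOOD_PACKET_LEN, PAGE_SIZE - HDR_LEN);
                buflen = (buflen + L1_CACHE_BYTES - 1) & ~(L1_CACHE_BYTES - 1);

                printf("pkt %5u -> avg %5u -> next rx buffer %u bytes\n",
                       pkts[i], avg, buflen);
        }
        return 0;
}

In this model, MTU-sized traffic keeps the buffer at the GOOD_PACKET_LEN floor,
while a run of large GSO packets walks it up toward the PAGE_SIZE ceiling --
the behaviour the commit message describes.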
Eric Dumazet
2013-Nov-13 07:40 UTC
[PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
On Wed, 2013-11-13 at 15:10 +0800, Jason Wang wrote:
> There's one concern with EWMA. How well does it handle multiple streams,
> each with a different packet size? E.g. there may be two flows, one with
> 256 bytes per packet and another with 64K. It looks like this can result in
> us allocating a PAGE_SIZE buffer for the 256-byte flow (which is bad since
> the payload/truesize ratio is low) or a 1500+ byte buffer for the 64K flow
> (which is ok since we can do coalescing).

It's hard to predict the future ;)

256-byte frames consume 2.5 KB anyway on a traditional NIC.
If it was a concern, we would have it already.

If you receive a mix of big and small frames, there is no win.

> > +       if (page) {
> > +               est_buffer_len = page_private(page);
> > +               if (est_buffer_len > len) {
> > +                       u32 truesize_delta = est_buffer_len - len;
> > +
> > +                       curr_skb->truesize += truesize_delta;
> > +                       if (curr_skb != head_skb)
> > +                               head_skb->truesize += truesize_delta;
> > +               }
>
> Is there a chance that est_buffer_len was smaller than or equal to len?

Yes, and in this case we do not really care, see below.

> > +       }
> > +       ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);

[...]

> > +       alloc_frag->offset += buflen;
> > +       set_page_private(alloc_frag->page, buflen);
>
> Not sure this is accurate, since buflen may change and several frags may
> share a single page. So the est_buffer_len we get in receive_mergeable()
> may not be the correct value.

skb->truesize has to be reasonably accurate.

For example, fast clone storage is not accounted for in TCP skbs stored in
socket write queues. That's ~256 bytes per skb of 'missing' accounting.
This is about 10% error when TSO/GSO is off.

With this EWMA using a factor of 64, the potential error will be much less
than 10%. Small frames tend to be consumed quite fast (ACK messages, small
UDP frames) in most cases.
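[A rough feel for how sluggish a weight-64 average is, using the same
simplified EWMA update as the sketch after Jason's mail rather than the exact
linux/average.h arithmetic -- illustrative numbers only:]

#include <stdio.h>

int main(void)
{
        unsigned int weight = 64;
        unsigned int avg = 1518;            /* current estimate, ~MTU-sized */
        unsigned int big = 65535, small = 256;

        /* Each sample moves the estimate by roughly 1/weight of the gap. */
        printf("one 64K frame moves the estimate up by   ~%u bytes\n",
               (big - avg) / weight);
        printf("one 256B frame moves the estimate down by ~%u bytes\n",
               (avg - small) / weight);
        return 0;
}

Buffers carved from the same page are therefore sized from estimates that
differ by only a few such steps, which is the sense in which the value
recorded via page_private() is "reasonably accurate" rather than exact.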
Michael S. Tsirkin
2013-Nov-13 17:42 UTC
[PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
On Wed, Nov 13, 2013 at 03:10:20PM +0800, Jason Wang wrote:
> On 11/13/2013 06:21 AM, Michael Dalton wrote:
> > Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
> > allocators") changed the mergeable receive buffer size from PAGE_SIZE to
> > MTU-size, introducing a single-stream regression for benchmarks with large
> > average packet size.

[...]

> > net-next trunk w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
> > net-next trunk (MTU-size bufs): 13170.01Gb/s
> > net-next trunk + auto-tune: 14555.94Gb/s
>
> Do you have perf numbers just without this patch? We need to know how much
> EWMA helps exactly.

Yes I'm curious too.

[...]

> >  #define GOOD_COPY_LEN 128
> > +#define RECEIVE_AVG_WEIGHT 64
>
> Maybe we can make this a module parameter.

I'm not sure it's useful - no one is likely to tune it in practice.
But how about a comment explaining how the number was chosen?

[...]
Michael Dalton
2013-Nov-16 09:06 UTC
[PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Hi,

Apologies for the delay, I wanted to get answers together for all of the
open questions raised on this thread. The first patch in this patchset is
already merged, so after the merge window re-opens I'll send out new
patchsets covering the remaining 3 patches.

After reflecting on feedback from this thread, I think it makes sense to
separate out the per-receive queue page frag allocator patches from the
autotuning patch when the merge window re-opens. The per-receive queue page
frag allocator patches help deal with fragmentation (PAGE_SIZE does not
evenly divide MERGE_BUFFER_LEN), and provide benefits whether or not
auto-tuning is present. Auto-tuning can then be evaluated separately.

On Wed, 2013-11-13 at 15:10 +0800, Jason Wang wrote:
> There's one concern with EWMA. How well does it handle multiple streams,
> each with a different packet size? E.g. there may be two flows, one with
> 256 bytes per packet and another with 64K. It looks like this can result in
> us allocating a PAGE_SIZE buffer for the 256-byte flow (which is bad since
> the payload/truesize ratio is low) or a 1500+ byte buffer for the 64K flow
> (which is ok since we can do coalescing).

If multiple streams of very different packet sizes are arriving on the same
receive queue, no single buffer size is ideal (e.g., large buffers will
cause small packets to take up too much memory, but small buffers may reduce
throughput somewhat for large packets). We don't know a priori which packet
will be delivered to a given receive queue packet buffer, so any size we
choose will not be optimal for all cases if we have significant variance in
packet sizes.

> Do you have perf numbers just without this patch? We need to know how much
> EWMA helps exactly.

Great point, I should have included that in my initial benchmarking. I ran a
benchmark in the same environment as my initial results, this time with the
first 3 patches in this patchset applied but without the autotuning patch.
The average performance over 5 runs of 30-second netperf was 13760.85Gb/s.

> Is there a chance that est_buffer_len was smaller than or equal to len?

Yes, that is possible if the average packet length decreases.

> Not sure this is accurate, since buflen may change and several frags may
> share a single page. So the est_buffer_len we get in receive_mergeable()
> may not be the correct value.

I agree it may not be 100% accurate, but we can choose a weight that will
cause the average packet size to change slowly. Even with an order-3 page
there will not be too many packet buffers allocated from a single page.

On Wed, 2013-11-13 at 17:42 +0800, Michael S. Tsirkin wrote:
> I'm not sure it's useful - no one is likely to tune it in practice.
> But how about a comment explaining how the number was chosen?

That makes sense, I agree a comment is needed. The weight determines how
quickly we react to a change in packet size. As we attempt to fill all free
ring entries on refill (in try_fill_recv), I chose a large weight so that a
short burst of traffic with a different average packet size will not
substantially shift the packet buffer size for the entire ring the next time
try_fill_recv is called. I'll add a comment that compares 64 to nearby
values (32, 16).

Best,

Mike
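[To make the 64-vs-32-vs-16 comparison concrete, a small sketch using the same
simplified EWMA update as before -- the exact counts are indicative only, not
taken from the kernel implementation:]

#include <stdio.h>

static unsigned int steps_to_fraction(unsigned int weight, double fraction)
{
        /* Count samples until the EWMA has covered `fraction` of the gap
         * between its starting value and a new, constant packet size. */
        double avg = 1518.0, target = 65535.0;
        unsigned int n = 0;

        while (avg < 1518.0 + fraction * (target - 1518.0)) {
                avg += (target - avg) / weight;
                n++;
        }
        return n;
}

int main(void)
{
        unsigned int weights[] = { 16, 32, 64 };
        unsigned int i;

        for (i = 0; i < 3; i++)
                printf("weight %2u: ~%2u pkts to cover 50%%, ~%3u pkts to cover 90%%\n",
                       weights[i],
                       steps_to_fraction(weights[i], 0.5),
                       steps_to_fraction(weights[i], 0.9));
        return 0;
}

The larger the weight, the more packets of a new size it takes to re-aim the
buffer size for the ring, which is the trade-off Mike describes above.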
Rusty Russell
2013-Nov-20 02:06 UTC
[PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Eric Dumazet <eric.dumazet at gmail.com> writes:
> On Wed, 2013-11-13 at 15:10 +0800, Jason Wang wrote:
>
>> There's one concern with EWMA. How well does it handle multiple streams,
>> each with a different packet size? E.g. there may be two flows, one with
>> 256 bytes per packet and another with 64K. It looks like this can result in
>> us allocating a PAGE_SIZE buffer for the 256-byte flow (which is bad since
>> the payload/truesize ratio is low) or a 1500+ byte buffer for the 64K flow
>> (which is ok since we can do coalescing).
>
> It's hard to predict the future ;)
>
> 256-byte frames consume 2.5 KB anyway on a traditional NIC.
> If it was a concern, we would have it already.
>
> If you receive a mix of big and small frames, there is no win.

Well, that's not quite true. The device could optimistically look through
the queue a bit for a small buffer; it does not need to consume in order.
We'd probably want a feature bit for this.

I look forward to your thoughts on what mixing algorithm of different sizes
to use, of course.

Meanwhile, I suspect your patch works well because of 4k pages. 80% non-GSO
packets won't drop the average len below 4k. On PPC64 with 64k pages, that's
not true.

I wonder if "last used len" would work about as well; it really might if we
had a smarter device...

Cheers,
Rusty.
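[A back-of-the-envelope check of the 4k-vs-64k-pages point, assuming non-GSO
frames of roughly 1500 bytes and GSO frames at the 64K maximum -- purely
illustrative, since real traffic mixes vary:]

#include <stdio.h>

int main(void)
{
        /* Over a stationary mix, the EWMA settles near the mean packet size. */
        double non_gso = 1500.0, gso = 65535.0;
        double shares[] = { 0.95, 0.80, 0.50 };   /* fraction of non-GSO packets */
        unsigned int i;

        for (i = 0; i < 3; i++) {
                double mean = shares[i] * non_gso + (1.0 - shares[i]) * gso;

                printf("%2.0f%% non-GSO: mean %7.0f bytes -> %s a 4k page, well below a 64k page\n",
                       100.0 * shares[i], mean,
                       mean >= 4096.0 ? "above" : "below");
        }
        return 0;
}

On 4k pages the clamp in add_recvbuf_mergeable() saturates at PAGE_SIZE for
such mixes, while on a 64k-page system the estimate stays far from the page
size -- which is where a "last used len" hint or a device that can pick a
suitably sized buffer out of order could help.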