Mike Waychison
2012-Jan-10 17:40 UTC
[PATCH v2 0/3] virtio_net: Better low memory handling.
The following series applies to net-next.  It changes the low memory paths in virtio_net so that NAPI is not disabled while waiting in the allocator on the slow path.

It attempts to rectify some performance problems we've seen where network performance drops significantly when memory is low.  The working theory is that disabling NAPI while allocations are occurring in the slow refill_work() path can in turn cause significant delays in processing of the receive queue.  The observed situation on an unmodified kernel is that a system with low memory quickly stops being able to refill the rx queue with buffers, in turn causing packets to be dropped by the host OS (as NAPI polling is disabled) while the driver waits for memory to be reclaimed.  The way the process-context refill loop works, the memory subsystem is effectively polled every half second when things look bleak, leading to significant packet loss and congestion window collapse.

The first patch prepares for a batched refill path by splitting the receive buffer allocation and enqueue bits into separate functions.  The second patch introduces the batched refill path itself.  As implemented, it still disables NAPI completely for the whole refill.  The third patch then pushes the disabling and enabling of NAPI down into the refill loop so that it only covers the actual enqueueing of receive buffers on the virtqueue.

This patchset seems to help the test setups that I'm playing with.  Example tests include running several bit-torrent clients within a VM and watching the network throughput on the host.  Other similar workloads (high network traffic, reasonably high disk IO causing memory pressure) also appear to have their throughput problems alleviated by these changes.

As a warning, I've only tested this using the "small buffers" and "mergeable buffers" paths.  I haven't figured out how to force qemu-kvm to not advertise the feature bits that trigger "mergeable" receive buffers, which is what would exercise the "big" receive path.

ChangeLog:
=========
v2:
 - Rewritten to not introduce a spinlock to protect the virtqueue data.

---

Mike Waychison (3):
      virtio_net: Split receive buffer alloc/add
      virtio_net: Batch receive buffer filling
      virtio_net: Don't disable NAPI while allocating

 drivers/net/virtio_net.c |  325 ++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 285 insertions(+), 40 deletions(-)
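In rough pseudo-code, the shape the refill path takes by the end of the series is the following.  This is only a sketch: alloc_one_rx_buffer(), enqueue_one_rx_buffer(), local_list_add() and local_list_pop() are placeholder names standing in for the per-buffer-type helpers introduced in patch 1 and the list handling added in patch 2, not real driver functions.

static void batched_refill_sketch(struct virtnet_info *vi)
{
        LIST_HEAD(batch);
        int i;

        /* Step 1: allocate a batch of receive buffers.  NAPI stays enabled,
         * so received packets keep being processed even if we block in the
         * allocator under memory pressure. */
        for (i = 0; i < MAX_RX_ALLOCATE_BATCH; i++) {
                void *buf = alloc_one_rx_buffer(vi, GFP_KERNEL);
                if (!buf)
                        break;          /* oom: enqueue whatever we did get */
                local_list_add(buf, &batch);
        }

        /* Step 2: disable NAPI only around the work that touches the
         * virtqueue, which is the part that actually needs serialization. */
        napi_disable(&vi->napi);
        while (!list_empty(&batch))
                enqueue_one_rx_buffer(vi, local_list_pop(&batch));
        virtqueue_kick(vi->rvq);
        virtnet_napi_enable(vi);
}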
Mike Waychison
2012-Jan-10 17:41 UTC
[PATCH v2 1/3] virtio_net: Split receive buffer alloc/add
In preparation for allocating receive buffers in the slow path without disabling NAPI, split the allocation and addition of receive buffers apart into two separate functions (per receive buffer type).

While here, move the vi->num accounting into the add functions.

Signed-off-by: Mike Waychison <mikew at google.com>
---
 drivers/net/virtio_net.c |  150 +++++++++++++++++++++++++++++++++++-----------
 1 files changed, 113 insertions(+), 37 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 76fe14e..5531089 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -353,17 +353,35 @@ frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+/*
+ * Allocate an skb for "small" receive buffer configurations.
+ * May return NULL if oom.
+ * No serialization required.
+ */
+static struct sk_buff *alloc_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 {
 	struct sk_buff *skb;
-	struct skb_vnet_hdr *hdr;
-	int err;
 
 	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
 	if (unlikely(!skb))
-		return -ENOMEM;
+		return NULL;
 
 	skb_put(skb, MAX_PACKET_LEN);
+	return skb;
+}
+
+/*
+ * Add a skb to the receive queue for "small" receive buffer configurations.
+ * Returns the number of virtqueue slots left free on success, negative errno
+ * otherwise.
+ * Always consumes skb.
+ * Must be serialized with the napi poll path.
+ */
+static int add_recvbuf_small(struct virtnet_info *vi, struct sk_buff *skb,
+			     gfp_t gfp)
+{
+	struct skb_vnet_hdr *hdr;
+	int err;
 
 	hdr = skb_vnet_hdr(skb);
 	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
@@ -373,36 +391,54 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 	err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
+	else
+		vi->num++;
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+/*
+ * Allocate an list of pages for "big" receive buffer configurations.
+ * Pages are chained through ->private.
+ * May return null if oom.
+ * No serialization required.
+ */
+static struct page *alloc_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 {
-	struct page *first, *list = NULL;
-	char *p;
-	int i, err, offset;
+	struct page *first, *tail = NULL;
+	int i;
 
-	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
-	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
+	/* Build a list of pages chained through ->private.  Built in reverse order */
+	for (i = 0; i < MAX_SKB_FRAGS + 1; ++i) {
 		first = get_a_page(vi, gfp);
 		if (!first) {
-			if (list)
-				give_pages(vi, list);
-			return -ENOMEM;
+			if (tail)
+				give_pages(vi, tail);
+			return NULL;
 		}
-		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
 
-		/* chain new page in list head to match sg */
-		first->private = (unsigned long)list;
-		list = first;
+		/* chain new page in list head */
+		first->private = (unsigned long)tail;
+		tail = first;
 	}
+	return first;
+}
+
+/*
+ * Add a chain of pages to the receive queue for "big" receive buffer
+ * configurations.
+ * Returns the number of virtqueue slots left free on success, negative errno
+ * otherwise.
+ * Always consumes the entire chain of pages.
+ * Must be serialized with the napi poll path.
+ */
+static int add_recvbuf_big(struct virtnet_info *vi, struct page *first,
+			   gfp_t gfp)
+{
+	struct page *page;
+	char *p;
+	int i, err, offset;
 
-	first = get_a_page(vi, gfp);
-	if (!first) {
-		give_pages(vi, list);
-		return -ENOMEM;
-	}
 	p = page_address(first);
 
 	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
@@ -413,30 +449,55 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 	offset = sizeof(struct padded_vnet_hdr);
 	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
 
-	/* chain first in list head */
-	first->private = (unsigned long)list;
+	/* Chain in the rest of the pages */
+	i = 2;	/* Offset to insert further pages into the sg */
+	page = (struct page *)first->private;
+	while (page) {
+		sg_set_buf(&vi->rx_sg[i], page_address(page), PAGE_SIZE);
+		page = (struct page *)page->private;
+		i++;
+	}
+
 	err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
 				    first, gfp);
 	if (err < 0)
 		give_pages(vi, first);
+	else
+		vi->num++;
 
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+/*
+ * Allocate a page for "mergeable" receive buffer configurations.
+ * May return NULL if oom.
+ * No serialization required.
+ */
+static struct page *alloc_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+{
+	return get_a_page(vi, gfp);
+}
+
+/*
+ * Add a page to the receive queue for "mergeable" receive buffer
+ * configurations.
+ * Returns the number of virtqueue slots left free on success, negative errno
+ * otherwise.
+ * Always consumes the page.
+ * Must be serialized with the napi poll path.
+ */
+static int add_recvbuf_mergeable(struct virtnet_info *vi, struct page *page,
+				 gfp_t gfp)
 {
-	struct page *page;
 	int err;
 
-	page = get_a_page(vi, gfp);
-	if (!page)
-		return -ENOMEM;
-
 	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
 
 	err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
 	if (err < 0)
 		give_pages(vi, page);
+	else
+		vi->num++;
 
 	return err;
 }
@@ -454,17 +515,32 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
 	bool oom;
 
 	do {
-		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
-		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
-		else
-			err = add_recvbuf_small(vi, gfp);
+		if (vi->mergeable_rx_bufs) {
+			struct page *page;
+			page = alloc_recvbuf_mergeable(vi, gfp);
+			if (!page)
+				err = -ENOMEM;
+			else
+				err = add_recvbuf_mergeable(vi, page, gfp);
+		} else if (vi->big_packets) {
+			struct page *page;
+			page = alloc_recvbuf_big(vi, gfp);
+			if (!page)
+				err = -ENOMEM;
+			else
+				err = add_recvbuf_big(vi, page, gfp);
+		} else {
+			struct sk_buff *skb;
+			skb = alloc_recvbuf_small(vi, gfp);
+			if (!skb)
+				err = -ENOMEM;
+			else
+				err = add_recvbuf_small(vi, skb, gfp);
+		}
 		oom = err == -ENOMEM;
 		if (err < 0)
 			break;
-		++vi->num;
 	} while (err > 0);
 	if (unlikely(vi->num > vi->max))
 		vi->max = vi->num;
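The net effect of the split is a two-step contract per buffer type: the alloc_*() helper may sleep and may fail (returning NULL) without needing any NAPI serialization, while the add_*() helper consumes the buffer, must run serialized with the NAPI poll path, and returns the number of virtqueue slots left free (or a negative errno).  A condensed illustration of the mergeable case, paraphrasing the try_fill_recv() hunk above rather than adding new driver code:

        struct page *page;
        int err;

        page = alloc_recvbuf_mergeable(vi, gfp);        /* may sleep/fail; no NAPI serialization needed */
        if (!page)
                err = -ENOMEM;
        else
                err = add_recvbuf_mergeable(vi, page, gfp);     /* consumes page; NAPI-serialized */

        /* err > 0: added, slots still free.  err == 0: added, ring now full.
         * err == -ENOSPC: no room (the add helper already released the page).
         * err == -ENOMEM: the virtqueue itself could not allocate. */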
Mike Waychison
2012-Jan-10 17:41 UTC
[PATCH v2 2/3] virtio_net: Batch receive buffer filling
In preparation for moving the allocation of receive buffers on the slow path outside of the NAPI disable block in refill_work(), introduce a new method, try_fill_recvbatch(), which fills the receive buffers in batched mode.

This new function is implemented either by fill_recvbatch_pages(), in the case of "big" or "mergeable" receive buffers, or by fill_recvbatch_small() for the small buffer fallback case.  Although their algorithms are similar, the list enqueueing and cleanup are different enough that duplicating the overall algorithm resulted in cleaner code.

The batched operation allows us to later push the disabling of NAPI on the virtio_net device down to only cover the bits that manipulate the virtio queue, letting the bulk of the allocations proceed while the NIC can still process received packets.

Signed-off-by: Mike Waychison <mikew at google.com>
---
 drivers/net/virtio_net.c |  142 ++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 141 insertions(+), 1 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5531089..10d9761 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -37,6 +37,7 @@ module_param(gso, bool, 0444);
 /* FIXME: MTU in config. */
 #define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN	128
+#define MAX_RX_ALLOCATE_BATCH	32
 
 #define VIRTNET_SEND_COMMAND_SG_MAX	2
 #define VIRTNET_DRIVER_VERSION "1.0.0"
@@ -572,6 +573,144 @@ static void virtnet_napi_enable(struct virtnet_info *vi)
 	}
 }
 
+/*
+ * Try to fill a "big" or "mergeable" receive queue using batching.
+ * Caller must serialize against NAPI.
+ * Returns false if we failed to finish due to oom.
+ */
+static bool fill_recvbatch_pages(struct virtnet_info *vi)
+{
+	bool oom = false;
+	bool full = false;
+	LIST_HEAD(local_list);
+	struct page *page, *npage;
+	int i;
+
+	BUG_ON(!vi->big_packets && !vi->mergeable_rx_bufs);
+fill_more:
+	/* Allocate a batch. */
+	for (i = 0; i < MAX_RX_ALLOCATE_BATCH; i++) {
+		if (vi->mergeable_rx_bufs)
+			page = alloc_recvbuf_mergeable(vi, GFP_KERNEL);
+		else /* vi->big_packets */
+			page = alloc_recvbuf_big(vi, GFP_KERNEL);
+		if (!page) {
+			oom = true;
+			break;
+		}
+		list_add_tail(&page->lru, &local_list);
+	}
+
+	/* Enqueue batch as available. */
+	list_for_each_entry_safe(page, npage, &local_list, lru) {
+		int err;
+
+		list_del(&page->lru);
+		if (vi->mergeable_rx_bufs)
+			err = add_recvbuf_mergeable(vi, page, GFP_KERNEL);
+		else /* vi->big_packets */
+			err = add_recvbuf_big(vi, page, GFP_KERNEL);
+		if (err > 0)
+			continue;
+		if (err == -ENOSPC || err == 0)
+			full = true;
+		else if (err == -ENOMEM)
+			oom = true;
+		else
+			BUG();
+		break;
+	}
+	if (unlikely(vi->num > vi->max))
+		vi->max = vi->num;
+
+	/* Cleanup any remaining entries on the list */
+	if (unlikely(!list_empty(&local_list))) {
+		list_for_each_entry_safe(page, npage, &local_list, lru) {
+			list_del(&page->lru);
+			give_pages(vi, page);
+		}
+	}
+
+	if (!oom && !full)
+		goto fill_more;
+
+	return !oom;
+}
+
+/*
+ * Try to fill a "small" receive queue using batching.
+ * Caller must serialize against NAPI.
+ * Returns false if we failed to finish due to oom.
+ */
+static bool fill_recvbatch_small(struct virtnet_info *vi)
+{
+	bool oom = false;
+	bool full = false;
+	LIST_HEAD(local_list);
+	struct list_head *pos, *npos;
+	struct sk_buff *skb;
+	int i;
+
+fill_more:
+	/* Allocate a batch. */
+	for (i = 0; i < MAX_RX_ALLOCATE_BATCH; i++) {
+		skb = alloc_recvbuf_small(vi, GFP_KERNEL);
+		if (!skb) {
+			oom = true;
+			break;
+		}
+		list_add_tail((struct list_head *)skb, &local_list);
+	}
+
+	/* Enqueue batch as available. */
+	list_for_each_safe(pos, npos, &local_list) {
+		int err;
+
+		list_del(pos);
+		skb = (struct sk_buff *)pos;
+
+		err = add_recvbuf_small(vi, skb, GFP_KERNEL);
+		if (err > 0)
+			continue;
+		if (err == -ENOSPC || err == 0)
+			full = true;
+		else if (err == -ENOMEM)
+			oom = true;
+		else
+			BUG();
+		break;
+	}
+	if (unlikely(vi->num > vi->max))
+		vi->max = vi->num;
+
+	/* Cleanup any remaining entries on the list */
+	if (unlikely(!list_empty(&local_list))) {
+		list_for_each_safe(pos, npos, &local_list) {
+			skb = (struct sk_buff *)pos;
+			list_del(pos);
+			dev_kfree_skb(skb);
+		}
+	}
+
+	if (!oom && !full)
+		goto fill_more;
+
+	return !oom;
+}
+
+/*
+ * Refill the receive queues from process context.
+ * Caller must serialize against NAPI.
+ * Returns false if we failed to allocate due to memory pressure.
+ */
+static bool try_fill_recvbatch(struct virtnet_info *vi)
+{
+	if (vi->mergeable_rx_bufs || vi->big_packets)
+		return fill_recvbatch_pages(vi);
+	else
+		return fill_recvbatch_small(vi);
+}
+
 static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi;
@@ -579,7 +718,8 @@ static void refill_work(struct work_struct *work)
 	vi = container_of(work, struct virtnet_info, refill.work);
 	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
+	still_empty = !try_fill_recvbatch(vi);
+	virtqueue_kick(vi->rvq);
 	virtnet_napi_enable(vi);
 
 	/* In theory, this can happen: if we don't get any buffers in
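One note on the small-buffer path: the batch is threaded through the sk_buffs themselves by casting each skb to a struct list_head, which works because, in this kernel's struct sk_buff, the first two fields are the next/prev pointers.  The same local batching could also be written with the stock sk_buff_head helpers; a minimal sketch of that alternative (not part of this patch, shown only for comparison):

static int alloc_small_batch(struct virtnet_info *vi, struct sk_buff_head *batch)
{
        int i;

        __skb_queue_head_init(batch);           /* local, unlocked queue */
        for (i = 0; i < MAX_RX_ALLOCATE_BATCH; i++) {
                struct sk_buff *skb = alloc_recvbuf_small(vi, GFP_KERNEL);
                if (!skb)
                        return -ENOMEM;         /* caller still enqueues what was allocated */
                __skb_queue_tail(batch, skb);
        }
        return 0;
}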
Mike Waychison
2012-Jan-10 17:41 UTC
[PATCH v2 3/3] virtio_net: Don't disable NAPI while allocating
This patch changes the batch refill path so that NAPI is only disabled while we are actually enqueueing the available receive buffers onto the virtqueue, rather than across the allocations as well.

For "small" receive buffers, pushing the enabling and disabling of NAPI down is pretty straightforward.  For the "big" and "mergeable" receive buffers, however, care must be taken to keep both the allocator and the release code from touching the per-device free-page list at vi->pages, which is only serialized by virtue of NAPI serialization.

To handle this case, the allocators are modified to take a "napi_serialized" flag, which indicates whether or not they may touch vi->pages.  As well, for page release, __give_pages() is introduced to free chains of pages (threaded through ->private); it is to be used when we haven't stopped NAPI.

The failure paths in the "add" functions for the receive buffers do not need to be updated, as they must be called serially with the NAPI poll and are therefore allowed to touch vi->pages (indirectly through give_pages()).

Signed-off-by: Mike Waychison <mikew at google.com>
---
 drivers/net/virtio_net.c |   67 +++++++++++++++++++++++++++++++++-------------
 1 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 10d9761..ec0e1fb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -118,6 +118,20 @@ static void give_pages(struct virtnet_info *vi, struct page *page)
 	vi->pages = page;
 }
 
+/*
+ * Give a list of pages threaded through the ->private field back to the page
+ * allocator.  To be used by paths that aren't serialized with napi.
+ */
+static void __give_pages(struct page *page)
+{
+	struct page *nextpage;
+	do {
+		nextpage = (struct page *)page->private;
+		__free_page(page);
+		page = nextpage;
+	} while (page);
+}
+
 static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 {
 	struct page *p = vi->pages;
@@ -402,19 +416,24 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct sk_buff *skb,
  * Allocate an list of pages for "big" receive buffer configurations.
  * Pages are chained through ->private.
  * May return null if oom.
- * No serialization required.
+ * napi_serialized must be false if we aren't serialized with the napi path.
  */
-static struct page *alloc_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static struct page *alloc_recvbuf_big(struct virtnet_info *vi,
+				      bool napi_serialized, gfp_t gfp)
 {
 	struct page *first, *tail = NULL;
 	int i;
 
 	/* Build a list of pages chained through ->private.  Built in reverse order */
 	for (i = 0; i < MAX_SKB_FRAGS + 1; ++i) {
-		first = get_a_page(vi, gfp);
+		first = napi_serialized ? get_a_page(vi, gfp) : alloc_page(gfp);
 		if (!first) {
-			if (tail)
-				give_pages(vi, tail);
+			if (tail) {
+				if (napi_serialized)
+					give_pages(vi, tail);
+				else
+					__give_pages(tail);
+			}
 			return NULL;
 		}
 
@@ -472,11 +491,18 @@ static int add_recvbuf_big(struct virtnet_info *vi, struct page *first,
 /*
  * Allocate a page for "mergeable" receive buffer configurations.
  * May return NULL if oom.
- * No serialization required.
+ * @napi_serialized must be false if we aren't serialized with the napi path.
  */
-static struct page *alloc_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static struct page *alloc_recvbuf_mergeable(struct virtnet_info *vi,
+					    bool napi_serialized, gfp_t gfp)
 {
-	return get_a_page(vi, gfp);
+	struct page *page;
+	if (napi_serialized)
+		return get_a_page(vi, gfp);
+	page = alloc_page(gfp);
+	if (page)
+		page->private = 0;
+	return page;
 }
 
 /*
@@ -518,14 +544,14 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
 	do {
 		if (vi->mergeable_rx_bufs) {
 			struct page *page;
-			page = alloc_recvbuf_mergeable(vi, gfp);
+			page = alloc_recvbuf_mergeable(vi, true, gfp);
 			if (!page)
 				err = -ENOMEM;
 			else
 				err = add_recvbuf_mergeable(vi, page, gfp);
 		} else if (vi->big_packets) {
 			struct page *page;
-			page = alloc_recvbuf_big(vi, gfp);
+			page = alloc_recvbuf_big(vi, true, gfp);
 			if (!page)
 				err = -ENOMEM;
 			else
@@ -575,7 +601,7 @@ static void virtnet_napi_enable(struct virtnet_info *vi)
 
 /*
  * Try to fill a "big" or "mergeable" receive queue using batching.
- * Caller must serialize against NAPI.
+ * This call will serialize itself against NAPI.
  * Returns false if we failed to finish due to oom.
  */
 static bool fill_recvbatch_pages(struct virtnet_info *vi)
@@ -591,9 +617,9 @@ fill_more:
 	/* Allocate a batch. */
 	for (i = 0; i < MAX_RX_ALLOCATE_BATCH; i++) {
 		if (vi->mergeable_rx_bufs)
-			page = alloc_recvbuf_mergeable(vi, GFP_KERNEL);
+			page = alloc_recvbuf_mergeable(vi, false, GFP_KERNEL);
 		else /* vi->big_packets */
-			page = alloc_recvbuf_big(vi, GFP_KERNEL);
+			page = alloc_recvbuf_big(vi, false, GFP_KERNEL);
 		if (!page) {
 			oom = true;
 			break;
@@ -602,6 +628,7 @@ fill_more:
 	}
 
 	/* Enqueue batch as available. */
+	napi_disable(&vi->napi);
 	list_for_each_entry_safe(page, npage, &local_list, lru) {
 		int err;
 
@@ -622,12 +649,14 @@ fill_more:
 	}
 	if (unlikely(vi->num > vi->max))
 		vi->max = vi->num;
+	virtqueue_kick(vi->rvq);
+	virtnet_napi_enable(vi);
 
 	/* Cleanup any remaining entries on the list */
 	if (unlikely(!list_empty(&local_list))) {
 		list_for_each_entry_safe(page, npage, &local_list, lru) {
 			list_del(&page->lru);
-			give_pages(vi, page);
+			__give_pages(page);
 		}
 	}
 
@@ -639,7 +668,7 @@ fill_more:
 
 /*
  * Try to fill a "small" receive queue using batching.
- * Caller must serialize against NAPI.
+ * This call will serialize itself against NAPI.
  * Returns false if we failed to finish due to oom.
  */
 static bool fill_recvbatch_small(struct virtnet_info *vi)
@@ -663,6 +692,7 @@ fill_more:
 	}
 
 	/* Enqueue batch as available. */
+	napi_disable(&vi->napi);
 	list_for_each_safe(pos, npos, &local_list) {
 		int err;
 
@@ -682,6 +712,8 @@ fill_more:
 	}
 	if (unlikely(vi->num > vi->max))
 		vi->max = vi->num;
+	virtqueue_kick(vi->rvq);
+	virtnet_napi_enable(vi);
 
 	/* Cleanup any remaining entries on the list */
 	if (unlikely(!list_empty(&local_list))) {
@@ -700,7 +732,7 @@ fill_more:
 
 /*
  * Refill the receive queues from process context.
- * Caller must serialize against NAPI.
+ * Callee will serialize against NAPI itself.
  * Returns false if we failed to allocate due to memory pressure.
  */
 static bool try_fill_recvbatch(struct virtnet_info *vi)
@@ -717,10 +749,7 @@ static void refill_work(struct work_struct *work)
 	bool still_empty;
 
 	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
 	still_empty = !try_fill_recvbatch(vi);
-	virtqueue_kick(vi->rvq);
-	virtnet_napi_enable(vi);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
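To summarize the serialization rule that the napi_serialized flag encodes (a descriptive recap of the patch above, not additional driver code):

/*
 * Who may touch vi->pages, the per-device free-page list that is only
 * protected by NAPI serialization:
 *
 *   caller                            napi_serialized   alloc from      release via
 *   try_fill_recv() (napi path)       true              get_a_page()    give_pages()
 *   fill_recvbatch_*() allocation     false             alloc_page()    __give_pages()
 *
 * The add_recvbuf_*() failure paths keep using give_pages() because they
 * only run while NAPI is disabled.
 */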