This patch series enhances the existing virtio-balloon with the following new features: 1) fast ballooning: transfer ballooned pages between the guest and host in chunks using sgs, instead of one by one; and 2) cmdq: a new virtqueue to send commands between the device and driver. Currently, it supports commands to report memory stats (replace the old statq mechanism) and report guest unused pages. Change Log: v11->v12: 1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages. 2) virtio-ring: enable the driver to build up a desc chain using vring desc. 3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro to lock/unlock the vq when a vq operation starts/ends. 4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async() 5) virtio-balloon: describe chunks of ballooned pages and free pages blocks directly using one or more chains of desc from the vq. v10->v11: 1) virtio_balloon: use vring_desc to describe a chunk; 2) virtio_ring: support to add an indirect desc table to virtqueue; 3) virtio_balloon: use cmdq to report guest memory statistics. v9->v10: 1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON; 2) virtio-balloon: add virtballoon_validate(); 3) virtio-balloon: msg format change; 4) virtio-balloon: move miscq handling to a task on system_freezable_wq; 5) virtio-balloon: code cleanup. v8->v9: 1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous implementation; 2) Simpler function to get the free page block. v7->v8: 1) Use only one chunk format, instead of two. 2) re-write the virtio-balloon implementation patch. 3) commit changes 4) patch re-org Liang Li (1): virtio-balloon: deflate via a page list Matthew Wilcox (1): Introduce xbitmap Wei Wang (6): virtio-balloon: coding format cleanup xbitmap: add xb_find_next_bit() and xb_zero() virtio-balloon: VIRTIO_BALLOON_F_SG mm: support reporting free page blocks mm: export symbol of next_zone and first_online_pgdat virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ drivers/virtio/virtio_balloon.c | 414 ++++++++++++++++++++++++++++++++---- drivers/virtio/virtio_ring.c | 224 +++++++++++++++++-- include/linux/mm.h | 5 + include/linux/radix-tree.h | 2 + include/linux/virtio.h | 22 ++ include/linux/xbitmap.h | 53 +++++ include/uapi/linux/virtio_balloon.h | 11 + lib/radix-tree.c | 164 +++++++++++++- mm/mmzone.c | 2 + mm/page_alloc.c | 96 +++++++++ 10 files changed, 926 insertions(+), 67 deletions(-) create mode 100644 include/linux/xbitmap.h -- 2.7.4
From: Liang Li <liang.z.li at intel.com> This patch saves the deflated pages to a list, instead of the PFN array. Accordingly, the balloon_pfn_to_page() function is removed. Signed-off-by: Liang Li <liang.z.li at intel.com> Signed-off-by: Michael S. Tsirkin <mst at redhat.com> Signed-off-by: Wei Wang <wei.w.wang at intel.com> --- drivers/virtio/virtio_balloon.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 22caf80..7f38ae6 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -104,12 +104,6 @@ static u32 page_to_balloon_pfn(struct page *page) return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE; } -static struct page *balloon_pfn_to_page(u32 pfn) -{ - BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE); - return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE); -} - static void balloon_ack(struct virtqueue *vq) { struct virtio_balloon *vb = vq->vdev->priv; @@ -182,18 +176,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) return num_allocated_pages; } -static void release_pages_balloon(struct virtio_balloon *vb) +static void release_pages_balloon(struct virtio_balloon *vb, + struct list_head *pages) { - unsigned int i; - struct page *page; + struct page *page, *next; - /* Find pfns pointing at start of each page, get pages and free them. */ - for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) { - page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev, - vb->pfns[i])); + list_for_each_entry_safe(page, next, pages, lru) { if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) adjust_managed_page_count(page, 1); + list_del(&page->lru); put_page(page); /* balloon reference */ } } @@ -203,6 +195,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) unsigned num_freed_pages; struct page *page; struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; + LIST_HEAD(pages); /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb->pfns)); @@ -216,6 +209,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) if (!page) break; set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + list_add(&page->lru, &pages); vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } @@ -227,7 +221,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) */ if (vb->num_pfns != 0) tell_host(vb, vb->deflate_vq); - release_pages_balloon(vb); + release_pages_balloon(vb, &pages); mutex_unlock(&vb->balloon_lock); return num_freed_pages; } -- 2.7.4
Clean up the comment format. Signed-off-by: Wei Wang <wei.w.wang at intel.com> --- drivers/virtio/virtio_balloon.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 7f38ae6..f0b3a0b 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -132,8 +132,10 @@ static void set_page_pfns(struct virtio_balloon *vb, { unsigned int i; - /* Set balloon pfns pointing at this page. - * Note that the first pfn points at start of the page. */ + /* + * Set balloon pfns pointing at this page. + * Note that the first pfn points at start of the page. + */ for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++) pfns[i] = cpu_to_virtio32(vb->vdev, page_to_balloon_pfn(page) + i); -- 2.7.4
From: Matthew Wilcox <mawilcox at microsoft.com> The eXtensible Bitmap is a sparse bitmap representation which is efficient for set bits which tend to cluster. It supports up to 'unsigned long' worth of bits, and this commit adds the bare bones -- xb_set_bit(), xb_clear_bit() and xb_test_bit(). Signed-off-by: Matthew Wilcox <mawilcox at microsoft.com> Signed-off-by: Wei Wang <wei.w.wang at intel.com> --- include/linux/radix-tree.h | 2 + include/linux/xbitmap.h | 49 ++++++++++++++++ lib/radix-tree.c | 138 ++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 187 insertions(+), 2 deletions(-) create mode 100644 include/linux/xbitmap.h diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 3e57350..428ccc9 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -317,6 +317,8 @@ void radix_tree_iter_delete(struct radix_tree_root *, struct radix_tree_iter *iter, void __rcu **slot); void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *); void *radix_tree_delete(struct radix_tree_root *, unsigned long); +bool __radix_tree_delete(struct radix_tree_root *root, + struct radix_tree_node *node, void __rcu **slot); void radix_tree_clear_tags(struct radix_tree_root *, struct radix_tree_node *, void __rcu **slot); unsigned int radix_tree_gang_lookup(const struct radix_tree_root *, diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h new file mode 100644 index 0000000..0b93a46 --- /dev/null +++ b/include/linux/xbitmap.h @@ -0,0 +1,49 @@ +/* + * eXtensible Bitmaps + * Copyright (c) 2017 Microsoft Corporation <mawilcox at microsoft.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility. + * All bits are initially zero. + */ + +#include <linux/idr.h> + +struct xb { + struct radix_tree_root xbrt; +}; + +#define XB_INIT { \ + .xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT), \ +} +#define DEFINE_XB(name) struct xb name = XB_INIT + +static inline void xb_init(struct xb *xb) +{ + INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT); +} + +int xb_set_bit(struct xb *xb, unsigned long bit); +bool xb_test_bit(const struct xb *xb, unsigned long bit); +int xb_clear_bit(struct xb *xb, unsigned long bit); + +static inline bool xb_empty(const struct xb *xb) +{ + return radix_tree_empty(&xb->xbrt); +} + +void xb_preload(gfp_t gfp); + +static inline void xb_preload_end(void) +{ + preempt_enable(); +} diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 898e879..d624914 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -37,6 +37,7 @@ #include <linux/rcupdate.h> #include <linux/slab.h> #include <linux/string.h> +#include <linux/xbitmap.h> /* Number of nodes in fully populated tree of given height */ @@ -78,6 +79,14 @@ static struct kmem_cache *radix_tree_node_cachep; #define IDA_PRELOAD_SIZE (IDA_MAX_PATH * 2 - 1) /* + * The XB can go up to unsigned long, but also uses a bitmap. 
+ */ +#define XB_INDEX_BITS (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS)) +#define XB_MAX_PATH (DIV_ROUND_UP(XB_INDEX_BITS, \ + RADIX_TREE_MAP_SHIFT)) +#define XB_PRELOAD_SIZE (XB_MAX_PATH * 2 - 1) + +/* * Per-cpu pool of preloaded nodes */ struct radix_tree_preload { @@ -840,6 +849,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, offset, 0, 0); if (!child) return -ENOMEM; + if (is_idr(root)) + all_tag_set(child, IDR_FREE); rcu_assign_pointer(*slot, node_to_entry(child)); if (node) node->count++; @@ -1986,8 +1997,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root, delete_node(root, node, update_node, private); } -static bool __radix_tree_delete(struct radix_tree_root *root, - struct radix_tree_node *node, void __rcu **slot) +bool __radix_tree_delete(struct radix_tree_root *root, + struct radix_tree_node *node, void __rcu **slot) { void *old = rcu_dereference_raw(*slot); int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0; @@ -2137,6 +2148,129 @@ int ida_pre_get(struct ida *ida, gfp_t gfp) } EXPORT_SYMBOL(ida_pre_get); +void xb_preload(gfp_t gfp) +{ + __radix_tree_preload(gfp, XB_PRELOAD_SIZE); + if (!this_cpu_read(ida_bitmap)) { + struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp); + + if (!bitmap) + return; + bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap); + kfree(bitmap); + } +} +EXPORT_SYMBOL(xb_preload); + +int xb_set_bit(struct xb *xb, unsigned long bit) +{ + int err; + unsigned long index = bit / IDA_BITMAP_BITS; + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned long ebit; + + bit %= IDA_BITMAP_BITS; + ebit = bit + 2; + + err = __radix_tree_create(root, index, 0, &node, &slot); + if (err) + return err; + bitmap = rcu_dereference_raw(*slot); + if (radix_tree_exception(bitmap)) { + unsigned long tmp = (unsigned long)bitmap; + + if (ebit < BITS_PER_LONG) { + tmp |= 1UL << ebit; + rcu_assign_pointer(*slot, (void *)tmp); + return 0; + } + bitmap = this_cpu_xchg(ida_bitmap, NULL); + if (!bitmap) + return -EAGAIN; + memset(bitmap, 0, sizeof(*bitmap)); + bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT; + rcu_assign_pointer(*slot, bitmap); + } + + if (!bitmap) { + if (ebit < BITS_PER_LONG) { + bitmap = (void *)((1UL << ebit) | + RADIX_TREE_EXCEPTIONAL_ENTRY); + __radix_tree_replace(root, node, slot, bitmap, NULL, + NULL); + return 0; + } + bitmap = this_cpu_xchg(ida_bitmap, NULL); + if (!bitmap) + return -EAGAIN; + memset(bitmap, 0, sizeof(*bitmap)); + __radix_tree_replace(root, node, slot, bitmap, NULL, NULL); + } + + __set_bit(bit, bitmap->bitmap); + return 0; +} + +int xb_clear_bit(struct xb *xb, unsigned long bit) +{ + unsigned long index = bit / IDA_BITMAP_BITS; + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned long ebit; + + bit %= IDA_BITMAP_BITS; + ebit = bit + 2; + + bitmap = __radix_tree_lookup(root, index, &node, &slot); + if (radix_tree_exception(bitmap)) { + unsigned long tmp = (unsigned long)bitmap; + + if (ebit >= BITS_PER_LONG) + return 0; + tmp &= ~(1UL << ebit); + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY) + __radix_tree_delete(root, node, slot); + else + rcu_assign_pointer(*slot, (void *)tmp); + return 0; + } + + if (!bitmap) + return 0; + + __clear_bit(bit, bitmap->bitmap); + if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) { + kfree(bitmap); + __radix_tree_delete(root, node, slot); + } + + return 0; +} + +bool xb_test_bit(const struct xb *xb, 
unsigned long bit) +{ + unsigned long index = bit / IDA_BITMAP_BITS; + const struct radix_tree_root *root = &xb->xbrt; + struct ida_bitmap *bitmap = radix_tree_lookup(root, index); + + bit %= IDA_BITMAP_BITS; + + if (!bitmap) + return false; + if (radix_tree_exception(bitmap)) { + bit += RADIX_TREE_EXCEPTIONAL_SHIFT; + if (bit > BITS_PER_LONG) + return false; + return (unsigned long)bitmap & (1UL << bit); + } + return test_bit(bit, bitmap->bitmap); +} + void __rcu **idr_get_free(struct radix_tree_root *root, struct radix_tree_iter *iter, gfp_t gfp, int end) { -- 2.7.4
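For illustration, a minimal usage sketch of the API introduced above (the function name is made up for the example; caller-side locking and full error handling are omitted):

#include <linux/gfp.h>
#include <linux/xbitmap.h>

static DEFINE_XB(example_xb);

/* Hypothetical smoke test: set, query and clear a single bit. */
static int example_xb_usage(void)
{
	int err;

	/* Preload per-cpu node/bitmap storage before inserting. */
	xb_preload(GFP_KERNEL);
	err = xb_set_bit(&example_xb, 4096);
	xb_preload_end();
	if (err)
		return err;

	if (!xb_test_bit(&example_xb, 4096))
		return -EINVAL;

	xb_clear_bit(&example_xb, 4096);
	return xb_empty(&example_xb) ? 0 : -EINVAL;
}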
Wei Wang
2017-Jul-12 12:40 UTC
[PATCH v12 4/8] xbitmap: add xb_find_next_bit() and xb_zero()
xb_find_next_bit() is added to support find the next "1" or "0" bit in the given range. xb_zero() is added to support zero the given range of bits. Signed-off-by: Wei Wang <wei.w.wang at intel.com> --- include/linux/xbitmap.h | 4 ++++ lib/radix-tree.c | 26 ++++++++++++++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h index 0b93a46..88c2045 100644 --- a/include/linux/xbitmap.h +++ b/include/linux/xbitmap.h @@ -36,6 +36,10 @@ int xb_set_bit(struct xb *xb, unsigned long bit); bool xb_test_bit(const struct xb *xb, unsigned long bit); int xb_clear_bit(struct xb *xb, unsigned long bit); +void xb_zero(struct xb *xb, unsigned long start, unsigned long end); +unsigned long xb_find_next_bit(struct xb *xb, unsigned long start, + unsigned long end, bool set); + static inline bool xb_empty(const struct xb *xb) { return radix_tree_empty(&xb->xbrt); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index d624914..c45b910 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -2271,6 +2271,32 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit) return test_bit(bit, bitmap->bitmap); } +void xb_zero(struct xb *xb, unsigned long start, unsigned long end) +{ + unsigned long i; + + for (i = start; i <= end; i++) + xb_clear_bit(xb, i); +} + +/* + * Find the next one (@set = 1) or zero (@set = 0) bit within the bit range + * from @start to @end in @xb. If no such bit is found in the given range, + * bit end + 1 will be returned. + */ +unsigned long xb_find_next_bit(struct xb *xb, unsigned long start, + unsigned long end, bool set) +{ + unsigned long i; + + for (i = start; i <= end; i++) { + if (xb_test_bit(xb, i) == set) + break; + } + + return i; +} + void __rcu **idr_get_free(struct radix_tree_root *root, struct radix_tree_iter *iter, gfp_t gfp, int end) { -- 2.7.4
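For illustration, the two helpers are meant to be combined into a range walk like the following (the function name is made up here; the balloon patch later in the series uses the same pattern to turn runs of set bits into sg entries):

#include <linux/xbitmap.h>

/*
 * Visit every run of contiguous set bits in [start, end] and clear it.
 * Returns the number of runs found.
 */
static unsigned long example_xb_walk(struct xb *xb, unsigned long start,
				     unsigned long end)
{
	unsigned long run_start = start, run_end, nr_runs = 0;

	while (run_start <= end) {
		/* First set bit at or after run_start, or end + 1 if none. */
		run_start = xb_find_next_bit(xb, run_start, end, 1);
		if (run_start == end + 1)
			break;
		/* The first clear bit after the run bounds it. */
		run_end = xb_find_next_bit(xb, run_start + 1, end, 0);
		nr_runs++;
		/* [run_start, run_end - 1] is one contiguous run. */
		xb_zero(xb, run_start, run_end - 1);
		run_start = run_end + 1;
	}
	return nr_runs;
}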
Add a new feature, VIRTIO_BALLOON_F_SG, which enables to transfer a chunk of ballooned (i.e. inflated/deflated) pages using scatter-gather lists to the host. The implementation of the previous virtio-balloon is not very efficient, because the balloon pages are transferred to the host one by one. Here is the breakdown of the time in percentage spent on each step of the balloon inflating process (inflating 7GB of an 8GB idle guest). 1) allocating pages (6.5%) 2) sending PFNs to host (68.3%) 3) address translation (6.1%) 4) madvise (19%) It takes about 4126ms for the inflating process to complete. The above profiling shows that the bottlenecks are stage 2) and stage 4). This patch optimizes step 2) by transferring pages to the host in sgs. An sg describes a chunk of guest physically continuous pages. With this mechanism, step 4) can also be optimized by doing address translation and madvise() in chunks rather than page by page. With this new feature, the above ballooning process takes ~491ms resulting in an improvement of ~88%. TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of a single page each time. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Signed-off-by: Liang Li <liang.z.li at intel.com> Suggested-by: Michael S. Tsirkin <mst at redhat.com> --- drivers/virtio/virtio_balloon.c | 141 ++++++++++++++++++++++--- drivers/virtio/virtio_ring.c | 199 +++++++++++++++++++++++++++++++++--- include/linux/virtio.h | 20 ++++ include/uapi/linux/virtio_balloon.h | 1 + 4 files changed, 329 insertions(+), 32 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f0b3a0b..aa4e7ec 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -32,6 +32,7 @@ #include <linux/mm.h> #include <linux/mount.h> #include <linux/magic.h> +#include <linux/xbitmap.h> /* * Balloon device works in 4K page units. So each page is pointed to by @@ -79,6 +80,9 @@ struct virtio_balloon { /* Synchronize access/update to this struct virtio_balloon elements */ struct mutex balloon_lock; + /* The xbitmap used to record ballooned pages */ + struct xb page_xb; + /* The array of pfns we tell the Host about. */ unsigned int num_pfns; __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb, page_to_balloon_pfn(page) + i); } +/* + * Send balloon pages in sgs to host. + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for + * continuous "1" bits, which correspond to continuous pages, to chunk into + * sgs. + * + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that + * need to be serached. 
+ */ +static void tell_host_sgs(struct virtio_balloon *vb, + struct virtqueue *vq, + unsigned long page_xb_start, + unsigned long page_xb_end) +{ + unsigned int head_id = VIRTQUEUE_DESC_ID_INIT, + prev_id = VIRTQUEUE_DESC_ID_INIT; + unsigned long sg_pfn_start, sg_pfn_end; + uint64_t sg_addr; + uint32_t sg_size; + + sg_pfn_start = page_xb_start; + while (sg_pfn_start < page_xb_end) { + sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start, + page_xb_end, 1); + if (sg_pfn_start == page_xb_end + 1) + break; + sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1, + page_xb_end, 0); + sg_addr = sg_pfn_start << PAGE_SHIFT; + sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE; + virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id, + &prev_id, 0); + xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end); + sg_pfn_start = sg_pfn_end + 1; + } + + if (head_id != VIRTQUEUE_DESC_ID_INIT) { + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); + virtqueue_kick_async(vq, vb->acked); + } +} + +/* Update pfn_max and pfn_min according to the pfn of @page */ +static inline void update_pfn_range(struct virtio_balloon *vb, + struct page *page, + unsigned long *pfn_min, + unsigned long *pfn_max) +{ + unsigned long pfn = page_to_pfn(page); + + *pfn_min = min(pfn, *pfn_min); + *pfn_max = max(pfn, *pfn_max); +} + static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) { struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; unsigned num_allocated_pages; + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; /* We can only do one array worth at a time. */ - num = min(num, ARRAY_SIZE(vb->pfns)); + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(&vb->balloon_lock); for (vb->num_pfns = 0; vb->num_pfns < num; @@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) msleep(200); break; } - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (use_sg) { + update_pfn_range(vb, page, &pfn_min, &pfn_max); + xb_set_bit(&vb->page_xb, page_to_pfn(page)); + } else { + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + } vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) @@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) num_allocated_pages = vb->num_pfns; /* Did we get any? */ - if (vb->num_pfns != 0) - tell_host(vb, vb->inflate_vq); + if (vb->num_pfns != 0) { + if (use_sg) + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->inflate_vq); + } mutex_unlock(&vb->balloon_lock); return num_allocated_pages; @@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) struct page *page; struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; LIST_HEAD(pages); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; - /* We can only do one array worth at a time. */ - num = min(num, ARRAY_SIZE(vb->pfns)); + /* Traditionally, we can only do one array worth at a time. 
*/ + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(&vb->balloon_lock); /* We can't release more pages than taken */ @@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) page = balloon_page_dequeue(vb_dev_info); if (!page) break; - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (use_sg) { + update_pfn_range(vb, page, &pfn_min, &pfn_max); + xb_set_bit(&vb->page_xb, page_to_pfn(page)); + } else { + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + } list_add(&page->lru, &pages); vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } @@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - if (vb->num_pfns != 0) - tell_host(vb, vb->deflate_vq); + if (vb->num_pfns != 0) { + if (use_sg) + tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->deflate_vq); + } release_pages_balloon(vb, &pages); mutex_unlock(&vb->balloon_lock); return num_freed_pages; @@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb) } #ifdef CONFIG_BALLOON_COMPACTION + +static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, + struct page *page) +{ + unsigned int id = VIRTQUEUE_DESC_ID_INIT; + u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; + + virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); + virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); + virtqueue_kick_async(vq, vb->acked); +} + /* * virtballoon_migratepage - perform the balloon page migration on behalf of * a compation thread. (called under page lock) @@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, { struct virtio_balloon *vb = container_of(vb_dev_info, struct virtio_balloon, vb_dev_info); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); unsigned long flags; /* @@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, vb_dev_info->isolated_pages--; __count_vm_event(BALLOON_MIGRATE); spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, newpage); - tell_host(vb, vb->inflate_vq); - + if (use_sg) { + tell_host_one_page(vb, vb->inflate_vq, newpage); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, newpage); + tell_host(vb, vb->inflate_vq); + } /* balloon's page migration 2nd step -- deflate "page" */ balloon_page_delete(page); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, page); - tell_host(vb, vb->deflate_vq); - + if (use_sg) { + tell_host_one_page(vb, vb->deflate_vq, page); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, page); + tell_host(vb, vb->deflate_vq); + } mutex_unlock(&vb->balloon_lock); put_page(page); /* balloon reference */ @@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) + xb_init(&vb->page_xb); + vb->nb.notifier_call = virtballoon_oom_notify; vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; err = register_oom_notifier(&vb->nb); @@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev) cancel_work_sync(&vb->update_balloon_size_work); cancel_work_sync(&vb->update_balloon_stats_work); + xb_empty(&vb->page_xb); remove_common(vb); #ifdef 
CONFIG_BALLOON_COMPACTION if (vb->vb_dev_info.inode) @@ -669,6 +775,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, + VIRTIO_BALLOON_F_SG, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 5e1b548..b9d7e10 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, struct vring_virtqueue *vq = to_vvq(_vq); struct scatterlist *sg; struct vring_desc *desc; - unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; + unsigned int i, n, descs_used, uninitialized_var(prev), err_id; int head; bool indirect; @@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq, else vq->free_head = i; - /* Store token and indirect buffer state. */ + END_USE(vq); + + return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx); + +unmap_release: + err_id = i; + i = head; + + for (n = 0; n < total_sg; n++) { + if (i == err_id) + break; + vring_unmap_one(vq, &desc[i]); + i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); + } + + vq->vq.num_free += total_sg; + + if (indirect) + kfree(desc); + + END_USE(vq); + return -EIO; +} + +/** + * virtqueue_add_chain - expose a chain of buffers to the other end + * @_vq: the struct virtqueue we're talking about. + * @head: desc id of the chain head. + * @indirect: set if the chain of descs are indrect descs. + * @indir_desc: the first indirect desc. + * @data: the token identifying the chain. + * @ctx: extra context for the token. + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + * + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). + */ +int virtqueue_add_chain(struct virtqueue *_vq, + unsigned int head, + bool indirect, + struct vring_desc *indir_desc, + void *data, + void *ctx) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + unsigned int avail; + + /* The desc chain is empty. */ + if (head == VIRTQUEUE_DESC_ID_INIT) + return 0; + + START_USE(vq); + + if (unlikely(vq->broken)) { + END_USE(vq); + return -EIO; + } + vq->desc_state[head].data = data; if (indirect) - vq->desc_state[head].indir_desc = desc; + vq->desc_state[head].indir_desc = indir_desc; if (ctx) vq->desc_state[head].indir_desc = ctx; @@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq, virtqueue_kick(_vq); return 0; +} +EXPORT_SYMBOL_GPL(virtqueue_add_chain); -unmap_release: - err_idx = i; - i = head; +/** + * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc + * @vq: the struct virtqueue we're talking about. + * @addr: address of the buffer to add. + * @len: length of the buffer. + * @head_id: desc id of the chain head. + * @prev_id: desc id of the previous buffer. + * @in: set if the buffer is for the device to write. + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + * + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). + */ +int virtqueue_add_chain_desc(struct virtqueue *_vq, + uint64_t addr, + uint32_t len, + unsigned int *head_id, + unsigned int *prev_id, + bool in) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + struct vring_desc *desc = vq->vring.desc; + uint16_t flags = in ? 
VRING_DESC_F_WRITE : 0; + unsigned int i; - for (n = 0; n < total_sg; n++) { - if (i == err_idx) - break; - vring_unmap_one(vq, &desc[i]); - i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); + /* Sanity check */ + if (!_vq || !head_id || !prev_id) + return -EINVAL; +retry: + START_USE(vq); + if (unlikely(vq->broken)) { + END_USE(vq); + return -EIO; } - vq->vq.num_free += total_sg; + if (vq->vq.num_free < 1) { + /* + * If there is no desc avail in the vq, so kick what is + * already added, and re-start to build a new chain for + * the passed sg. + */ + if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) { + END_USE(vq); + virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL); + virtqueue_kick_sync(_vq); + *head_id = VIRTQUEUE_DESC_ID_INIT; + *prev_id = VIRTQUEUE_DESC_ID_INIT; + goto retry; + } else { + END_USE(vq); + return -ENOSPC; + } + } - if (indirect) - kfree(desc); + i = vq->free_head; + flags &= ~VRING_DESC_F_NEXT; + desc[i].flags = cpu_to_virtio16(_vq->vdev, flags); + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); + desc[i].len = cpu_to_virtio32(_vq->vdev, len); + + /* Add the desc to the end of the chain */ + if (*prev_id != VIRTQUEUE_DESC_ID_INIT) { + desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i); + desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev, + VRING_DESC_F_NEXT); + } + *prev_id = i; + if (*head_id == VIRTQUEUE_DESC_ID_INIT) + *head_id = *prev_id; + vq->vq.num_free--; + vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next); END_USE(vq); - return -EIO; + + return 0; } +EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc); /** * virtqueue_add_sgs - expose buffers to other end @@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq) } EXPORT_SYMBOL_GPL(virtqueue_kick); +/** + * virtqueue_kick_sync - update after add_buf and busy wait till update is done + * @vq: the struct virtqueue + * + * After one or more virtqueue_add_* calls, invoke this to kick + * the other side. Busy wait till the other side is done with the update. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + * + * Returns false if kick failed, otherwise true. + */ +bool virtqueue_kick_sync(struct virtqueue *vq) +{ + u32 len; + + if (likely(virtqueue_kick(vq))) { + while (!virtqueue_get_buf(vq, &len) && + !virtqueue_is_broken(vq)) + cpu_relax(); + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(virtqueue_kick_sync); + +/** + * virtqueue_kick_async - update after add_buf and blocking till update is done + * @vq: the struct virtqueue + * + * After one or more virtqueue_add_* calls, invoke this to kick + * the other side. Blocking till the other side is done with the update. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + * + * Returns false if kick failed, otherwise true. 
+ */ +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq) +{ + u32 len; + + if (likely(virtqueue_kick(vq))) { + wait_event(wq, virtqueue_get_buf(vq, &len)); + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(virtqueue_kick_async); + static void detach_buf(struct vring_virtqueue *vq, unsigned int head, void **ctx) { diff --git a/include/linux/virtio.h b/include/linux/virtio.h index 28b0e96..9f27101 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, void *data, gfp_t gfp); +/* A desc with this init id is treated as an invalid desc */ +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX +int virtqueue_add_chain_desc(struct virtqueue *_vq, + uint64_t addr, + uint32_t len, + unsigned int *head_id, + unsigned int *prev_id, + bool in); + +int virtqueue_add_chain(struct virtqueue *_vq, + unsigned int head, + bool indirect, + struct vring_desc *indirect_desc, + void *data, + void *ctx); + bool virtqueue_kick(struct virtqueue *vq); +bool virtqueue_kick_sync(struct virtqueue *vq); + +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); + bool virtqueue_kick_prepare(struct virtqueue *vq); bool virtqueue_notify(struct virtqueue *vq); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 343d7dd..37780a7 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -34,6 +34,7 @@ #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ +#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 -- 2.7.4
This patch adds support for reporting blocks of pages on the free list specified by the caller. As pages can leave the free list during this call or immediately afterwards, they are not guaranteed to be free after the function returns. The only guarantee this makes is that the page was on the free list at some point in time after the function has been invoked. Therefore, it is not safe for caller to use any pages on the returned block or to discard data that is put there after the function returns. However, it is safe for caller to discard data that was in one of these pages before the function was invoked. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Signed-off-by: Liang Li <liang.z.li at intel.com> --- include/linux/mm.h | 5 +++ mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 101 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 46b9ac5..76cb433 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, unsigned long zone_start_pfn, unsigned long *zholes_size); extern void free_initmem(void); +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) +extern int report_unused_page_block(struct zone *zone, unsigned int order, + unsigned int migratetype, + struct page **page); +#endif /* * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) * into the buddy system. The freed pages will be poisoned with pattern diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 64b7d82..8b3c9dd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) show_swap_cache_info(); } +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) + +/* + * Heuristically get a page block in the system that is unused. + * It is possible that pages from the page block are used immediately after + * report_unused_page_block() returns. It is the caller's responsibility + * to either detect or prevent the use of such pages. + * + * The free list to check: zone->free_area[order].free_list[migratetype]. + * + * If the caller supplied page block (i.e. **page) is on the free list, offer + * the next page block on the list to the caller. Otherwise, offer the first + * page block on the list. + * + * Note: it is not safe for caller to use any pages on the returned + * block or to discard data that is put there after the function returns. + * However, it is safe for caller to discard data that was in one of these + * pages before the function was invoked. + * + * Return 0 when a page block is found on the caller specified free list. + */ +int report_unused_page_block(struct zone *zone, unsigned int order, + unsigned int migratetype, struct page **page) +{ + struct zone *this_zone; + struct list_head *this_list; + int ret = 0; + unsigned long flags; + + /* Sanity check */ + if (zone == NULL || page == NULL || order >= MAX_ORDER || + migratetype >= MIGRATE_TYPES) + return -EINVAL; + + /* Zone validity check */ + for_each_populated_zone(this_zone) { + if (zone == this_zone) + break; + } + + /* Got a non-existent zone from the caller? 
*/ + if (zone != this_zone) + return -EINVAL; + + spin_lock_irqsave(&this_zone->lock, flags); + + this_list = &zone->free_area[order].free_list[migratetype]; + if (list_empty(this_list)) { + *page = NULL; + ret = 1; + goto out; + } + + /* The caller is asking for the first free page block on the list */ + if ((*page) == NULL) { + *page = list_first_entry(this_list, struct page, lru); + ret = 0; + goto out; + } + + /* + * The page block passed from the caller is not on this free list + * anymore (e.g. a 1MB free page block has been split). In this case, + * offer the first page block on the free list that the caller is + * asking for. + */ + if (PageBuddy(*page) && order != page_order(*page)) { + *page = list_first_entry(this_list, struct page, lru); + ret = 0; + goto out; + } + + /* + * The page block passed from the caller has been the last page block + * on the list. + */ + if ((*page)->lru.next == this_list) { + *page = NULL; + ret = 1; + goto out; + } + + /* + * Finally, fall into the regular case: the page block passed from the + * caller is still on the free list. Offer the next one. + */ + *page = list_next_entry((*page), lru); + ret = 0; +out: + spin_unlock_irqrestore(&this_zone->lock, flags); + return ret; +} +EXPORT_SYMBOL(report_unused_page_block); + +#endif + static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) { zoneref->zone = zone; -- 2.7.4
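For illustration, the calling convention looks like this from the consumer side (the function and callback names are invented for the example; patch 8/8 uses the same loop to fill the cmdq):

#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Hand every block currently on one free list to report_cb(). The blocks
 * may be re-used by the allocator at any time afterwards, so the consumer
 * must tolerate stale contents.
 */
static void example_walk_free_list(struct zone *zone, unsigned int order,
				   unsigned int migratetype,
				   void (*report_cb)(unsigned long pfn,
						     unsigned long nr_pages))
{
	struct page *page = NULL;

	/* NULL asks for the first block; each call advances to the next one. */
	while (!report_unused_page_block(zone, order, migratetype, &page))
		report_cb(page_to_pfn(page), 1UL << order);
}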
Wei Wang
2017-Jul-12 12:40 UTC
[PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat
This patch enables for_each_zone()/for_each_populated_zone() to be invoked by a kernel module. Signed-off-by: Wei Wang <wei.w.wang at intel.com> --- mm/mmzone.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/mmzone.c b/mm/mmzone.c index a51c0a6..08a2a3a 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void) { return NODE_DATA(first_online_node); } +EXPORT_SYMBOL_GPL(first_online_pgdat); struct pglist_data *next_online_pgdat(struct pglist_data *pgdat) { @@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone) } return zone; } +EXPORT_SYMBOL_GPL(next_zone); static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes) { -- 2.7.4
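For illustration, this is the kind of module-side loop the export enables (a trivial sketch; the function name is made up):

#include <linux/mmzone.h>
#include <linux/module.h>

/*
 * for_each_populated_zone() expands to first_online_pgdat()/next_zone(),
 * so with the two symbols exported it now works from module context too.
 */
static unsigned int example_count_populated_zones(void)
{
	struct zone *zone;
	unsigned int nr = 0;

	for_each_populated_zone(zone)
		nr++;
	return nr;
}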
Add a new vq, cmdq, to handle requests between the device and driver. This patch implements two commands sent from the device and handled in the driver. 1) VIRTIO_BALLOON_CMDQ_REPORT_STATS: this command is used to report the guest memory statistics to the host. The stats_vq mechanism is not used when the cmdq mechanism is enabled. 2) VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: this command is used to report the guest unused pages to the host. Since now we have a vq to handle multiple commands, we need to keep only one vq operation at a time. Here, we change the existing START_USE() and END_USE() to lock on each vq operation. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Signed-off-by: Liang Li <liang.z.li at intel.com> --- drivers/virtio/virtio_balloon.c | 245 ++++++++++++++++++++++++++++++++++-- drivers/virtio/virtio_ring.c | 25 +++- include/linux/virtio.h | 2 + include/uapi/linux/virtio_balloon.h | 10 ++ 4 files changed, 265 insertions(+), 17 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index aa4e7ec..ae91fbf 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt; struct virtio_balloon { struct virtio_device *vdev; - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *cmd_vq; /* The balloon servicing is delegated to a freezable workqueue. */ struct work_struct update_balloon_stats_work; struct work_struct update_balloon_size_work; + struct work_struct cmdq_handle_work; /* Prevent updating balloon when it is being canceled. */ spinlock_t stop_update_lock; @@ -90,6 +91,12 @@ struct virtio_balloon { /* Memory statistics */ struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; + /* Cmdq msg buffer for memory statistics */ + struct virtio_balloon_cmdq_hdr cmdq_stats_hdr; + + /* Cmdq msg buffer for reporting ununsed pages */ + struct virtio_balloon_cmdq_hdr cmdq_unused_page_hdr; + /* To register callback in oom notifier call chain */ struct notifier_block nb; }; @@ -485,25 +492,214 @@ static void update_balloon_size_func(struct work_struct *work) queue_work(system_freezable_wq, work); } +static unsigned int cmdq_hdr_add(struct virtqueue *vq, + struct virtio_balloon_cmdq_hdr *hdr, + bool in) +{ + unsigned int id = VIRTQUEUE_DESC_ID_INIT; + uint64_t hdr_pa = (uint64_t)virt_to_phys((void *)hdr); + + virtqueue_add_chain_desc(vq, hdr_pa, sizeof(*hdr), &id, &id, in); + + /* Deliver the hdr for the host to send commands. */ + if (in) { + hdr->flags = 0; + virtqueue_add_chain(vq, id, 0, NULL, hdr, NULL); + virtqueue_kick(vq); + } + + return id; +} + +static void cmdq_add_chain_desc(struct virtio_balloon *vb, + struct virtio_balloon_cmdq_hdr *hdr, + uint64_t addr, + uint32_t len, + unsigned int *head_id, + unsigned int *prev_id) +{ +retry: + if (*head_id == VIRTQUEUE_DESC_ID_INIT) { + *head_id = cmdq_hdr_add(vb->cmd_vq, hdr, 0); + *prev_id = *head_id; + } + + virtqueue_add_chain_desc(vb->cmd_vq, addr, len, head_id, prev_id, 0); + if (*head_id == *prev_id) { + /* + * The VQ was full and kicked to release some descs. Now we + * will re-start to build the chain by using the hdr as the + * first desc, so we need to detach the desc that was just + * added, and re-start to add the hdr. 
+ */ + virtqueue_detach_buf(vb->cmd_vq, *head_id, NULL); + *head_id = VIRTQUEUE_DESC_ID_INIT; + *prev_id = VIRTQUEUE_DESC_ID_INIT; + goto retry; + } +} + +static void cmdq_handle_stats(struct virtio_balloon *vb) +{ + unsigned int num_stats, + head_id = VIRTQUEUE_DESC_ID_INIT, + prev_id = VIRTQUEUE_DESC_ID_INIT; + uint64_t addr = (uint64_t)virt_to_phys((void *)vb->stats); + uint32_t len; + + spin_lock(&vb->stop_update_lock); + if (!vb->stop_update) { + num_stats = update_balloon_stats(vb); + len = sizeof(struct virtio_balloon_stat) * num_stats; + cmdq_add_chain_desc(vb, &vb->cmdq_stats_hdr, addr, len, + &head_id, &prev_id); + virtqueue_add_chain(vb->cmd_vq, head_id, 0, NULL, vb, NULL); + virtqueue_kick_sync(vb->cmd_vq); + } + spin_unlock(&vb->stop_update_lock); +} + +static void cmdq_add_unused_page(struct virtio_balloon *vb, + struct zone *zone, + unsigned int order, + unsigned int type, + struct page *page, + unsigned int *head_id, + unsigned int *prev_id) +{ + uint64_t addr; + uint32_t len; + + while (!report_unused_page_block(zone, order, type, &page)) { + addr = (u64)page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; + len = (u64)(1 << order) << VIRTIO_BALLOON_PFN_SHIFT; + cmdq_add_chain_desc(vb, &vb->cmdq_unused_page_hdr, addr, len, + head_id, prev_id); + } +} + +static void cmdq_handle_unused_pages(struct virtio_balloon *vb) +{ + struct virtqueue *vq = vb->cmd_vq; + unsigned int order = 0, type = 0, + head_id = VIRTQUEUE_DESC_ID_INIT, + prev_id = VIRTQUEUE_DESC_ID_INIT; + struct zone *zone = NULL; + struct page *page = NULL; + + for_each_populated_zone(zone) + for_each_migratetype_order(order, type) + cmdq_add_unused_page(vb, zone, order, type, page, + &head_id, &prev_id); + + /* Set the cmd completion flag. */ + vb->cmdq_unused_page_hdr.flags |+ cpu_to_le32(VIRTIO_BALLOON_CMDQ_F_COMPLETION); + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); + virtqueue_kick_sync(vb->cmd_vq); +} + +static void cmdq_handle(struct virtio_balloon *vb) +{ + struct virtio_balloon_cmdq_hdr *hdr; + unsigned int len; + + while ((hdr = (struct virtio_balloon_cmdq_hdr *) + virtqueue_get_buf(vb->cmd_vq, &len)) != NULL) { + switch (__le32_to_cpu(hdr->cmd)) { + case VIRTIO_BALLOON_CMDQ_REPORT_STATS: + cmdq_handle_stats(vb); + break; + case VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: + cmdq_handle_unused_pages(vb); + break; + default: + dev_warn(&vb->vdev->dev, "%s: wrong cmd\n", __func__); + return; + } + /* + * Replenish all the command buffer to the device after a + * command is handled. This is for the convenience of the + * device to rewind the cmdq to get back all the command + * buffer after live migration. 
+ */ + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1); + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1); + } +} + +static void cmdq_handle_work_func(struct work_struct *work) +{ + struct virtio_balloon *vb; + + vb = container_of(work, struct virtio_balloon, + cmdq_handle_work); + cmdq_handle(vb); +} + +static void cmdq_callback(struct virtqueue *vq) +{ + struct virtio_balloon *vb = vq->vdev->priv; + + queue_work(system_freezable_wq, &vb->cmdq_handle_work); +} + static int init_vqs(struct virtio_balloon *vb) { - struct virtqueue *vqs[3]; - vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request }; - static const char * const names[] = { "inflate", "deflate", "stats" }; - int err, nvqs; + struct virtqueue **vqs; + vq_callback_t **callbacks; + const char **names; + int err = -ENOMEM; + int nvqs; + + /* Inflateq and deflateq are used unconditionally */ + nvqs = 2; + + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ) || + virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) + nvqs++; + + /* Allocate space for find_vqs parameters */ + vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL); + if (!vqs) + goto err_vq; + callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL); + if (!callbacks) + goto err_callback; + names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL); + if (!names) + goto err_names; + + callbacks[0] = balloon_ack; + names[0] = "inflate"; + callbacks[1] = balloon_ack; + names[1] = "deflate"; /* - * We expect two virtqueues: inflate and deflate, and - * optionally stat. + * The stats_vq is used only when cmdq is not supported (or disabled) + * by the device. */ - nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; - err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); - if (err) - return err; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) { + callbacks[2] = cmdq_callback; + names[2] = "cmdq"; + } else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { + callbacks[2] = stats_request; + names[2] = "stats"; + } + err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, + names, NULL, NULL); + if (err) + goto err_find; vb->inflate_vq = vqs[0]; vb->deflate_vq = vqs[1]; - if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { + + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) { + vb->cmd_vq = vqs[2]; + /* Prime the cmdq with the header buffer. 
*/ + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1); + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1); + } else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { struct scatterlist sg; unsigned int num_stats; vb->stats_vq = vqs[2]; @@ -520,6 +716,16 @@ static int init_vqs(struct virtio_balloon *vb) BUG(); virtqueue_kick(vb->stats_vq); } + +err_find: + kfree(names); +err_names: + kfree(callbacks); +err_callback: + kfree(vqs); +err_vq: + return err; + return 0; } @@ -640,7 +846,18 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out; } - INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func); + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CMD_VQ)) { + vb->cmdq_stats_hdr.cmd + cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_STATS); + vb->cmdq_stats_hdr.flags = 0; + vb->cmdq_unused_page_hdr.cmd + cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES); + vb->cmdq_unused_page_hdr.flags = 0; + INIT_WORK(&vb->cmdq_handle_work, cmdq_handle_work_func); + } else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { + INIT_WORK(&vb->update_balloon_stats_work, + update_balloon_stats_func); + } INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func); spin_lock_init(&vb->stop_update_lock); vb->stop_update = false; @@ -722,6 +939,7 @@ static void virtballoon_remove(struct virtio_device *vdev) spin_unlock_irq(&vb->stop_update_lock); cancel_work_sync(&vb->update_balloon_size_work); cancel_work_sync(&vb->update_balloon_stats_work); + cancel_work_sync(&vb->cmdq_handle_work); xb_empty(&vb->page_xb); remove_common(vb); @@ -776,6 +994,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, VIRTIO_BALLOON_F_SG, + VIRTIO_BALLOON_F_CMD_VQ, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index b9d7e10..793de12 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -52,8 +52,13 @@ "%s:"fmt, (_vq)->vq.name, ##args); \ (_vq)->broken = true; \ } while (0) -#define START_USE(vq) -#define END_USE(vq) +#define START_USE(_vq) \ + do { \ + while ((_vq)->in_use) \ + cpu_relax(); \ + (_vq)->in_use = __LINE__; \ + } while (0) +#define END_USE(_vq) ((_vq)->in_use = 0) #endif struct vring_desc_state { @@ -101,9 +106,9 @@ struct vring_virtqueue { size_t queue_size_in_bytes; dma_addr_t queue_dma_addr; -#ifdef DEBUG /* They're supposed to lock for us. */ unsigned int in_use; +#ifdef DEBUG /* Figure out if their kicks are too delayed. 
*/ bool last_add_time_valid; @@ -845,6 +850,18 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head, } } +void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + + START_USE(vq); + + detach_buf(vq, head, ctx); + + END_USE(vq); +} +EXPORT_SYMBOL_GPL(virtqueue_detach_buf); + static inline bool more_used(const struct vring_virtqueue *vq) { return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx); @@ -1158,8 +1175,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index, vq->avail_idx_shadow = 0; vq->num_added = 0; list_add_tail(&vq->vq.list, &vdev->vqs); + vq->in_use = 0; #ifdef DEBUG - vq->in_use = false; vq->last_add_time_valid = false; #endif diff --git a/include/linux/virtio.h b/include/linux/virtio.h index 9f27101..9df480b 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -88,6 +88,8 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len); void *virtqueue_get_buf_ctx(struct virtqueue *vq, unsigned int *len, void **ctx); +void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx); + void virtqueue_disable_cb(struct virtqueue *vq); bool virtqueue_enable_cb(struct virtqueue *vq); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 37780a7..b38c370 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -35,6 +35,7 @@ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ #define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ +#define VIRTIO_BALLOON_F_CMD_VQ 4 /* Command virtqueue */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 @@ -83,4 +84,13 @@ struct virtio_balloon_stat { __virtio64 val; } __attribute__((packed)); +struct virtio_balloon_cmdq_hdr { +#define VIRTIO_BALLOON_CMDQ_REPORT_STATS 0 +#define VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES 1 + __le32 cmd; +/* Flag to indicate the completion of handling a command */ +#define VIRTIO_BALLOON_CMDQ_F_COMPLETION 1 + __le32 flags; +}; + #endif /* _LINUX_VIRTIO_BALLOON_H */ -- 2.7.4
Michael S. Tsirkin
2017-Jul-12 13:06 UTC
[PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:> diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 28b0e96..9f27101 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > void *data, > gfp_t gfp); > > +/* A desc with this init id is treated as an invalid desc */ > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in); > + > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indirect_desc, > + void *data, > + void *ctx); > + > bool virtqueue_kick(struct virtqueue *vq); > > +bool virtqueue_kick_sync(struct virtqueue *vq); > + > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > + > bool virtqueue_kick_prepare(struct virtqueue *vq); > > bool virtqueue_notify(struct virtqueue *vq);I don't much care for this API. It does exactly what balloon needs, but at cost of e.g. transparently busy-waiting. Unlikely to be a good fit for anything else. If you don't like my original _first/_next/_last, you will need to come up with something else. -- MST
On Wed, Jul 12, 2017 at 08:40:13PM +0800, Wei Wang wrote:> This patch series enhances the existing virtio-balloon with the following new > features: > 1) fast ballooning: transfer ballooned pages between the guest and host in > chunks using sgs, instead of one by one; and > 2) cmdq: a new virtqueue to send commands between the device and driver. > Currently, it supports commands to report memory stats (replace the old statq > mechanism) and report guest unused pages.Could we get some feedback from mm crowd on patches 6 and 7?> Change Log: > > v11->v12: > 1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages. > 2) virtio-ring: enable the driver to build up a desc chain using vring desc. > 3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro > to lock/unlock the vq when a vq operation starts/ends. > 4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async() > 5) virtio-balloon: describe chunks of ballooned pages and free pages blocks > directly using one or more chains of desc from the vq. > > v10->v11: > 1) virtio_balloon: use vring_desc to describe a chunk; > 2) virtio_ring: support to add an indirect desc table to virtqueue; > 3) virtio_balloon: use cmdq to report guest memory statistics. > > v9->v10: > 1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON; > 2) virtio-balloon: add virtballoon_validate(); > 3) virtio-balloon: msg format change; > 4) virtio-balloon: move miscq handling to a task on system_freezable_wq; > 5) virtio-balloon: code cleanup. > > v8->v9: > 1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and > VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous > implementation; > 2) Simpler function to get the free page block. > > v7->v8: > 1) Use only one chunk format, instead of two. > 2) re-write the virtio-balloon implementation patch. > 3) commit changes > 4) patch re-org > > Liang Li (1): > virtio-balloon: deflate via a page list > > Matthew Wilcox (1): > Introduce xbitmap > > Wei Wang (6): > virtio-balloon: coding format cleanup > xbitmap: add xb_find_next_bit() and xb_zero() > virtio-balloon: VIRTIO_BALLOON_F_SG > mm: support reporting free page blocks > mm: export symbol of next_zone and first_online_pgdat > virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ > > drivers/virtio/virtio_balloon.c | 414 ++++++++++++++++++++++++++++++++---- > drivers/virtio/virtio_ring.c | 224 +++++++++++++++++-- > include/linux/mm.h | 5 + > include/linux/radix-tree.h | 2 + > include/linux/virtio.h | 22 ++ > include/linux/xbitmap.h | 53 +++++ > include/uapi/linux/virtio_balloon.h | 11 + > lib/radix-tree.c | 164 +++++++++++++- > mm/mmzone.c | 2 + > mm/page_alloc.c | 96 +++++++++ > 10 files changed, 926 insertions(+), 67 deletions(-) > create mode 100644 include/linux/xbitmap.h > > -- > 2.7.4
Michael S. Tsirkin
2017-Jul-13 00:16 UTC
[PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat
On Wed, Jul 12, 2017 at 08:40:20PM +0800, Wei Wang wrote:> This patch enables for_each_zone()/for_each_populated_zone() to be > invoked by a kernel module.... for use by virtio balloon.> Signed-off-by: Wei Wang <wei.w.wang at intel.com>balloon seems to only use + for_each_populated_zone(zone) + for_each_migratetype_order(order, type)> --- > mm/mmzone.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/mmzone.c b/mm/mmzone.c > index a51c0a6..08a2a3a 100644 > --- a/mm/mmzone.c > +++ b/mm/mmzone.c > @@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void) > { > return NODE_DATA(first_online_node); > } > +EXPORT_SYMBOL_GPL(first_online_pgdat); > > struct pglist_data *next_online_pgdat(struct pglist_data *pgdat) > { > @@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone) > } > return zone; > } > +EXPORT_SYMBOL_GPL(next_zone); > > static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes) > { > -- > 2.7.4
Michael S. Tsirkin
2017-Jul-13 00:22 UTC
[PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
On Wed, Jul 12, 2017 at 08:40:21PM +0800, Wei Wang wrote:> Add a new vq, cmdq, to handle requests between the device and driver. > > This patch implements two commands sent from the device and handled in > the driver. > 1) VIRTIO_BALLOON_CMDQ_REPORT_STATS: this command is used to report > the guest memory statistics to the host. The stats_vq mechanism is not > used when the cmdq mechanism is enabled. > 2) VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: this command is used to > report the guest unused pages to the host. > > Since now we have a vq to handle multiple commands, we need to keep only > one vq operation at a time. Here, we change the existing START_USE() > and END_USE() to lock on each vq operation. > > Signed-off-by: Wei Wang <wei.w.wang at intel.com> > Signed-off-by: Liang Li <liang.z.li at intel.com> > --- > drivers/virtio/virtio_balloon.c | 245 ++++++++++++++++++++++++++++++++++-- > drivers/virtio/virtio_ring.c | 25 +++- > include/linux/virtio.h | 2 + > include/uapi/linux/virtio_balloon.h | 10 ++ > 4 files changed, 265 insertions(+), 17 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index aa4e7ec..ae91fbf 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt; > > struct virtio_balloon { > struct virtio_device *vdev; > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *cmd_vq; > > /* The balloon servicing is delegated to a freezable workqueue. */ > struct work_struct update_balloon_stats_work; > struct work_struct update_balloon_size_work; > + struct work_struct cmdq_handle_work; > > /* Prevent updating balloon when it is being canceled. */ > spinlock_t stop_update_lock; > @@ -90,6 +91,12 @@ struct virtio_balloon { > /* Memory statistics */ > struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; > > + /* Cmdq msg buffer for memory statistics */ > + struct virtio_balloon_cmdq_hdr cmdq_stats_hdr; > + > + /* Cmdq msg buffer for reporting ununsed pages */ > + struct virtio_balloon_cmdq_hdr cmdq_unused_page_hdr; > + > /* To register callback in oom notifier call chain */ > struct notifier_block nb; > }; > @@ -485,25 +492,214 @@ static void update_balloon_size_func(struct work_struct *work) > queue_work(system_freezable_wq, work); > } > > +static unsigned int cmdq_hdr_add(struct virtqueue *vq, > + struct virtio_balloon_cmdq_hdr *hdr, > + bool in) > +{ > + unsigned int id = VIRTQUEUE_DESC_ID_INIT; > + uint64_t hdr_pa = (uint64_t)virt_to_phys((void *)hdr); > + > + virtqueue_add_chain_desc(vq, hdr_pa, sizeof(*hdr), &id, &id, in); > + > + /* Deliver the hdr for the host to send commands. */ > + if (in) { > + hdr->flags = 0; > + virtqueue_add_chain(vq, id, 0, NULL, hdr, NULL); > + virtqueue_kick(vq); > + } > + > + return id; > +} > + > +static void cmdq_add_chain_desc(struct virtio_balloon *vb, > + struct virtio_balloon_cmdq_hdr *hdr, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id) > +{ > +retry: > + if (*head_id == VIRTQUEUE_DESC_ID_INIT) { > + *head_id = cmdq_hdr_add(vb->cmd_vq, hdr, 0); > + *prev_id = *head_id; > + } > + > + virtqueue_add_chain_desc(vb->cmd_vq, addr, len, head_id, prev_id, 0); > + if (*head_id == *prev_id) {That's an ugly way to detect ring full.> + /* > + * The VQ was full and kicked to release some descs. 
Now we > + * will re-start to build the chain by using the hdr as the > + * first desc, so we need to detach the desc that was just > + * added, and re-start to add the hdr. > + */ > + virtqueue_detach_buf(vb->cmd_vq, *head_id, NULL); > + *head_id = VIRTQUEUE_DESC_ID_INIT; > + *prev_id = VIRTQUEUE_DESC_ID_INIT; > + goto retry; > + } > +} > + > +static void cmdq_handle_stats(struct virtio_balloon *vb) > +{ > + unsigned int num_stats, > + head_id = VIRTQUEUE_DESC_ID_INIT, > + prev_id = VIRTQUEUE_DESC_ID_INIT; > + uint64_t addr = (uint64_t)virt_to_phys((void *)vb->stats); > + uint32_t len; > + > + spin_lock(&vb->stop_update_lock); > + if (!vb->stop_update) { > + num_stats = update_balloon_stats(vb); > + len = sizeof(struct virtio_balloon_stat) * num_stats; > + cmdq_add_chain_desc(vb, &vb->cmdq_stats_hdr, addr, len, > + &head_id, &prev_id); > + virtqueue_add_chain(vb->cmd_vq, head_id, 0, NULL, vb, NULL); > + virtqueue_kick_sync(vb->cmd_vq); > + } > + spin_unlock(&vb->stop_update_lock); > +} > + > +static void cmdq_add_unused_page(struct virtio_balloon *vb, > + struct zone *zone, > + unsigned int order, > + unsigned int type, > + struct page *page, > + unsigned int *head_id, > + unsigned int *prev_id) > +{ > + uint64_t addr; > + uint32_t len; > + > + while (!report_unused_page_block(zone, order, type, &page)) { > + addr = (u64)page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; > + len = (u64)(1 << order) << VIRTIO_BALLOON_PFN_SHIFT; > + cmdq_add_chain_desc(vb, &vb->cmdq_unused_page_hdr, addr, len, > + head_id, prev_id); > + } > +} > + > +static void cmdq_handle_unused_pages(struct virtio_balloon *vb) > +{ > + struct virtqueue *vq = vb->cmd_vq; > + unsigned int order = 0, type = 0, > + head_id = VIRTQUEUE_DESC_ID_INIT, > + prev_id = VIRTQUEUE_DESC_ID_INIT; > + struct zone *zone = NULL; > + struct page *page = NULL; > + > + for_each_populated_zone(zone) > + for_each_migratetype_order(order, type) > + cmdq_add_unused_page(vb, zone, order, type, page, > + &head_id, &prev_id); > + > + /* Set the cmd completion flag. */ > + vb->cmdq_unused_page_hdr.flags |> + cpu_to_le32(VIRTIO_BALLOON_CMDQ_F_COMPLETION); > + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); > + virtqueue_kick_sync(vb->cmd_vq); > +} > + > +static void cmdq_handle(struct virtio_balloon *vb) > +{ > + struct virtio_balloon_cmdq_hdr *hdr; > + unsigned int len; > + > + while ((hdr = (struct virtio_balloon_cmdq_hdr *) > + virtqueue_get_buf(vb->cmd_vq, &len)) != NULL) { > + switch (__le32_to_cpu(hdr->cmd)) { > + case VIRTIO_BALLOON_CMDQ_REPORT_STATS: > + cmdq_handle_stats(vb); > + break; > + case VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: > + cmdq_handle_unused_pages(vb); > + break; > + default: > + dev_warn(&vb->vdev->dev, "%s: wrong cmd\n", __func__); > + return; > + } > + /* > + * Replenish all the command buffer to the device after a > + * command is handled. This is for the convenience of the > + * device to rewind the cmdq to get back all the command > + * buffer after live migration. 
> + */ > + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1); > + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1); > + } > +} > + > +static void cmdq_handle_work_func(struct work_struct *work) > +{ > + struct virtio_balloon *vb; > + > + vb = container_of(work, struct virtio_balloon, > + cmdq_handle_work); > + cmdq_handle(vb); > +} > + > +static void cmdq_callback(struct virtqueue *vq) > +{ > + struct virtio_balloon *vb = vq->vdev->priv; > + > + queue_work(system_freezable_wq, &vb->cmdq_handle_work); > +} > + > static int init_vqs(struct virtio_balloon *vb) > { > - struct virtqueue *vqs[3]; > - vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request }; > - static const char * const names[] = { "inflate", "deflate", "stats" }; > - int err, nvqs; > + struct virtqueue **vqs; > + vq_callback_t **callbacks; > + const char **names; > + int err = -ENOMEM; > + int nvqs; > + > + /* Inflateq and deflateq are used unconditionally */ > + nvqs = 2; > + > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ) || > + virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) > + nvqs++; > + > + /* Allocate space for find_vqs parameters */ > + vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL); > + if (!vqs) > + goto err_vq; > + callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL); > + if (!callbacks) > + goto err_callback; > + names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL); > + if (!names) > + goto err_names; > + > + callbacks[0] = balloon_ack; > + names[0] = "inflate"; > + callbacks[1] = balloon_ack; > + names[1] = "deflate"; > > /* > - * We expect two virtqueues: inflate and deflate, and > - * optionally stat. > + * The stats_vq is used only when cmdq is not supported (or disabled) > + * by the device. > */ > - nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; > - err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); > - if (err) > - return err; > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) { > + callbacks[2] = cmdq_callback; > + names[2] = "cmdq"; > + } else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > + callbacks[2] = stats_request; > + names[2] = "stats"; > + } > > + err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, > + names, NULL, NULL); > + if (err) > + goto err_find; > vb->inflate_vq = vqs[0]; > vb->deflate_vq = vqs[1]; > - if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > + > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) { > + vb->cmd_vq = vqs[2]; > + /* Prime the cmdq with the header buffer. 
*/ > + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1); > + cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1); > + } else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > struct scatterlist sg; > unsigned int num_stats; > vb->stats_vq = vqs[2]; > @@ -520,6 +716,16 @@ static int init_vqs(struct virtio_balloon *vb) > BUG(); > virtqueue_kick(vb->stats_vq); > } > + > +err_find: > + kfree(names); > +err_names: > + kfree(callbacks); > +err_callback: > + kfree(vqs); > +err_vq: > + return err; > + > return 0; > } > > @@ -640,7 +846,18 @@ static int virtballoon_probe(struct virtio_device *vdev) > goto out; > } > > - INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func); > + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CMD_VQ)) { > + vb->cmdq_stats_hdr.cmd > + cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_STATS); > + vb->cmdq_stats_hdr.flags = 0; > + vb->cmdq_unused_page_hdr.cmd > + cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES); > + vb->cmdq_unused_page_hdr.flags = 0; > + INIT_WORK(&vb->cmdq_handle_work, cmdq_handle_work_func); > + } else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > + INIT_WORK(&vb->update_balloon_stats_work, > + update_balloon_stats_func); > + } > INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func); > spin_lock_init(&vb->stop_update_lock); > vb->stop_update = false; > @@ -722,6 +939,7 @@ static void virtballoon_remove(struct virtio_device *vdev) > spin_unlock_irq(&vb->stop_update_lock); > cancel_work_sync(&vb->update_balloon_size_work); > cancel_work_sync(&vb->update_balloon_stats_work); > + cancel_work_sync(&vb->cmdq_handle_work); > > xb_empty(&vb->page_xb); > remove_common(vb); > @@ -776,6 +994,7 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > VIRTIO_BALLOON_F_SG, > + VIRTIO_BALLOON_F_CMD_VQ, > }; > > static struct virtio_driver virtio_balloon_driver = { > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index b9d7e10..793de12 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -52,8 +52,13 @@ > "%s:"fmt, (_vq)->vq.name, ##args); \ > (_vq)->broken = true; \ > } while (0) > -#define START_USE(vq) > -#define END_USE(vq) > +#define START_USE(_vq) \ > + do { \ > + while ((_vq)->in_use) \ > + cpu_relax(); \ > + (_vq)->in_use = __LINE__; \ > + } while (0) > +#define END_USE(_vq) ((_vq)->in_use = 0) > #endif > > struct vring_desc_state { > @@ -101,9 +106,9 @@ struct vring_virtqueue { > size_t queue_size_in_bytes; > dma_addr_t queue_dma_addr; > > -#ifdef DEBUG > /* They're supposed to lock for us. */ > unsigned int in_use; > +#ifdef DEBUG > > /* Figure out if their kicks are too delayed. 
*/ > bool last_add_time_valid; > @@ -845,6 +850,18 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head, > } > } > > +void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + > + START_USE(vq); > + > + detach_buf(vq, head, ctx); > + > + END_USE(vq); > +} > +EXPORT_SYMBOL_GPL(virtqueue_detach_buf); > + > static inline bool more_used(const struct vring_virtqueue *vq) > { > return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx); > @@ -1158,8 +1175,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index, > vq->avail_idx_shadow = 0; > vq->num_added = 0; > list_add_tail(&vq->vq.list, &vdev->vqs); > + vq->in_use = 0; > #ifdef DEBUG > - vq->in_use = false; > vq->last_add_time_valid = false; > #endif > > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 9f27101..9df480b 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -88,6 +88,8 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len); > void *virtqueue_get_buf_ctx(struct virtqueue *vq, unsigned int *len, > void **ctx); > > +void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx); > + > void virtqueue_disable_cb(struct virtqueue *vq); > > bool virtqueue_enable_cb(struct virtqueue *vq); > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index 37780a7..b38c370 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -35,6 +35,7 @@ > #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ > #define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ > +#define VIRTIO_BALLOON_F_CMD_VQ 4 /* Command virtqueue */ > > /* Size of a PFN in the balloon interface. */ > #define VIRTIO_BALLOON_PFN_SHIFT 12 > @@ -83,4 +84,13 @@ struct virtio_balloon_stat { > __virtio64 val; > } __attribute__((packed)); > > +struct virtio_balloon_cmdq_hdr { > +#define VIRTIO_BALLOON_CMDQ_REPORT_STATS 0 > +#define VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES 1 > + __le32 cmd; > +/* Flag to indicate the completion of handling a command */ > +#define VIRTIO_BALLOON_CMDQ_F_COMPLETION 1 > + __le32 flags; > +}; > + > #endif /* _LINUX_VIRTIO_BALLOON_H */ > -- > 2.7.4
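On the ring-full detection that the review above calls ugly: one way to make the condition explicit is to have virtqueue_add_chain_desc() report -ENOSPC and let the caller ship the partial chain and restart, instead of inferring fullness from head_id == prev_id. A sketch against the helpers this series introduces; the -ENOSPC behaviour is an assumed change to the helper, not what the posted patch does:

/*
 * Sketch: make the ring-full case explicit via the return value.
 * Assumes virtqueue_add_chain_desc() returns -ENOSPC whenever no free
 * descriptor is left (rather than kicking internally and resetting the
 * chain ids), which differs from the semantics in patch 5/8.
 */
static int cmdq_add_chain_desc(struct virtio_balloon *vb,
			       struct virtio_balloon_cmdq_hdr *hdr,
			       uint64_t addr, uint32_t len,
			       unsigned int *head_id, unsigned int *prev_id)
{
	int ret;

retry:
	if (*head_id == VIRTQUEUE_DESC_ID_INIT) {
		*head_id = cmdq_hdr_add(vb->cmd_vq, hdr, 0);
		*prev_id = *head_id;
	}

	ret = virtqueue_add_chain_desc(vb->cmd_vq, addr, len,
				       head_id, prev_id, 0);
	if (ret == -ENOSPC) {
		/* Ring full: ship what has been chained, then start over. */
		virtqueue_add_chain(vb->cmd_vq, *head_id, 0, NULL, vb, NULL);
		virtqueue_kick_sync(vb->cmd_vq);
		*head_id = VIRTQUEUE_DESC_ID_INIT;
		*prev_id = VIRTQUEUE_DESC_ID_INIT;
		goto retry;
	}

	return ret;
}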
Michael S. Tsirkin
2017-Jul-13 00:33 UTC
[PATCH v12 6/8] mm: support reporting free page blocks
On Wed, Jul 12, 2017 at 08:40:19PM +0800, Wei Wang wrote:> This patch adds support for reporting blocks of pages on the free list > specified by the caller. > > As pages can leave the free list during this call or immediately > afterwards, they are not guaranteed to be free after the function > returns. The only guarantee this makes is that the page was on the free > list at some point in time after the function has been invoked. > > Therefore, it is not safe for caller to use any pages on the returned > block or to discard data that is put there after the function returns. > However, it is safe for caller to discard data that was in one of these > pages before the function was invoked. > > Signed-off-by: Wei Wang <wei.w.wang at intel.com> > Signed-off-by: Liang Li <liang.z.li at intel.com> > --- > include/linux/mm.h | 5 +++ > mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 101 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 46b9ac5..76cb433 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > unsigned long zone_start_pfn, unsigned long *zholes_size); > extern void free_initmem(void); > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > +extern int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, > + struct page **page); > +#endif > /* > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > * into the buddy system. The freed pages will be poisoned with pattern > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 64b7d82..8b3c9dd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > show_swap_cache_info(); > } > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > + > +/* > + * Heuristically get a page block in the system that is unused. > + * It is possible that pages from the page block are used immediately after > + * report_unused_page_block() returns. It is the caller's responsibility > + * to either detect or prevent the use of such pages. > + * > + * The free list to check: zone->free_area[order].free_list[migratetype]. > + * > + * If the caller supplied page block (i.e. **page) is on the free list, offer > + * the next page block on the list to the caller. Otherwise, offer the first > + * page block on the list. > + * > + * Note: it is not safe for caller to use any pages on the returned > + * block or to discard data that is put there after the function returns. > + * However, it is safe for caller to discard data that was in one of these > + * pages before the function was invoked. > + * > + * Return 0 when a page block is found on the caller specified free list.Otherwise?> + */As an alternative, we could have an API that scans free pages and invokes a callback under a lock. Granted, this might end up staying a lot of time under a lock. Is this a big issue? Some benchmarking will tell. 
It would then be up to the hypervisor to decide whether it wants to play tricks with the dirty bit or just wants to drop pages while VCPU is stopped.> +int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, struct page **page) > +{ > + struct zone *this_zone; > + struct list_head *this_list; > + int ret = 0; > + unsigned long flags; > + > + /* Sanity check */ > + if (zone == NULL || page == NULL || order >= MAX_ORDER || > + migratetype >= MIGRATE_TYPES) > + return -EINVAL;Why do callers this?> + > + /* Zone validity check */ > + for_each_populated_zone(this_zone) { > + if (zone == this_zone) > + break; > + }Why? Will take a long time if there are lots of zones.> + > + /* Got a non-existent zone from the caller? */ > + if (zone != this_zone) > + return -EINVAL;When does this happen?> + > + spin_lock_irqsave(&this_zone->lock, flags); > + > + this_list = &zone->free_area[order].free_list[migratetype]; > + if (list_empty(this_list)) { > + *page = NULL; > + ret = 1;What does this mean?> + goto out; > + } > + > + /* The caller is asking for the first free page block on the list */ > + if ((*page) == NULL) {if (!*page) is shorter and prettier.> + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller is not on this free list > + * anymore (e.g. a 1MB free page block has been split). In this case, > + * offer the first page block on the free list that the caller is > + * asking for.This just might keep giving you same block over and over again. E.g. - get 1st block - get 2nd block - 2nd gets broken up - get 1st block again this way we might never make progress beyond the 1st 2 blocks> + */ > + if (PageBuddy(*page) && order != page_order(*page)) { > + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller has been the last page block > + * on the list. > + */ > + if ((*page)->lru.next == this_list) { > + *page = NULL; > + ret = 1; > + goto out; > + } > + > + /* > + * Finally, fall into the regular case: the page block passed from the > + * caller is still on the free list. Offer the next one. > + */ > + *page = list_next_entry((*page), lru); > + ret = 0; > +out: > + spin_unlock_irqrestore(&this_zone->lock, flags); > + return ret; > +} > +EXPORT_SYMBOL(report_unused_page_block); > + > +#endif > + > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > { > zoneref->zone = zone; > -- > 2.7.4
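Sketched out, the callback-style alternative suggested above might look like the following. The function and callback names are invented for illustration; each block is reported while zone->lock is held, which is exactly the time-under-lock trade-off mentioned:

#include <linux/mmzone.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Hypothetical callback type: receives one free block (pfn, order). */
typedef void (*free_block_cb)(void *opaque, unsigned long pfn,
			      unsigned int order);

/*
 * Walk every free list of a zone and hand each block to the callback,
 * all under zone->lock.  Nothing can leave or join the free lists while
 * the lock is held, so no "is this block still free" re-check is needed,
 * at the cost of holding the lock for the whole walk.
 */
static void walk_free_page_blocks(struct zone *zone, free_block_cb cb,
				  void *opaque)
{
	unsigned int order, type;
	struct page *page;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	for_each_migratetype_order(order, type)
		list_for_each_entry(page,
				    &zone->free_area[order].free_list[type],
				    lru)
			cb(opaque, page_to_pfn(page), order);
	spin_unlock_irqrestore(&zone->lock, flags);
}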
Michael S. Tsirkin
2017-Jul-13 00:44 UTC
[PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:> Add a new feature, VIRTIO_BALLOON_F_SG, which enables to > transfer a chunk of ballooned (i.e. inflated/deflated) pages using > scatter-gather lists to the host. > > The implementation of the previous virtio-balloon is not very > efficient, because the balloon pages are transferred to the > host one by one. Here is the breakdown of the time in percentage > spent on each step of the balloon inflating process (inflating > 7GB of an 8GB idle guest). > > 1) allocating pages (6.5%) > 2) sending PFNs to host (68.3%) > 3) address translation (6.1%) > 4) madvise (19%) > > It takes about 4126ms for the inflating process to complete. > The above profiling shows that the bottlenecks are stage 2) > and stage 4). > > This patch optimizes step 2) by transferring pages to the host in > sgs. An sg describes a chunk of guest physically continuous pages. > With this mechanism, step 4) can also be optimized by doing address > translation and madvise() in chunks rather than page by page. > > With this new feature, the above ballooning process takes ~491ms > resulting in an improvement of ~88%. > > TODO: optimize stage 1) by allocating/freeing a chunk of pages > instead of a single page each time. > > Signed-off-by: Wei Wang <wei.w.wang at intel.com> > Signed-off-by: Liang Li <liang.z.li at intel.com> > Suggested-by: Michael S. Tsirkin <mst at redhat.com> > --- > drivers/virtio/virtio_balloon.c | 141 ++++++++++++++++++++++--- > drivers/virtio/virtio_ring.c | 199 +++++++++++++++++++++++++++++++++--- > include/linux/virtio.h | 20 ++++ > include/uapi/linux/virtio_balloon.h | 1 + > 4 files changed, 329 insertions(+), 32 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index f0b3a0b..aa4e7ec 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -32,6 +32,7 @@ > #include <linux/mm.h> > #include <linux/mount.h> > #include <linux/magic.h> > +#include <linux/xbitmap.h> > > /* > * Balloon device works in 4K page units. So each page is pointed to by > @@ -79,6 +80,9 @@ struct virtio_balloon { > /* Synchronize access/update to this struct virtio_balloon elements */ > struct mutex balloon_lock; > > + /* The xbitmap used to record ballooned pages */ > + struct xb page_xb; > + > /* The array of pfns we tell the Host about. */ > unsigned int num_pfns; > __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; > @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb, > page_to_balloon_pfn(page) + i); > } > > +/* > + * Send balloon pages in sgs to host. > + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap > + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for > + * continuous "1" bits, which correspond to continuous pages, to chunk into > + * sgs. 
> + * > + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that > + * need to be serached.searched> + */ > +static void tell_host_sgs(struct virtio_balloon *vb, > + struct virtqueue *vq, > + unsigned long page_xb_start, > + unsigned long page_xb_end) > +{ > + unsigned int head_id = VIRTQUEUE_DESC_ID_INIT, > + prev_id = VIRTQUEUE_DESC_ID_INIT; > + unsigned long sg_pfn_start, sg_pfn_end; > + uint64_t sg_addr; > + uint32_t sg_size; > + > + sg_pfn_start = page_xb_start; > + while (sg_pfn_start < page_xb_end) { > + sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start, > + page_xb_end, 1); > + if (sg_pfn_start == page_xb_end + 1) > + break; > + sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1, > + page_xb_end, 0); > + sg_addr = sg_pfn_start << PAGE_SHIFT; > + sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE;There's an issue here - this might not fit in uint32_t. You need to limit sg_pfn_end - something like: /* make sure sg_size below fits in a 32 bit integer */ sg_pfn_end = min(sg_pfn_end, sg_pfn_start + UINT_MAX >> PAGE_SIZE);> + virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id, > + &prev_id, 0); > + xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end); > + sg_pfn_start = sg_pfn_end + 1; > + } > + > + if (head_id != VIRTQUEUE_DESC_ID_INIT) { > + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); > + virtqueue_kick_async(vq, vb->acked); > + } > +} > + > +/* Update pfn_max and pfn_min according to the pfn of @page */ > +static inline void update_pfn_range(struct virtio_balloon *vb, > + struct page *page, > + unsigned long *pfn_min, > + unsigned long *pfn_max) > +{ > + unsigned long pfn = page_to_pfn(page); > + > + *pfn_min = min(pfn, *pfn_min); > + *pfn_max = max(pfn, *pfn_max); > +} > + > static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > { > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; > unsigned num_allocated_pages; > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; > > /* We can only do one array worth at a time. */ > - num = min(num, ARRAY_SIZE(vb->pfns)); > + if (!use_sg) > + num = min(num, ARRAY_SIZE(vb->pfns)); > > mutex_lock(&vb->balloon_lock); > for (vb->num_pfns = 0; vb->num_pfns < num; > @@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > msleep(200); > break; > } > - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + if (use_sg) { > + update_pfn_range(vb, page, &pfn_min, &pfn_max); > + xb_set_bit(&vb->page_xb, page_to_pfn(page)); > + } else { > + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + } > vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; > if (!virtio_has_feature(vb->vdev, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) > @@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > > num_allocated_pages = vb->num_pfns; > /* Did we get any? 
*/ > - if (vb->num_pfns != 0) > - tell_host(vb, vb->inflate_vq); > + if (vb->num_pfns != 0) { > + if (use_sg) > + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max); > + else > + tell_host(vb, vb->inflate_vq); > + } > mutex_unlock(&vb->balloon_lock); > > return num_allocated_pages; > @@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > struct page *page; > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; > LIST_HEAD(pages); > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; > > - /* We can only do one array worth at a time. */ > - num = min(num, ARRAY_SIZE(vb->pfns)); > + /* Traditionally, we can only do one array worth at a time. */ > + if (!use_sg) > + num = min(num, ARRAY_SIZE(vb->pfns)); > > mutex_lock(&vb->balloon_lock); > /* We can't release more pages than taken */ > @@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > page = balloon_page_dequeue(vb_dev_info); > if (!page) > break; > - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + if (use_sg) { > + update_pfn_range(vb, page, &pfn_min, &pfn_max); > + xb_set_bit(&vb->page_xb, page_to_pfn(page)); > + } else { > + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + } > list_add(&page->lru, &pages); > vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; > } > @@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); > * is true, we *have* to do it in this order > */ > - if (vb->num_pfns != 0) > - tell_host(vb, vb->deflate_vq); > + if (vb->num_pfns != 0) { > + if (use_sg) > + tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max); > + else > + tell_host(vb, vb->deflate_vq); > + } > release_pages_balloon(vb, &pages); > mutex_unlock(&vb->balloon_lock); > return num_freed_pages; > @@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb) > } > > #ifdef CONFIG_BALLOON_COMPACTION > + > +static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, > + struct page *page) > +{ > + unsigned int id = VIRTQUEUE_DESC_ID_INIT; > + u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; > + > + virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); > + virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); > + virtqueue_kick_async(vq, vb->acked); > +} > + > /* > * virtballoon_migratepage - perform the balloon page migration on behalf of > * a compation thread. 
(called under page lock) > @@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, > { > struct virtio_balloon *vb = container_of(vb_dev_info, > struct virtio_balloon, vb_dev_info); > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > unsigned long flags; > > /* > @@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, > vb_dev_info->isolated_pages--; > __count_vm_event(BALLOON_MIGRATE); > spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); > - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > - set_page_pfns(vb, vb->pfns, newpage); > - tell_host(vb, vb->inflate_vq); > - > + if (use_sg) { > + tell_host_one_page(vb, vb->inflate_vq, newpage); > + } else { > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > + set_page_pfns(vb, vb->pfns, newpage); > + tell_host(vb, vb->inflate_vq); > + } > /* balloon's page migration 2nd step -- deflate "page" */ > balloon_page_delete(page); > - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > - set_page_pfns(vb, vb->pfns, page); > - tell_host(vb, vb->deflate_vq); > - > + if (use_sg) { > + tell_host_one_page(vb, vb->deflate_vq, page); > + } else { > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > + set_page_pfns(vb, vb->pfns, page); > + tell_host(vb, vb->deflate_vq); > + } > mutex_unlock(&vb->balloon_lock); > > put_page(page); /* balloon reference */ > @@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev) > if (err) > goto out_free_vb; > > + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) > + xb_init(&vb->page_xb); > + > vb->nb.notifier_call = virtballoon_oom_notify; > vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; > err = register_oom_notifier(&vb->nb); > @@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev) > cancel_work_sync(&vb->update_balloon_size_work); > cancel_work_sync(&vb->update_balloon_stats_work); > > + xb_empty(&vb->page_xb); > remove_common(vb); > #ifdef CONFIG_BALLOON_COMPACTION > if (vb->vb_dev_info.inode) > @@ -669,6 +775,7 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_MUST_TELL_HOST, > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > + VIRTIO_BALLOON_F_SG, > }; > > static struct virtio_driver virtio_balloon_driver = { > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 5e1b548..b9d7e10 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, > struct vring_virtqueue *vq = to_vvq(_vq); > struct scatterlist *sg; > struct vring_desc *desc; > - unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; > + unsigned int i, n, descs_used, uninitialized_var(prev), err_id; > int head; > bool indirect; > > @@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq, > else > vq->free_head = i; > > - /* Store token and indirect buffer state. */ > + END_USE(vq); > + > + return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx); > + > +unmap_release: > + err_id = i; > + i = head; > + > + for (n = 0; n < total_sg; n++) { > + if (i == err_id) > + break; > + vring_unmap_one(vq, &desc[i]); > + i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); > + } > + > + vq->vq.num_free += total_sg; > + > + if (indirect) > + kfree(desc); > + > + END_USE(vq); > + return -EIO; > +} > + > +/** > + * virtqueue_add_chain - expose a chain of buffers to the other end > + * @_vq: the struct virtqueue we're talking about. 
> + * @head: desc id of the chain head. > + * @indirect: set if the chain of descs are indrect descs. > + * @indir_desc: the first indirect desc. > + * @data: the token identifying the chain. > + * @ctx: extra context for the token. > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indir_desc, > + void *data, > + void *ctx) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + unsigned int avail; > + > + /* The desc chain is empty. */ > + if (head == VIRTQUEUE_DESC_ID_INIT) > + return 0; > + > + START_USE(vq); > + > + if (unlikely(vq->broken)) { > + END_USE(vq); > + return -EIO; > + } > + > vq->desc_state[head].data = data; > if (indirect) > - vq->desc_state[head].indir_desc = desc; > + vq->desc_state[head].indir_desc = indir_desc; > if (ctx) > vq->desc_state[head].indir_desc = ctx; > > @@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq, > virtqueue_kick(_vq); > > return 0; > +} > +EXPORT_SYMBOL_GPL(virtqueue_add_chain); > > -unmap_release: > - err_idx = i; > - i = head; > +/** > + * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc > + * @vq: the struct virtqueue we're talking about. > + * @addr: address of the buffer to add. > + * @len: length of the buffer. > + * @head_id: desc id of the chain head. > + * @prev_id: desc id of the previous buffer. > + * @in: set if the buffer is for the device to write. > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + struct vring_desc *desc = vq->vring.desc; > + uint16_t flags = in ? VRING_DESC_F_WRITE : 0; > + unsigned int i; > > - for (n = 0; n < total_sg; n++) { > - if (i == err_idx) > - break; > - vring_unmap_one(vq, &desc[i]); > - i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); > + /* Sanity check */ > + if (!_vq || !head_id || !prev_id) > + return -EINVAL; > +retry: > + START_USE(vq); > + if (unlikely(vq->broken)) { > + END_USE(vq); > + return -EIO; > } > > - vq->vq.num_free += total_sg; > + if (vq->vq.num_free < 1) { > + /* > + * If there is no desc avail in the vq, so kick what is > + * already added, and re-start to build a new chain for > + * the passed sg. 
> + */ > + if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) { > + END_USE(vq); > + virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL); > + virtqueue_kick_sync(_vq); > + *head_id = VIRTQUEUE_DESC_ID_INIT; > + *prev_id = VIRTQUEUE_DESC_ID_INIT; > + goto retry; > + } else { > + END_USE(vq); > + return -ENOSPC; > + } > + } > > - if (indirect) > - kfree(desc); > + i = vq->free_head; > + flags &= ~VRING_DESC_F_NEXT; > + desc[i].flags = cpu_to_virtio16(_vq->vdev, flags); > + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); > + desc[i].len = cpu_to_virtio32(_vq->vdev, len); > + > + /* Add the desc to the end of the chain */ > + if (*prev_id != VIRTQUEUE_DESC_ID_INIT) { > + desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i); > + desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev, > + VRING_DESC_F_NEXT); > + } > + *prev_id = i; > + if (*head_id == VIRTQUEUE_DESC_ID_INIT) > + *head_id = *prev_id; > > + vq->vq.num_free--; > + vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next); > END_USE(vq); > - return -EIO; > + > + return 0; > } > +EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc); > > /** > * virtqueue_add_sgs - expose buffers to other end > @@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq) > } > EXPORT_SYMBOL_GPL(virtqueue_kick); > > +/** > + * virtqueue_kick_sync - update after add_buf and busy wait till update is done > + * @vq: the struct virtqueue > + * > + * After one or more virtqueue_add_* calls, invoke this to kick > + * the other side. Busy wait till the other side is done with the update. > + * > + * Caller must ensure we don't call this with other virtqueue > + * operations at the same time (except where noted). > + * > + * Returns false if kick failed, otherwise true. > + */ > +bool virtqueue_kick_sync(struct virtqueue *vq) > +{ > + u32 len; > + > + if (likely(virtqueue_kick(vq))) { > + while (!virtqueue_get_buf(vq, &len) && > + !virtqueue_is_broken(vq)) > + cpu_relax(); > + return true; > + } > + return false; > +} > +EXPORT_SYMBOL_GPL(virtqueue_kick_sync); > + > +/** > + * virtqueue_kick_async - update after add_buf and blocking till update is done > + * @vq: the struct virtqueue > + * > + * After one or more virtqueue_add_* calls, invoke this to kick > + * the other side. Blocking till the other side is done with the update. > + * > + * Caller must ensure we don't call this with other virtqueue > + * operations at the same time (except where noted). > + * > + * Returns false if kick failed, otherwise true. > + */ > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq) > +{ > + u32 len; > + > + if (likely(virtqueue_kick(vq))) { > + wait_event(wq, virtqueue_get_buf(vq, &len)); > + return true; > + } > + return false; > +} > +EXPORT_SYMBOL_GPL(virtqueue_kick_async); > +This happens to 1. drop the buf 2. not do the right thing if more than one is in flight which means this API isn't all that useful. 
Even balloon might benefit from keeping multiple bufs in flight down the road.> static void detach_buf(struct vring_virtqueue *vq, unsigned int head, > void **ctx) > { > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 28b0e96..9f27101 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > void *data, > gfp_t gfp); > > +/* A desc with this init id is treated as an invalid desc */ > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in); > + > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indirect_desc, > + void *data, > + void *ctx); > + > bool virtqueue_kick(struct virtqueue *vq); > > +bool virtqueue_kick_sync(struct virtqueue *vq); > + > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > + > bool virtqueue_kick_prepare(struct virtqueue *vq); > > bool virtqueue_notify(struct virtqueue *vq); > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index 343d7dd..37780a7 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -34,6 +34,7 @@ > #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ > #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ > +#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ > > /* Size of a PFN in the balloon interface. */ > #define VIRTIO_BALLOON_PFN_SHIFT 12 > -- > 2.7.4
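To make the 32-bit overflow point above concrete: sg_size is a uint32_t, so one run of set bits must be capped before computing the byte length. A sketch of the relevant lines of tell_host_sgs() with the clamp applied, using the variable names from the patch (note UINT_MAX has to be shifted by PAGE_SHIFT, not PAGE_SIZE, and before the addition):

		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
					      page_xb_end, 0);
		/* Keep sg_size below representable in a uint32_t. */
		sg_pfn_end = min(sg_pfn_end,
				 sg_pfn_start + (UINT_MAX >> PAGE_SHIFT));
		sg_addr = (uint64_t)sg_pfn_start << PAGE_SHIFT;
		sg_size = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;

When the clamp actually truncates a run, the remainder has to be picked up by the next iteration: the loop should continue scanning from sg_pfn_end rather than sg_pfn_end + 1, and only the bits that were reported should be cleared from the xbitmap.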
kbuild test robot
2017-Jul-13 01:16 UTC
[PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
Hi Wei,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.12 next-20170712]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956
config: i386-randconfig-x071-07121639 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   drivers//virtio/virtio_balloon.c: In function 'tell_host_one_page':
>> drivers//virtio/virtio_balloon.c:535:39: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
       virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
                                            ^

vim +535 drivers//virtio/virtio_balloon.c

   527
   528	static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq,
   529					struct page *page)
   530	{
   531		unsigned int id = VIRTQUEUE_DESC_ID_INIT;
   532		u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;
   533
   534		virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0);
 > 535		virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
   536		virtqueue_kick_async(vq, vb->acked);
   537	}
   538

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.gz
Type: application/gzip
Size: 29833 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20170713/05064c09/attachment-0001.bin>
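The warning comes from squeezing a 64-bit guest physical address into a 32-bit pointer on i386, which is not only noisy but lossy above 4GB. One hedged way to rework it, not taken from the posted series, is to stop smuggling the address through the void * token and pass the struct page instead:

/*
 * Sketch of a fix for the i386 warning: the token handed to
 * virtqueue_add_chain() is only used to identify the completed buffer,
 * so pass the page pointer (lossless) rather than a casted u64 address.
 * Also cast the pfn before shifting so the address itself cannot be
 * truncated on 32-bit.
 */
static void tell_host_one_page(struct virtio_balloon *vb,
			       struct virtqueue *vq, struct page *page)
{
	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
	u64 addr = (u64)page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;

	virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0);
	virtqueue_add_chain(vq, id, 0, NULL, page, NULL);
	virtqueue_kick_async(vq, vb->acked);
}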
kbuild test robot
2017-Jul-13 04:21 UTC
[PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
Hi Wei,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12 next-20170712]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

>> ERROR: ".xb_set_bit" [drivers/virtio/virtio_balloon.ko] undefined!
>> ERROR: ".xb_zero" [drivers/virtio/virtio_balloon.ko] undefined!
>> ERROR: ".xb_find_next_bit" [drivers/virtio/virtio_balloon.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.gz
Type: application/gzip
Size: 23467 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20170713/e06828ec/attachment-0001.bin>
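The undefined references appear when virtio_balloon is built as a module: the xbitmap helpers are linked into the kernel image but never exported. Assuming they live in lib/radix-tree.c as the series' diffstat indicates, the usual remedy is along these lines (a sketch, not part of the posted patches):

/*
 * lib/radix-tree.c: let modular users such as virtio_balloon.ko link
 * against the xbitmap helpers added earlier in this series.
 */
EXPORT_SYMBOL(xb_set_bit);
EXPORT_SYMBOL(xb_zero);
EXPORT_SYMBOL(xb_find_next_bit);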
On Wed 12-07-17 20:40:19, Wei Wang wrote:> This patch adds support for reporting blocks of pages on the free list > specified by the caller. > > As pages can leave the free list during this call or immediately > afterwards, they are not guaranteed to be free after the function > returns. The only guarantee this makes is that the page was on the free > list at some point in time after the function has been invoked. > > Therefore, it is not safe for caller to use any pages on the returned > block or to discard data that is put there after the function returns. > However, it is safe for caller to discard data that was in one of these > pages before the function was invoked.I do not understand what is the point of such a function and how it is used because the patch doesn't give us any user (I haven't checked other patches yet). But just from the semantic point of view this sounds like a horrible idea. The only way to get a free block of pages is to call the page allocator. I am tempted to give it Nack right on those grounds but I would like to hear more about what you actually want to achieve.> Signed-off-by: Wei Wang <wei.w.wang at intel.com> > Signed-off-by: Liang Li <liang.z.li at intel.com> > --- > include/linux/mm.h | 5 +++ > mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 101 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 46b9ac5..76cb433 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > unsigned long zone_start_pfn, unsigned long *zholes_size); > extern void free_initmem(void); > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > +extern int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, > + struct page **page); > +#endif > /* > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > * into the buddy system. The freed pages will be poisoned with pattern > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 64b7d82..8b3c9dd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > show_swap_cache_info(); > } > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > + > +/* > + * Heuristically get a page block in the system that is unused. > + * It is possible that pages from the page block are used immediately after > + * report_unused_page_block() returns. It is the caller's responsibility > + * to either detect or prevent the use of such pages. > + * > + * The free list to check: zone->free_area[order].free_list[migratetype]. > + * > + * If the caller supplied page block (i.e. **page) is on the free list, offer > + * the next page block on the list to the caller. Otherwise, offer the first > + * page block on the list. > + * > + * Note: it is not safe for caller to use any pages on the returned > + * block or to discard data that is put there after the function returns. > + * However, it is safe for caller to discard data that was in one of these > + * pages before the function was invoked. > + * > + * Return 0 when a page block is found on the caller specified free list. 
> + */ > +int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, struct page **page) > +{ > + struct zone *this_zone; > + struct list_head *this_list; > + int ret = 0; > + unsigned long flags; > + > + /* Sanity check */ > + if (zone == NULL || page == NULL || order >= MAX_ORDER || > + migratetype >= MIGRATE_TYPES) > + return -EINVAL; > + > + /* Zone validity check */ > + for_each_populated_zone(this_zone) { > + if (zone == this_zone) > + break; > + } > + > + /* Got a non-existent zone from the caller? */ > + if (zone != this_zone) > + return -EINVAL;Huh, what do you check for here? Why don't you simply populated_zone(zone)?> + > + spin_lock_irqsave(&this_zone->lock, flags); > + > + this_list = &zone->free_area[order].free_list[migratetype]; > + if (list_empty(this_list)) { > + *page = NULL; > + ret = 1; > + goto out; > + } > + > + /* The caller is asking for the first free page block on the list */ > + if ((*page) == NULL) { > + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller is not on this free list > + * anymore (e.g. a 1MB free page block has been split). In this case, > + * offer the first page block on the free list that the caller is > + * asking for. > + */ > + if (PageBuddy(*page) && order != page_order(*page)) { > + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller has been the last page block > + * on the list. > + */ > + if ((*page)->lru.next == this_list) { > + *page = NULL; > + ret = 1; > + goto out; > + } > + > + /* > + * Finally, fall into the regular case: the page block passed from the > + * caller is still on the free list. Offer the next one. > + */ > + *page = list_next_entry((*page), lru); > + ret = 0; > +out: > + spin_unlock_irqrestore(&this_zone->lock, flags); > + return ret; > +} > +EXPORT_SYMBOL(report_unused_page_block); > + > +#endif > + > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > { > zoneref->zone = zone; > -- > 2.7.4 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo at kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont at kvack.org"> email at kvack.org </a>-- Michal Hocko SUSE Labs
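The simplification being asked for above would look roughly like this sketch: drop the open-coded walk over all populated zones and rely on populated_zone(), the existing helper that checks zone->present_pages, together with the plain argument checks.

	/* Sanity checks, with the open-coded zone walk dropped. */
	if (!zone || !page || order >= MAX_ORDER ||
	    migratetype >= MIGRATE_TYPES || !populated_zone(zone))
		return -EINVAL;

	spin_lock_irqsave(&zone->lock, flags);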
Michal Hocko
2017-Jul-14 12:31 UTC
[PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat
On Wed 12-07-17 20:40:20, Wei Wang wrote:
> This patch enables for_each_zone()/for_each_populated_zone() to be
> invoked by a kernel module.

This needs much better justification with an example of who is going to
use these symbols and what for.

> Signed-off-by: Wei Wang <wei.w.wang at intel.com>
> ---
>  mm/mmzone.c | 2 ++
>  1 file changed, 2 insertions(+)
[...]

--
Michal Hocko
SUSE Labs
On 07/12/2017 08:40 PM, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables to
> transfer a chunk of ballooned (i.e. inflated/deflated) pages using
> scatter-gather lists to the host.
>
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
>
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
>
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
>
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
>
> With this new feature, the above ballooning process takes ~491ms
> resulting in an improvement of ~88%.

I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499, which
zeros all the ballooned pages; that is very time consuming. Tests show that
the time to balloon 7GB of pages increases from ~491 ms to 2.8 seconds with
that patch applied.

How about moving the zero operation to the hypervisor? That way, we would
have a much faster balloon process.

Best,
Wei
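For context on moving the zeroing to the hypervisor: when the host backs guest RAM with private anonymous memory and discards an inflated range, the pages read back as zero-filled on the next access anyway, so a guest-side memset adds no protection in that configuration. A minimal userspace sketch of the host-side discard; this is illustrative only, not QEMU's actual implementation, and hva is assumed to be the host virtual address that the guest-physical chunk maps to:

#include <sys/mman.h>
#include <stddef.h>

/*
 * Discard one inflated guest chunk.  For a private anonymous mapping,
 * the kernel supplies zero-filled pages on the next fault, so no
 * explicit zeroing is needed on either side.
 */
static int discard_guest_chunk(void *hva, size_t len)
{
	return madvise(hva, len, MADV_DONTNEED);
}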