This patch series enhances the existing virtio-balloon with the following new features: 1) fast ballooning: transfer ballooned pages between the guest and host in chunks using sgs, instead of one array each time; and 2) free page block reporting: a new virtqueue to report guest free pages to the host. The second feature can be used to accelerate live migration of VMs. Here are some details: Live migration needs to transfer the VM's memory from the source machine to the destination round by round. For the 1st round, all the VM's memory is transferred. From the 2nd round, only the pieces of memory that were written by the guest (after the 1st round) are transferred. One method that is popularly used by the hypervisor to track which part of memory is written is to write-protect all the guest memory. The second feature enables the optimization of the 1st round memory transfer - the hypervisor can skip the transfer of guest free pages in the 1st round. It is not concerned that the memory pages are used after they are given to the hypervisor as a hint of the free pages, because they will be tracked by the hypervisor and transferred in the next round if they are used and written. ChangeLog: v16->v17: 1) patch 1: please check the commit log there; 2) patch 3: included Michael S. Tsirkin patch to fix the potential deadlock issue; 3) patch 4: use BUG_ON if virtqueue_add_ returns error, which is expected never to happen; 4) patch 4: add leak_balloon_sg_oom, which is used in the oom case when VIRTIO_BALLOON_F_SG is in use; 5) patch 6: use config registers, instead of a vq, as the command channel between the host and guest; 6) patch 6: add the command sequence id support. v15->v16: 1) mm: stop reporting the free pfn range if the callback returns false; 2) mm: move some implementaion of walk_free_mem_block into a function to make the code layout looks better; 3) xbitmap: added some optimizations suggested by Matthew, please refer to the ChangLog in the xbitmap patch for details. 4) xbitmap: added a test suite 5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf returns an error 6) virtio-balloon: some small code re-arrangement, e.g. detachinf used buf from the vq before adding a new buf v14->v15: 1) mm: make the report callback return a bool value - returning 1 to stop walking through the free page list. 2) virtio-balloon: batching sgs of balloon pages till the vq is full 3) virtio-balloon: create a new workqueue, rather than using the default system_wq, to queue the free page reporting work item. 4) virtio-balloon: add a ctrl_vq to be a central control plane which will handle all the future control related commands between the host and guest. Add free page report as the first feature controlled under ctrl_vq, and the free_page_vq is a data plane vq dedicated to the transmission of free page blocks. v13->v14: 1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c. 2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into one xb_bit_ops. 3) xbitmap: add documents for the exported APIs. 4) mm: rewrite the function to walk through free page blocks. 5) virtio-balloon: when reporting a free page blcok to the device, if the vq is full (less likey to happen in practice), just skip reporting this block, instead of busywaiting till an entry gets released. 6) virtio-balloon: fail the probe function if adding the signal buf in init_vqs fails. v12->v13: 1) mm: use a callback function to handle the the free page blocks from the report function. This avoids exposing the zone internal to a kernel module. 2) virtio-balloon: send balloon pages or a free page block using a single sg each time. This has the benefits of simpler implementation with no new APIs. 3) virtio-balloon: the free_page_vq is used to report free pages only (no multiple usages interleaving) 4) virtio-balloon: Balloon pages and free page blocks are sent via input sgs, and the completion signal to the host is sent via an output sg. v11->v12: 1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages. 2) virtio-ring: enable the driver to build up a desc chain using vring desc. 3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro to lock/unlock the vq when a vq operation starts/ends. 4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async() 5) virtio-balloon: describe chunks of ballooned pages and free pages blocks directly using one or more chains of desc from the vq. v10->v11: 1) virtio_balloon: use vring_desc to describe a chunk; 2) virtio_ring: support to add an indirect desc table to virtqueue; 3) virtio_balloon: use cmdq to report guest memory statistics. v9->v10: 1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON; 2) virtio-balloon: add virtballoon_validate(); 3) virtio-balloon: msg format change; 4) virtio-balloon: move miscq handling to a task on system_freezable_wq; 5) virtio-balloon: code cleanup. v8->v9: 1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous implementation; 2) Simpler function to get the free page block. v7->v8: 1) Use only one chunk format, instead of two. 2) re-write the virtio-balloon implementation patch. 3) commit changes 4) patch re-org Matthew Wilcox (2): lib/xbitmap: Introduce xbitmap radix tree test suite: add tests for xbitmap Michael S. Tsirkin (1): mm/balloon_compaction.c: split balloon page allocation and enqueue Wei Wang (3): virtio-balloon: VIRTIO_BALLOON_F_SG mm: support reporting free page blocks virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ drivers/virtio/virtio_balloon.c | 489 +++++++++++++++++++++++++++++--- include/linux/balloon_compaction.h | 34 ++- include/linux/mm.h | 6 + include/linux/radix-tree.h | 2 + include/linux/xbitmap.h | 67 +++++ include/uapi/linux/virtio_balloon.h | 12 + lib/Makefile | 2 +- lib/radix-tree.c | 51 +++- lib/xbitmap.c | 283 ++++++++++++++++++ mm/balloon_compaction.c | 28 +- mm/page_alloc.c | 91 ++++++ tools/include/linux/bitmap.h | 34 +++ tools/include/linux/kernel.h | 2 + tools/testing/radix-tree/Makefile | 7 +- tools/testing/radix-tree/linux/kernel.h | 2 - tools/testing/radix-tree/main.c | 5 + tools/testing/radix-tree/test.h | 1 + tools/testing/radix-tree/xbitmap.c | 278 ++++++++++++++++++ 18 files changed, 1335 insertions(+), 59 deletions(-) create mode 100644 include/linux/xbitmap.h create mode 100644 lib/xbitmap.c create mode 100644 tools/testing/radix-tree/xbitmap.c -- 2.7.4
From: Matthew Wilcox <mawilcox at microsoft.com> The eXtensible Bitmap is a sparse bitmap representation which is efficient for set bits which tend to cluster. It supports up to 'unsigned long' worth of bits, and this commit adds the bare bones -- xb_set_bit(), xb_clear_bit(), xb_clear_bit_range(), xb_test_bit(), xb_find_next_set_bit(), xb_find_next_zero_bit(). More possible optimizations to add in the future: 1) xb_set_bit_range: set a range of bits. 2) when searching a bit, if the bit is not found in the slot, move on to the next slot directly. 3) add Tags to help searching. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Cc: Matthew Wilcox <mawilcox at microsoft.com> Cc: Andrew Morton <akpm at linux-foundation.org> Cc: Michal Hocko <mhocko at kernel.org> Cc: Michael S. Tsirkin <mst at redhat.com> Cc: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp> v16->v17 ChangeLog: 1) xb_preload: allocate ida bitmap before __radix_tree_preload() to avoid kmalloc with preemption disabled. Also change this function to return with preemption not disabled on error. 2) xb_preload_and_set_bit: a wrapper of xb_preload and xb_set_bit, for the convenience of usage. v15->v16 ChangeLog: 1) coding style - separate small functions for bit set/clear/test; 2) Clear a range of bits in a more efficient way: A) clear a range of bits from the same ida bitmap directly rather than search the bitmap again for each bit; B) when the range of bits to clear covers the whole ida bitmap, directly free the bitmap - no need to zero the bitmap first. 3) more efficient bit searching, like 2.A. --- include/linux/radix-tree.h | 2 + include/linux/xbitmap.h | 67 +++++++++++ lib/Makefile | 2 +- lib/radix-tree.c | 51 +++++++- lib/xbitmap.c | 283 +++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 402 insertions(+), 3 deletions(-) create mode 100644 include/linux/xbitmap.h create mode 100644 lib/xbitmap.c diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 567ebb5..1d6d6f6 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *, const struct radix_tree_iter *, void __rcu **slot, void *entry); void radix_tree_replace_slot(struct radix_tree_root *, void __rcu **slot, void *entry); +bool __radix_tree_delete(struct radix_tree_root *root, + struct radix_tree_node *node, void __rcu **slot); void __radix_tree_delete_node(struct radix_tree_root *, struct radix_tree_node *, radix_tree_update_node_t update_node, diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h new file mode 100644 index 0000000..00b59c3 --- /dev/null +++ b/include/linux/xbitmap.h @@ -0,0 +1,67 @@ +/* + * eXtensible Bitmaps + * Copyright (c) 2017 Microsoft Corporation <mawilcox at microsoft.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility. + * All bits are initially zero. + */ + +#ifndef __XBITMAP_H__ +#define __XBITMAP_H__ + +#include <linux/idr.h> + +struct xb { + struct radix_tree_root xbrt; +}; + +#define XB_INIT { \ + .xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT), \ +} +#define DEFINE_XB(name) struct xb name = XB_INIT + +static inline void xb_init(struct xb *xb) +{ + INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT); +} + +int xb_set_bit(struct xb *xb, unsigned long bit); +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp); +bool xb_test_bit(struct xb *xb, unsigned long bit); +void xb_clear_bit(struct xb *xb, unsigned long bit); +unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start, + unsigned long end); +unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start, + unsigned long end); +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end); + +/* Check if the xb tree is empty */ +static inline bool xb_is_empty(const struct xb *xb) +{ + return radix_tree_empty(&xb->xbrt); +} + +bool xb_preload(gfp_t gfp); + +/** + * xb_preload_end - end preload section started with xb_preload() + * + * Each xb_preload() should be matched with an invocation of this + * function. See xb_preload() for details. + */ +static inline void xb_preload_end(void) +{ + preempt_enable(); +} + +#endif diff --git a/lib/Makefile b/lib/Makefile index dafa796..082361b 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n lib-y := ctype.o string.o vsprintf.o cmdline.o \ rbtree.o radix-tree.o dump_stack.o timerqueue.o\ - idr.o int_sqrt.o extable.o \ + idr.o xbitmap.o int_sqrt.o extable.o \ sha1.o chacha20.o irq_regs.o argv_split.o \ flex_proportions.o ratelimit.o show_mem.o \ is_single_threaded.o plist.o decompress.o kobject_uevent.o \ diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 8b1feca..269a5cc 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep; #define IDA_PRELOAD_SIZE (IDA_MAX_PATH * 2 - 1) /* + * The xbitmap implementation supports up to ULONG_MAX bits, and it is + * implemented based on ida bitmaps. So, given an unsigned long index, + * the high order XB_INDEX_BITS bits of the index is used to find the + * corresponding item (i.e. ida bitmap) from the radix tree, and the low + * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into + * the ida bitmap to find the bit. + */ +#define XB_INDEX_BITS (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS)) +#define XB_MAX_PATH (DIV_ROUND_UP(XB_INDEX_BITS, \ + RADIX_TREE_MAP_SHIFT)) +#define XB_PRELOAD_SIZE (XB_MAX_PATH * 2 - 1) + +/* * Per-cpu pool of preloaded nodes */ struct radix_tree_preload { @@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, offset, 0, 0); if (!child) return -ENOMEM; + if (is_idr(root)) + all_tag_set(child, IDR_FREE); rcu_assign_pointer(*slot, node_to_entry(child)); if (node) node->count++; @@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root, delete_node(root, node, update_node, private); } -static bool __radix_tree_delete(struct radix_tree_root *root, - struct radix_tree_node *node, void __rcu **slot) +bool __radix_tree_delete(struct radix_tree_root *root, + struct radix_tree_node *node, void __rcu **slot) { void *old = rcu_dereference_raw(*slot); int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0; @@ -2005,6 +2020,38 @@ static bool __radix_tree_delete(struct radix_tree_root *root, } /** + * xb_preload - preload for xb_set_bit() + * @gfp_mask: allocation mask to use for preloading + * + * Preallocate memory to use for the next call to xb_set_bit(). On success, + * return true, with preemption disabled. On error, return false with + * preemption not disabled. + */ +bool xb_preload(gfp_t gfp) +{ + if (!this_cpu_read(ida_bitmap)) { + struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp); + + if (!bitmap) + return false; + /* + * The per-CPU variable is updated with preemption enabled. + * If the calling task is unlucky to be scheduled to another + * CPU which has no ida_bitmap allocation, it will be detected + * when setting a bit (i.e. __xb_set_bit()). + */ + bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap); + kfree(bitmap); + } + + if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0) + return false; + + return true; +} +EXPORT_SYMBOL(xb_preload); + +/** * radix_tree_iter_delete - delete the entry at this iterator position * @root: radix tree root * @iter: iterator state diff --git a/lib/xbitmap.c b/lib/xbitmap.c new file mode 100644 index 0000000..3e891d6 --- /dev/null +++ b/lib/xbitmap.c @@ -0,0 +1,283 @@ +#include <linux/slab.h> +#include <linux/xbitmap.h> + +/** + * xb_set_bit - set a bit in the xbitmap + * @xb: the xbitmap tree used to record the bit + * @bit: index of the bit to set + * + * Returns: 0 on success; -EAGAIN on error, and the caller is expected to + * restart from xb_preload. + */ +int xb_set_bit(struct xb *xb, unsigned long bit) +{ + int err; + unsigned long index = bit / IDA_BITMAP_BITS; + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned long ebit; + + bit %= IDA_BITMAP_BITS; + ebit = bit + 2; + + err = __radix_tree_create(root, index, 0, &node, &slot); + if (err) + return err; + bitmap = rcu_dereference_raw(*slot); + if (radix_tree_exception(bitmap)) { + unsigned long tmp = (unsigned long)bitmap; + + if (ebit < BITS_PER_LONG) { + tmp |= 1UL << ebit; + rcu_assign_pointer(*slot, (void *)tmp); + return 0; + } + bitmap = this_cpu_xchg(ida_bitmap, NULL); + if (!bitmap) + return -EAGAIN; + memset(bitmap, 0, sizeof(*bitmap)); + bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT; + rcu_assign_pointer(*slot, bitmap); + } + + if (!bitmap) { + if (ebit < BITS_PER_LONG) { + bitmap = (void *)((1UL << ebit) | + RADIX_TREE_EXCEPTIONAL_ENTRY); + __radix_tree_replace(root, node, slot, bitmap, NULL, + NULL); + return 0; + } + bitmap = this_cpu_xchg(ida_bitmap, NULL); + if (!bitmap) + return -EAGAIN; + memset(bitmap, 0, sizeof(*bitmap)); + __radix_tree_replace(root, node, slot, bitmap, NULL, NULL); + } + + __set_bit(bit, bitmap->bitmap); + return 0; +} +EXPORT_SYMBOL(xb_set_bit); + +/** + * xb_preload_and_set_bit - preload the memory and set a bit in the xbitmap + * @xb: the xbitmap tree used to record the bit + * @bit: index of the bit to set + * + * A wrapper of the xb_preload() and xb_set_bit(). + * Returns: 0 on success; -EAGAIN or -ENOMEM on error. + */ +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp) +{ + int ret = 0; + + if (!xb_preload(gfp)) + return -ENOMEM; + + ret = xb_set_bit(xb, bit); + xb_preload_end(); + + return ret; +} +EXPORT_SYMBOL(xb_preload_and_set_bit); + +/** + * xb_clear_bit - clear a bit in the xbitmap + * @xb: the xbitmap tree used to record the bit + * @bit: index of the bit to clear + * + * This function is used to clear a bit in the xbitmap. If all the bits of the + * bitmap are 0, the bitmap will be freed. + */ +void xb_clear_bit(struct xb *xb, unsigned long bit) +{ + unsigned long index = bit / IDA_BITMAP_BITS; + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned long ebit; + + bit %= IDA_BITMAP_BITS; + ebit = bit + 2; + + bitmap = __radix_tree_lookup(root, index, &node, &slot); + if (radix_tree_exception(bitmap)) { + unsigned long tmp = (unsigned long)bitmap; + + if (ebit >= BITS_PER_LONG) + return; + tmp &= ~(1UL << ebit); + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY) + __radix_tree_delete(root, node, slot); + else + rcu_assign_pointer(*slot, (void *)tmp); + return; + } + + if (!bitmap) + return; + + __clear_bit(bit, bitmap->bitmap); + if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) { + kfree(bitmap); + __radix_tree_delete(root, node, slot); + } +} +EXPORT_SYMBOL(xb_clear_bit); + +/** + * xb_clear_bit - clear a range of bits in the xbitmap + * @start: the start of the bit range, inclusive + * @end: the end of the bit range, inclusive + * + * This function is used to clear a bit in the xbitmap. If all the bits of the + * bitmap are 0, the bitmap will be freed. + */ +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end) +{ + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned int nbits; + + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { + unsigned long index = start / IDA_BITMAP_BITS; + unsigned long bit = start % IDA_BITMAP_BITS; + + bitmap = __radix_tree_lookup(root, index, &node, &slot); + if (radix_tree_exception(bitmap)) { + unsigned long ebit = bit + 2; + unsigned long tmp = (unsigned long)bitmap; + + nbits = min(end - start + 1, BITS_PER_LONG - ebit); + + if (ebit >= BITS_PER_LONG) + continue; + bitmap_clear(&tmp, ebit, nbits); + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY) + __radix_tree_delete(root, node, slot); + else + rcu_assign_pointer(*slot, (void *)tmp); + } else if (bitmap) { + nbits = min(end - start + 1, IDA_BITMAP_BITS - bit); + + if (nbits != IDA_BITMAP_BITS) + bitmap_clear(bitmap->bitmap, bit, nbits); + + if (nbits == IDA_BITMAP_BITS || + bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) { + kfree(bitmap); + __radix_tree_delete(root, node, slot); + } + } + } +} +EXPORT_SYMBOL(xb_clear_bit_range); + +/** + * xb_test_bit - test a bit in the xbitmap + * @xb: the xbitmap tree used to record the bit + * @bit: index of the bit to test + * + * This function is used to test a bit in the xbitmap. + * Returns: 1 if the bit is set, or 0 otherwise. + */ +bool xb_test_bit(struct xb *xb, unsigned long bit) +{ + unsigned long index = bit / IDA_BITMAP_BITS; + const struct radix_tree_root *root = &xb->xbrt; + struct ida_bitmap *bitmap = radix_tree_lookup(root, index); + + bit %= IDA_BITMAP_BITS; + + if (!bitmap) + return false; + if (radix_tree_exception(bitmap)) { + bit += RADIX_TREE_EXCEPTIONAL_SHIFT; + if (bit > BITS_PER_LONG) + return false; + return (unsigned long)bitmap & (1UL << bit); + } + + return test_bit(bit, bitmap->bitmap); +} +EXPORT_SYMBOL(xb_test_bit); + +static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start, + unsigned long end, bool set) +{ + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bmap; + unsigned long ret = end + 1; + + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { + unsigned long index = start / IDA_BITMAP_BITS; + unsigned long bit = start % IDA_BITMAP_BITS; + + bmap = __radix_tree_lookup(root, index, &node, &slot); + if (radix_tree_exception(bmap)) { + unsigned long tmp = (unsigned long)bmap; + unsigned long ebit = bit + 2; + + if (ebit >= BITS_PER_LONG) + continue; + if (set) + ret = find_next_bit(&tmp, BITS_PER_LONG, ebit); + else + ret = find_next_zero_bit(&tmp, BITS_PER_LONG, + ebit); + if (ret < BITS_PER_LONG) + return ret - 2 + IDA_BITMAP_BITS * index; + } else if (bmap) { + if (set) + ret = find_next_bit(bmap->bitmap, + IDA_BITMAP_BITS, bit); + else + ret = find_next_zero_bit(bmap->bitmap, + IDA_BITMAP_BITS, bit); + if (ret < IDA_BITMAP_BITS) + return ret + index * IDA_BITMAP_BITS; + } else if (!bmap && !set) { + return start; + } + } + + return ret; +} + +/** + * xb_find_next_set_bit - find the next set bit in a range + * @xb: the xbitmap to search + * @start: the start of the range, inclusive + * @end: the end of the range, inclusive + * + * Returns: the index of the found bit, or @end + 1 if no such bit is found. + */ +unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start, + unsigned long end) +{ + return xb_find_next_bit(xb, start, end, 1); +} +EXPORT_SYMBOL(xb_find_next_set_bit); + +/** + * xb_find_next_zero_bit - find the next zero bit in a range + * @xb: the xbitmap to search + * @start: the start of the range, inclusive + * @end: the end of the range, inclusive + * + * Returns: the index of the found bit, or @end + 1 if no such bit is found. + */ +unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start, + unsigned long end) +{ + return xb_find_next_bit(xb, start, end, 0); +} +EXPORT_SYMBOL(xb_find_next_zero_bit); -- 2.7.4
Wei Wang
2017-Nov-03 08:13 UTC
[PATCH v17 2/6] radix tree test suite: add tests for xbitmap
From: Matthew Wilcox <mawilcox at microsoft.com> Add the following tests for xbitmap: 1) single bit test: single bit set/clear/find; 2) bit range test: set/clear a range of bits and find a 0 or 1 bit in the range. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Cc: Matthew Wilcox <mawilcox at microsoft.com> Cc: Andrew Morton <akpm at linux-foundation.org> Cc: Michael S. Tsirkin <mst at redhat.com> --- tools/include/linux/bitmap.h | 34 ++++ tools/include/linux/kernel.h | 2 + tools/testing/radix-tree/Makefile | 7 +- tools/testing/radix-tree/linux/kernel.h | 2 - tools/testing/radix-tree/main.c | 5 + tools/testing/radix-tree/test.h | 1 + tools/testing/radix-tree/xbitmap.c | 278 ++++++++++++++++++++++++++++++++ 7 files changed, 326 insertions(+), 3 deletions(-) create mode 100644 tools/testing/radix-tree/xbitmap.c diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h index e8b9f51..890dab2 100644 --- a/tools/include/linux/bitmap.h +++ b/tools/include/linux/bitmap.h @@ -36,6 +36,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits) } } +static inline void __bitmap_clear(unsigned long *map, unsigned int start, + int len) +{ + unsigned long *p = map + BIT_WORD(start); + const unsigned int size = start + len; + int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG); + unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start); + + while (len - bits_to_clear >= 0) { + *p &= ~mask_to_clear; + len -= bits_to_clear; + bits_to_clear = BITS_PER_LONG; + mask_to_clear = ~0UL; + p++; + } + if (len) { + mask_to_clear &= BITMAP_LAST_WORD_MASK(size); + *p &= ~mask_to_clear; + } +} + +static inline __always_inline void bitmap_clear(unsigned long *map, + unsigned int start, + unsigned int nbits) +{ + if (__builtin_constant_p(nbits) && nbits == 1) + __clear_bit(start, map); + else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) && + __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8)) + memset((char *)map + start / 8, 0, nbits / 8); + else + __bitmap_clear(map, start, nbits); +} + static inline void bitmap_fill(unsigned long *dst, unsigned int nbits) { unsigned int nlongs = BITS_TO_LONGS(nbits); diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h index 77d2e94..21e90ee 100644 --- a/tools/include/linux/kernel.h +++ b/tools/include/linux/kernel.h @@ -12,6 +12,8 @@ #define UINT_MAX (~0U) #endif +#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0) + #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d)) #define PERF_ALIGN(x, a) __PERF_ALIGN_MASK(x, (typeof(x))(a)-1) diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile index 6a9480c..fc7cb422 100644 --- a/tools/testing/radix-tree/Makefile +++ b/tools/testing/radix-tree/Makefile @@ -5,7 +5,8 @@ LDLIBS+= -lpthread -lurcu TARGETS = main idr-test multiorder CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \ - tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o + tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \ + xbitmap.o ifndef SHIFT SHIFT=3 @@ -24,6 +25,9 @@ idr-test: idr-test.o $(CORE_OFILES) multiorder: multiorder.o $(CORE_OFILES) +xbitmap: xbitmap.o $(CORE_OFILES) + $(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap + clean: $(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h @@ -33,6 +37,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \ ../../include/linux/*.h \ ../../include/asm/*.h \ ../../../include/linux/radix-tree.h \ + ../../../include/linux/xbitmap.h \ ../../../include/linux/idr.h radix-tree.c: ../../../lib/radix-tree.c diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h index b21a77f..c1e6088 100644 --- a/tools/testing/radix-tree/linux/kernel.h +++ b/tools/testing/radix-tree/linux/kernel.h @@ -16,6 +16,4 @@ #define pr_debug printk #define pr_cont printk -#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) - #endif /* _KERNEL_H */ diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c index bc9a784..6f4774e 100644 --- a/tools/testing/radix-tree/main.c +++ b/tools/testing/radix-tree/main.c @@ -337,6 +337,11 @@ static void single_thread_tests(bool long_run) rcu_barrier(); printv(2, "after copy_tag_check: %d allocated, preempt %d\n", nr_allocated, preempt_count); + + xbitmap_checks(); + rcu_barrier(); + printv(2, "after xbitmap_checks: %d allocated, preempt %d\n", + nr_allocated, preempt_count); } int main(int argc, char **argv) diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h index 0f8220c..f8dcdaa 100644 --- a/tools/testing/radix-tree/test.h +++ b/tools/testing/radix-tree/test.h @@ -36,6 +36,7 @@ void iteration_test(unsigned order, unsigned duration); void benchmark(void); void idr_checks(void); void ida_checks(void); +void xbitmap_checks(void); void ida_thread_tests(void); struct item * diff --git a/tools/testing/radix-tree/xbitmap.c b/tools/testing/radix-tree/xbitmap.c new file mode 100644 index 0000000..bee8a38 --- /dev/null +++ b/tools/testing/radix-tree/xbitmap.c @@ -0,0 +1,278 @@ +#include <linux/bitmap.h> +#include <linux/slab.h> +#include <linux/kernel.h> +#include "../../../include/linux/xbitmap.h" + +static DEFINE_XB(xb1); + +int xb_set_bit(struct xb *xb, unsigned long bit) +{ + int err; + unsigned long index = bit / IDA_BITMAP_BITS; + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned long ebit; + + bit %= IDA_BITMAP_BITS; + ebit = bit + 2; + + err = __radix_tree_create(root, index, 0, &node, &slot); + if (err) + return err; + bitmap = rcu_dereference_raw(*slot); + if (radix_tree_exception(bitmap)) { + unsigned long tmp = (unsigned long)bitmap; + + if (ebit < BITS_PER_LONG) { + tmp |= 1UL << ebit; + rcu_assign_pointer(*slot, (void *)tmp); + return 0; + } + bitmap = this_cpu_xchg(ida_bitmap, NULL); + if (!bitmap) + return -EAGAIN; + memset(bitmap, 0, sizeof(*bitmap)); + bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT; + rcu_assign_pointer(*slot, bitmap); + } + + if (!bitmap) { + if (ebit < BITS_PER_LONG) { + bitmap = (void *)((1UL << ebit) | + RADIX_TREE_EXCEPTIONAL_ENTRY); + __radix_tree_replace(root, node, slot, bitmap, NULL, + NULL); + return 0; + } + bitmap = this_cpu_xchg(ida_bitmap, NULL); + if (!bitmap) + return -EAGAIN; + memset(bitmap, 0, sizeof(*bitmap)); + __radix_tree_replace(root, node, slot, bitmap, NULL, NULL); + } + + __set_bit(bit, bitmap->bitmap); + return 0; +} + +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp) +{ + int ret = 0; + + if (!xb_preload(gfp)) + return -ENOMEM; + + ret = xb_set_bit(xb, bit); + xb_preload_end(); + + return ret; +} + +bool xb_test_bit(struct xb *xb, unsigned long bit) +{ + unsigned long index = bit / IDA_BITMAP_BITS; + const struct radix_tree_root *root = &xb->xbrt; + struct ida_bitmap *bitmap = radix_tree_lookup(root, index); + + bit %= IDA_BITMAP_BITS; + + if (!bitmap) + return false; + if (radix_tree_exception(bitmap)) { + bit += RADIX_TREE_EXCEPTIONAL_SHIFT; + if (bit > BITS_PER_LONG) + return false; + return (unsigned long)bitmap & (1UL << bit); + } + + return test_bit(bit, bitmap->bitmap); +} + +void xb_clear_bit(struct xb *xb, unsigned long bit) +{ + unsigned long index = bit / IDA_BITMAP_BITS; + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned long ebit; + + bit %= IDA_BITMAP_BITS; + ebit = bit + 2; + + bitmap = __radix_tree_lookup(root, index, &node, &slot); + if (radix_tree_exception(bitmap)) { + unsigned long tmp = (unsigned long)bitmap; + + if (ebit >= BITS_PER_LONG) + return; + tmp &= ~(1UL << ebit); + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY) + __radix_tree_delete(root, node, slot); + else + rcu_assign_pointer(*slot, (void *)tmp); + return; + } + + if (!bitmap) + return; + + __clear_bit(bit, bitmap->bitmap); + if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) { + kfree(bitmap); + __radix_tree_delete(root, node, slot); + } +} + +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end) +{ + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bitmap; + unsigned int nbits; + + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { + unsigned long index = start / IDA_BITMAP_BITS; + unsigned long bit = start % IDA_BITMAP_BITS; + + bitmap = __radix_tree_lookup(root, index, &node, &slot); + if (radix_tree_exception(bitmap)) { + unsigned long ebit = bit + 2; + unsigned long tmp = (unsigned long)bitmap; + + nbits = min(end - start + 1, BITS_PER_LONG - ebit); + + if (ebit >= BITS_PER_LONG) + continue; + bitmap_clear(&tmp, ebit, nbits); + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY) + __radix_tree_delete(root, node, slot); + else + rcu_assign_pointer(*slot, (void *)tmp); + } else if (bitmap) { + nbits = min(end - start + 1, IDA_BITMAP_BITS - bit); + + if (nbits != IDA_BITMAP_BITS) + bitmap_clear(bitmap->bitmap, bit, nbits); + + if (nbits == IDA_BITMAP_BITS || + bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) { + kfree(bitmap); + __radix_tree_delete(root, node, slot); + } + } + } +} + +static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start, + unsigned long end, bool set) +{ + struct radix_tree_root *root = &xb->xbrt; + struct radix_tree_node *node; + void **slot; + struct ida_bitmap *bmap; + unsigned long ret = end + 1; + + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { + unsigned long index = start / IDA_BITMAP_BITS; + unsigned long bit = start % IDA_BITMAP_BITS; + + bmap = __radix_tree_lookup(root, index, &node, &slot); + if (radix_tree_exception(bmap)) { + unsigned long tmp = (unsigned long)bmap; + unsigned long ebit = bit + 2; + + if (ebit >= BITS_PER_LONG) + continue; + if (set) + ret = find_next_bit(&tmp, BITS_PER_LONG, ebit); + else + ret = find_next_zero_bit(&tmp, BITS_PER_LONG, + ebit); + if (ret < BITS_PER_LONG) + return ret - 2 + IDA_BITMAP_BITS * index; + } else if (bmap) { + if (set) + ret = find_next_bit(bmap->bitmap, + IDA_BITMAP_BITS, bit); + else + ret = find_next_zero_bit(bmap->bitmap, + IDA_BITMAP_BITS, bit); + if (ret < IDA_BITMAP_BITS) + return ret + index * IDA_BITMAP_BITS; + } else if (!bmap && !set) { + return start; + } + } + + return ret; +} + +unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start, + unsigned long end) +{ + return xb_find_next_bit(xb, start, end, 1); +} + +unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start, + unsigned long end) +{ + return xb_find_next_bit(xb, start, end, 0); +} + +static void xbitmap_check_bit(unsigned long bit) +{ + assert(!xb_test_bit(&xb1, bit)); + assert(!xb_preload_and_set_bit(&xb1, bit, GFP_KERNEL)); + assert(xb_test_bit(&xb1, bit)); + xb_clear_bit(&xb1, bit); + assert(xb_is_empty(&xb1)); +} + +static void xbitmap_check_bit_range(void) +{ + xb_preload(GFP_KERNEL); + + /* Set a range of bits */ + assert(!xb_set_bit(&xb1, 1060)); + assert(!xb_set_bit(&xb1, 1061)); + assert(!xb_set_bit(&xb1, 1064)); + assert(!xb_set_bit(&xb1, 1065)); + assert(!xb_set_bit(&xb1, 8180)); + assert(!xb_set_bit(&xb1, 8181)); + assert(!xb_set_bit(&xb1, 8190)); + assert(!xb_set_bit(&xb1, 8191)); + + /* Test a range of bits */ + assert(xb_find_next_set_bit(&xb1, 0, 10000) == 1060); + assert(xb_find_next_zero_bit(&xb1, 1061, 10000) == 1062); + assert(xb_find_next_set_bit(&xb1, 1062, 10000) == 1064); + assert(xb_find_next_zero_bit(&xb1, 1065, 10000) == 1066); + assert(xb_find_next_set_bit(&xb1, 1066, 10000) == 8180); + assert(xb_find_next_zero_bit(&xb1, 8180, 10000) == 8182); + xb_clear_bit_range(&xb1, 0, 1000000); + assert(xb_find_next_set_bit(&xb1, 0, 10000) == 10001); + + assert(xb_find_next_zero_bit(&xb1, 20000, 30000) == 20000); + + xb_preload_end(); +} + +void xbitmap_checks(void) +{ + xb_init(&xb1); + + xbitmap_check_bit(0); + xbitmap_check_bit(30); + xbitmap_check_bit(31); + xbitmap_check_bit(1023); + xbitmap_check_bit(1024); + xbitmap_check_bit(1025); + xbitmap_check_bit((1UL << 63) | (1UL << 24)); + xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70); + + xbitmap_check_bit_range(); +} -- 2.7.4
Wei Wang
2017-Nov-03 08:13 UTC
[PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue
From: "Michael S. Tsirkin" <mst.redhat.com> fill_balloon doing memory allocations under balloon_lock can cause a deadlock when leak_balloon is called from virtballoon_oom_notify and tries to take same lock. To fix, split page allocation and enqueue and do allocations outside the lock. Here's a detailed analysis of the deadlock by Tetsuo Handa: In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to serialize against fill_balloon(). But in fill_balloon(), alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE] implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY is specified, this allocation attempt might indirectly depend on somebody else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via virtballoon_oom_notify() via blocking_notifier_call_chain() callback via out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it will cause OOM lockup. Thus, do not wait for vb->balloon_lock mutex if leak_balloon() is called from out_of_memory(). Thread1 Thread2 fill_balloon() takes a balloon_lock balloon_page_enqueue() alloc_page(GFP_HIGHUSER_MOVABLE) direct reclaim (__GFP_FS context) takes a fs lock waits for that fs lock alloc_page(GFP_NOFS) __alloc_pages_may_oom() takes the oom_lock out_of_memory() blocking_notifier_call_chain() leak_balloon() tries to take that balloon_lock and deadlocks Reported-by: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp> Signed-off-by: Michael S. Tsirkin <mst at redhat.com> Cc: Michal Hocko <mhocko at suse.com> Cc: Wei Wang <wei.w.wang at intel.com> Reviewed-by: Wei Wang <wei.w.wang at intel.com> --- drivers/virtio/virtio_balloon.c | 23 ++++++++++++++++++----- include/linux/balloon_compaction.h | 34 +++++++++++++++++++++++++++++++++- mm/balloon_compaction.c | 28 +++++++++++++++++++++------- 3 files changed, 72 insertions(+), 13 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f0b3a0b..45fe6a8 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -143,16 +143,17 @@ static void set_page_pfns(struct virtio_balloon *vb, static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) { - struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; unsigned num_allocated_pages; + unsigned int num_pfns; + struct page *page; + LIST_HEAD(pages); /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb->pfns)); - mutex_lock(&vb->balloon_lock); - for (vb->num_pfns = 0; vb->num_pfns < num; - vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { - struct page *page = balloon_page_enqueue(vb_dev_info); + for (num_pfns = 0; num_pfns < num; + num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { + struct page *page = balloon_page_alloc(); if (!page) { dev_info_ratelimited(&vb->vdev->dev, @@ -162,6 +163,18 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) msleep(200); break; } + + balloon_page_push(&pages, page); + } + + mutex_lock(&vb->balloon_lock); + + vb->num_pfns = 0; + while ((page = balloon_page_pop(&pages))) { + balloon_page_enqueue(&vb->vb_dev_info, page); + + vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; if (!virtio_has_feature(vb->vdev, diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 79542b2..bdc055a 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -49,6 +49,7 @@ #include <linux/gfp.h> #include <linux/err.h> #include <linux/fs.h> +#include <linux/list.h> /* * Balloon device information descriptor. @@ -66,7 +67,9 @@ struct balloon_dev_info { struct inode *inode; }; -extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info); +extern struct page *balloon_page_alloc(void); +extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info, + struct page *page); extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info); static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) @@ -86,6 +89,35 @@ extern void balloon_page_putback(struct page *page); extern int balloon_page_migrate(struct address_space *mapping, struct page *newpage, struct page *page, enum migrate_mode mode); +/* + * balloon_page_push - insert a page into a page list. + * @head : pointer to list + * @page : page to be added + * + * Caller must ensure the page is private and protect the list. + */ +static inline void balloon_page_push(struct list_head *pages, struct page *page) +{ + list_add(&page->lru, pages); +} + +/* + * balloon_page_pop - remove a page from a page list. + * @head : pointer to list + * @page : page to be added + * + * Caller must ensure the page is private and protect the list. + */ +static inline struct page *balloon_page_pop(struct list_head *pages) +{ + struct page *page = list_first_entry_or_null(pages, struct page, lru); + + if (!page) + return NULL; + + list_del(&page->lru); + return page; +} /* * balloon_page_insert - insert a page into the balloon's page list and make diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index 68d2892..16212c7 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -11,22 +11,36 @@ #include <linux/balloon_compaction.h> /* - * balloon_page_enqueue - allocates a new page and inserts it into the balloon + * balloon_page_alloc - allocates a new page for insertion into the balloon * page list. + * + * Driver must call it to properly allocate a new enlisted balloon page. + * Driver must call balloon_page_enqueue before definitively removing it from + * the guest system. This function returns the page address for the recently + * allocated page or NULL in the case we fail to allocate a new page this turn. + */ +struct page *balloon_page_alloc(void) +{ + struct page *page = alloc_page(balloon_mapping_gfp_mask() | + __GFP_NOMEMALLOC | __GFP_NORETRY); + return page; +} +EXPORT_SYMBOL_GPL(balloon_page_alloc); + +/* + * balloon_page_enqueue - inserts a new page into the balloon page list. * @b_dev_info: balloon device descriptor where we will insert a new page to + * @page: new page to enqueue - allocated using balloon_page_alloc. * - * Driver must call it to properly allocate a new enlisted balloon page + * Driver must call it to properly enqueue a new enlisted balloon page * before definitively removing it from the guest system. * This function returns the page address for the recently enqueued page or * NULL in the case we fail to allocate a new page this turn. */ -struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info) +void balloon_page_enqueue(struct balloon_dev_info *b_dev_info, + struct page *page) { unsigned long flags; - struct page *page = alloc_page(balloon_mapping_gfp_mask() | - __GFP_NOMEMALLOC | __GFP_NORETRY); - if (!page) - return NULL; /* * Block others from accessing the 'page' when we get around to -- 2.7.4
Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of balloon (i.e. inflated/deflated) pages using scatter-gather lists to the host. The implementation of the previous virtio-balloon is not very efficient, because the balloon pages are transferred to the host one by one. Here is the breakdown of the time in percentage spent on each step of the balloon inflating process (inflating 7GB of an 8GB idle guest). 1) allocating pages (6.5%) 2) sending PFNs to host (68.3%) 3) address translation (6.1%) 4) madvise (19%) It takes about 4126ms for the inflating process to complete. The above profiling shows that the bottlenecks are stage 2) and stage 4). This patch optimizes step 2) by transferring pages to the host in sgs. An sg describes a chunk of guest physically continuous pages. With this mechanism, step 4) can also be optimized by doing address translation and madvise() in chunks rather than page by page. With this new feature, the above ballooning process takes ~492ms resulting in an improvement of ~88%. TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of a single page each time. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Signed-off-by: Liang Li <liang.z.li at intel.com> Suggested-by: Michael S. Tsirkin <mst at redhat.com> Cc: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp> --- drivers/virtio/virtio_balloon.c | 232 +++++++++++++++++++++++++++++++++--- include/uapi/linux/virtio_balloon.h | 1 + 2 files changed, 215 insertions(+), 18 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 45fe6a8..b31fc25 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -32,6 +32,8 @@ #include <linux/mm.h> #include <linux/mount.h> #include <linux/magic.h> +#include <linux/xbitmap.h> +#include <asm/page.h> /* * Balloon device works in 4K page units. So each page is pointed to by @@ -79,6 +81,9 @@ struct virtio_balloon { /* Synchronize access/update to this struct virtio_balloon elements */ struct mutex balloon_lock; + /* The xbitmap used to record balloon pages */ + struct xb page_xb; + /* The array of pfns we tell the Host about. */ unsigned int num_pfns; __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; @@ -141,15 +146,130 @@ static void set_page_pfns(struct virtio_balloon *vb, page_to_balloon_pfn(page) + i); } + +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head) +{ + unsigned int len; + + virtqueue_kick(vq); + wait_event(wq_head, virtqueue_get_buf(vq, &len)); +} + +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size) +{ + struct scatterlist sg; + unsigned int len; + + sg_init_one(&sg, addr, size); + + /* Detach all the used buffers from the vq */ + while (virtqueue_get_buf(vq, &len)) + ; + + return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL); +} + +static void send_balloon_page_sg(struct virtio_balloon *vb, + struct virtqueue *vq, + void *addr, + uint32_t size, + bool batch) +{ + int err; + + err = add_one_sg(vq, addr, size); + /* + * This is expected to never fail: there is always at least 1 entry + * available on the vq, because when the vq is full the worker thread + * that adds the sg will be put into sleep until at least 1 entry is + * available to use. + */ + BUG_ON(err); + + /* If batching is requested, we batch till the vq is full */ + if (!batch || !vq->num_free) + kick_and_wait(vq, vb->acked); +} + +/* + * Send balloon pages in sgs to host. The balloon pages are recorded in the + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE. + * The page xbitmap is searched for continuous "1" bits, which correspond + * to continuous pages, to chunk into sgs. + * + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that + * need to be searched. + */ +static void tell_host_sgs(struct virtio_balloon *vb, + struct virtqueue *vq, + unsigned long page_xb_start, + unsigned long page_xb_end) +{ + unsigned long sg_pfn_start, sg_pfn_end; + void *sg_addr; + uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE); + + sg_pfn_start = page_xb_start; + while (sg_pfn_start < page_xb_end) { + sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start, + page_xb_end); + if (sg_pfn_start == page_xb_end + 1) + break; + sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb, + sg_pfn_start + 1, + page_xb_end); + sg_addr = (void *)pfn_to_kaddr(sg_pfn_start); + sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT; + while (sg_len > sg_max_len) { + send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, + true); + sg_addr += sg_max_len; + sg_len -= sg_max_len; + } + send_balloon_page_sg(vb, vq, sg_addr, sg_len, true); + sg_pfn_start = sg_pfn_end + 1; + } + + /* + * The last few sgs may not reach the batch size, but need a kick to + * notify the device to handle them. + */ + if (vq->num_free != virtqueue_get_vring_size(vq)) + kick_and_wait(vq, vb->acked); + + xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end); +} + +static inline int xb_set_page(struct virtio_balloon *vb, + struct page *page, + unsigned long *pfn_min, + unsigned long *pfn_max) +{ + unsigned long pfn = page_to_pfn(page); + int ret; + + *pfn_min = min(pfn, *pfn_min); + *pfn_max = max(pfn, *pfn_max); + + do { + ret = xb_preload_and_set_bit(&vb->page_xb, pfn, GFP_KERNEL); + } while (unlikely(ret == -EAGAIN)); + + return ret; +} + static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) { unsigned num_allocated_pages; unsigned int num_pfns; struct page *page; LIST_HEAD(pages); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; /* We can only do one array worth at a time. */ - num = min(num, ARRAY_SIZE(vb->pfns)); + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); for (num_pfns = 0; num_pfns < num; num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) { @@ -164,6 +284,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) break; } + if (use_sg && xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) + break; balloon_page_push(&pages, page); } @@ -175,7 +297,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (!use_sg) + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) @@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) num_allocated_pages = vb->num_pfns; /* Did we get any? */ - if (vb->num_pfns != 0) - tell_host(vb, vb->inflate_vq); + if (vb->num_pfns) { + if (use_sg) + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->inflate_vq); + } mutex_unlock(&vb->balloon_lock); return num_allocated_pages; @@ -211,9 +338,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) struct page *page; struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; LIST_HEAD(pages); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; - /* We can only do one array worth at a time. */ - num = min(num, ARRAY_SIZE(vb->pfns)); + /* Traditionally, we can only do one array worth at a time. */ + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(&vb->balloon_lock); /* We can't release more pages than taken */ @@ -223,7 +353,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) page = balloon_page_dequeue(vb_dev_info); if (!page) break; - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (use_sg) { + if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) + break; + } else { + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + } + list_add(&page->lru, &pages); vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } @@ -234,13 +370,56 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - if (vb->num_pfns != 0) - tell_host(vb, vb->deflate_vq); + if (vb->num_pfns) { + if (use_sg) + tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->deflate_vq); + } release_pages_balloon(vb, &pages); mutex_unlock(&vb->balloon_lock); return num_freed_pages; } +/* + * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation + * for xbitmap, which is not suitable for the oom case. This function does not + * use xbitmap to chunk pages, so it can be used by oom notifier to deflate + * pages when VIRTIO_BALLOON_F_SG is negotiated. + */ +static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb) +{ + unsigned int n; + struct page *page; + struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; + struct virtqueue *vq = vb->deflate_vq; + LIST_HEAD(pages); + + mutex_lock(&vb->balloon_lock); + for (n = 0; n < oom_pages; n++) { + page = balloon_page_dequeue(vb_dev_info); + if (!page) + break; + + list_add(&page->lru, &pages); + vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; + send_balloon_page_sg(vb, vq, page_address(page), + VIRTIO_BALLOON_PAGES_PER_PAGE, true); + release_pages_balloon(vb, &pages); + } + + /* + * The last few sgs may not reach the batch size, but need a kick to + * notify the device to handle them. + */ + if (vq->num_free != virtqueue_get_vring_size(vq)) + kick_and_wait(vq, vb->acked); + mutex_unlock(&vb->balloon_lock); + + return n; +} + + static inline void update_stat(struct virtio_balloon *vb, int idx, u16 tag, u64 val) { @@ -380,7 +559,10 @@ static int virtballoon_oom_notify(struct notifier_block *self, return NOTIFY_OK; freed = parm; - num_freed_pages = leak_balloon(vb, oom_pages); + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG)) + num_freed_pages = leak_balloon_sg_oom(vb); + else + num_freed_pages = leak_balloon(vb, oom_pages); update_balloon_size(vb); *freed += num_freed_pages; @@ -454,6 +636,7 @@ static int init_vqs(struct virtio_balloon *vb) } #ifdef CONFIG_BALLOON_COMPACTION + /* * virtballoon_migratepage - perform the balloon page migration on behalf of * a compation thread. (called under page lock) @@ -477,6 +660,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, { struct virtio_balloon *vb = container_of(vb_dev_info, struct virtio_balloon, vb_dev_info); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); unsigned long flags; /* @@ -498,16 +682,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, vb_dev_info->isolated_pages--; __count_vm_event(BALLOON_MIGRATE); spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, newpage); - tell_host(vb, vb->inflate_vq); - + if (use_sg) { + send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage), + PAGE_SIZE, false); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, newpage); + tell_host(vb, vb->inflate_vq); + } /* balloon's page migration 2nd step -- deflate "page" */ balloon_page_delete(page); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, page); - tell_host(vb, vb->deflate_vq); - + if (use_sg) { + send_balloon_page_sg(vb, vb->deflate_vq, page_address(page), + PAGE_SIZE, false); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, page); + tell_host(vb, vb->deflate_vq); + } mutex_unlock(&vb->balloon_lock); put_page(page); /* balloon reference */ @@ -566,6 +758,9 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) + xb_init(&vb->page_xb); + vb->nb.notifier_call = virtballoon_oom_notify; vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; err = register_oom_notifier(&vb->nb); @@ -682,6 +877,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, + VIRTIO_BALLOON_F_SG, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 343d7dd..37780a7 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -34,6 +34,7 @@ #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ +#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 -- 2.7.4
This patch adds support to walk through the free page blocks in the system and report them via a callback function. Some page blocks may leave the free list after zone->lock is released, so it is the caller's responsibility to either detect or prevent the use of such pages. One use example of this patch is to accelerate live migration by skipping the transfer of free pages reported from the guest. A popular method used by the hypervisor to track which part of memory is written during live migration is to write-protect all the guest memory. So, those pages that are reported as free pages but are written after the report function returns will be captured by the hypervisor, and they will be added to the next round of memory transfer. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Signed-off-by: Liang Li <liang.z.li at intel.com> Cc: Michal Hocko <mhocko at kernel.org> Cc: Michael S. Tsirkin <mst at redhat.com> Acked-by: Michal Hocko <mhocko at kernel.org> --- include/linux/mm.h | 6 ++++ mm/page_alloc.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 97 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 065d99d..fe5a90e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1877,6 +1877,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, unsigned long zone_start_pfn, unsigned long *zholes_size); extern void free_initmem(void); +extern void walk_free_mem_block(void *opaque, + int min_order, + bool (*report_pfn_range)(void *opaque, + unsigned long pfn, + unsigned long num)); + /* * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) * into the buddy system. The freed pages will be poisoned with pattern diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 77e4d3c..2283fcc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4867,6 +4867,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) show_swap_cache_info(); } +/* + * Walk through a free page list and report the found pfn range via the + * callback. + * + * Return false if the callback requests to stop reporting. Otherwise, + * return true. + */ +static bool walk_free_page_list(void *opaque, + struct zone *zone, + int order, + enum migratetype mt, + bool (*report_pfn_range)(void *, + unsigned long, + unsigned long)) +{ + struct page *page; + struct list_head *list; + unsigned long pfn, flags; + bool ret; + + spin_lock_irqsave(&zone->lock, flags); + list = &zone->free_area[order].free_list[mt]; + list_for_each_entry(page, list, lru) { + pfn = page_to_pfn(page); + ret = report_pfn_range(opaque, pfn, 1 << order); + if (!ret) + break; + } + spin_unlock_irqrestore(&zone->lock, flags); + + return ret; +} + +/** + * walk_free_mem_block - Walk through the free page blocks in the system + * @opaque: the context passed from the caller + * @min_order: the minimum order of free lists to check + * @report_pfn_range: the callback to report the pfn range of the free pages + * + * If the callback returns false, stop iterating the list of free page blocks. + * Otherwise, continue to report. + * + * Please note that there are no locking guarantees for the callback and + * that the reported pfn range might be freed or disappear after the + * callback returns so the caller has to be very careful how it is used. + * + * The callback itself must not sleep or perform any operations which would + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC) + * or via any lock dependency. It is generally advisable to implement + * the callback as simple as possible and defer any heavy lifting to a + * different context. + * + * There is no guarantee that each free range will be reported only once + * during one walk_free_mem_block invocation. + * + * pfn_to_page on the given range is strongly discouraged and if there is + * an absolute need for that make sure to contact MM people to discuss + * potential problems. + * + * The function itself might sleep so it cannot be called from atomic + * contexts. + * + * In general low orders tend to be very volatile and so it makes more + * sense to query larger ones first for various optimizations which like + * ballooning etc... This will reduce the overhead as well. + */ +void walk_free_mem_block(void *opaque, + int min_order, + bool (*report_pfn_range)(void *opaque, + unsigned long pfn, + unsigned long num)) +{ + struct zone *zone; + int order; + enum migratetype mt; + bool ret; + + for_each_populated_zone(zone) { + for (order = MAX_ORDER - 1; order >= min_order; order--) { + for (mt = 0; mt < MIGRATE_TYPES; mt++) { + ret = walk_free_page_list(opaque, zone, + order, mt, + report_pfn_range); + if (!ret) + return; + } + } + } +} +EXPORT_SYMBOL_GPL(walk_free_mem_block); + static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) { zoneref->zone = zone; -- 2.7.4
Wei Wang
2017-Nov-03 08:13 UTC
[PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the support of reporting hints of guest free pages to the host via virtio-balloon. The host requests the guest to report the free pages by sending commands via the virtio-balloon configuration registers. When the guest starts to report, the first element added to the free page vq is a sequence id of the start reporting command. The id is given by the host, and it indicates whether the following free pages correspond to the command. For example, the host may stop the report and start again with a new command id. The obsolete pages for the previous start command can be detected by the id dismatching on the host. The id is added to the vq using an output buffer, and the free pages are added to the vq using input buffer. Here are some explainations about the added configuration registers: - host2guest_cmd: a register used by the host to send commands to the guest. - guest2host_cmd: written by the guest to ACK to the host about the commands that have been received. The host will clear the corresponding bits on the host2guest_cmd register. The guest also uses this register to send commands to the host (e.g. when finish free page reporting). - free_page_cmd_id: the sequence id of the free page report command given by the host. Signed-off-by: Wei Wang <wei.w.wang at intel.com> Signed-off-by: Liang Li <liang.z.li at intel.com> Cc: Michael S. Tsirkin <mst at redhat.com> Cc: Michal Hocko <mhocko at kernel.org> --- drivers/virtio/virtio_balloon.c | 234 ++++++++++++++++++++++++++++++++---- include/uapi/linux/virtio_balloon.h | 11 ++ 2 files changed, 223 insertions(+), 22 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index b31fc25..4087f04 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt; struct virtio_balloon { struct virtio_device *vdev; - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq; + + /* Balloon's own wq for cpu-intensive work items */ + struct workqueue_struct *balloon_wq; + /* The free page reporting work item submitted to the balloon wq */ + struct work_struct report_free_page_work; /* The balloon servicing is delegated to a freezable workqueue. */ struct work_struct update_balloon_stats_work; @@ -65,6 +70,10 @@ struct virtio_balloon { spinlock_t stop_update_lock; bool stop_update; + /* Stop reporting free pages */ + bool report_free_page_stop; + uint32_t free_page_cmd_id; + /* Waiting for host to ack the pages we released. */ wait_queue_head_t acked; @@ -191,6 +200,30 @@ static void send_balloon_page_sg(struct virtio_balloon *vb, kick_and_wait(vq, vb->acked); } +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size) +{ + int err = 0; + unsigned int len; + + /* Detach all the used buffers from the vq */ + while (virtqueue_get_buf(vq, &len)) + ; + + /* + * Since this is an optimization feature, losing a couple of free + * pages to report isn't important. We simply resturn without adding + * the page if the vq is full. + */ + if (vq->num_free) { + err = add_one_sg(vq, addr, size); + BUG_ON(err); + } + + /* Batch till the vq is full */ + if (!vq->num_free) + virtqueue_kick(vq); +} + /* * Send balloon pages in sgs to host. The balloon pages are recorded in the * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE. @@ -495,9 +528,8 @@ static void stats_handle_request(struct virtio_balloon *vb) virtqueue_kick(vq); } -static void virtballoon_changed(struct virtio_device *vdev) +static void virtballoon_cmd_balloon_memory(struct virtio_balloon *vb) { - struct virtio_balloon *vb = vdev->priv; unsigned long flags; spin_lock_irqsave(&vb->stop_update_lock, flags); @@ -506,6 +538,50 @@ static void virtballoon_changed(struct virtio_device *vdev) spin_unlock_irqrestore(&vb->stop_update_lock, flags); } +static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb) +{ + unsigned long flags; + + vb->report_free_page_stop = false; + spin_lock_irqsave(&vb->stop_update_lock, flags); + if (!vb->stop_update) + queue_work(vb->balloon_wq, &vb->report_free_page_work); + spin_unlock_irqrestore(&vb->stop_update_lock, flags); +} + +static void virtballoon_changed(struct virtio_device *vdev) +{ + struct virtio_balloon *vb = vdev->priv; + u32 host2guest_cmd, guest2host_cmd = 0; + + if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { + virtballoon_cmd_balloon_memory(vb); + return; + } + + virtio_cread(vb->vdev, struct virtio_balloon_config, host2guest_cmd, + &host2guest_cmd); + + if (host2guest_cmd & VIRTIO_BALLOON_CMD_BALLOON_MEMORY) { + virtballoon_cmd_balloon_memory(vb); + guest2host_cmd |= VIRTIO_BALLOON_CMD_BALLOON_MEMORY; + } + + if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START) { + virtballoon_cmd_report_free_page_start(vb); + guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START; + } + + if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP) { + vb->report_free_page_stop = true; + guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP; + } + + /* Ack to the host about the commands that have been received */ + virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd, + &guest2host_cmd); +} + static inline s64 towards_target(struct virtio_balloon *vb) { s64 target; @@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work) queue_work(system_freezable_wq, work); } -static int init_vqs(struct virtio_balloon *vb) +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn, + unsigned long nr_pages) { - struct virtqueue *vqs[3]; - vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request }; - static const char * const names[] = { "inflate", "deflate", "stats" }; - int err, nvqs; + struct virtio_balloon *vb = (struct virtio_balloon *)opaque; + void *addr = (void *)pfn_to_kaddr(pfn); + uint32_t len = nr_pages << PAGE_SHIFT; + + if (vb->report_free_page_stop) + return false; + + send_free_page_sg(vb->free_page_vq, addr, len); + return true; +} + +static void report_free_page_end(struct virtio_balloon *vb) +{ + u32 cmd = VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP; /* - * We expect two virtqueues: inflate and deflate, and - * optionally stat. + * The host may have already requested to stop the reporting before we + * finish, so no need to notify the host in this case. */ - nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; - err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); + if (vb->report_free_page_stop) + return; + vb->report_free_page_stop = true; + + virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd, + &cmd); +} + +static void report_free_page_cmd_id(struct virtio_balloon *vb) +{ + struct scatterlist sg; + int err; + + virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_cmd_id, + &vb->free_page_cmd_id); + sg_init_one(&sg, &vb->free_page_cmd_id, sizeof(uint32_t)); + err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, + &vb->free_page_cmd_id, GFP_KERNEL); + BUG_ON(err); +} + +static void report_free_page(struct work_struct *work) +{ + struct virtio_balloon *vb; + + vb = container_of(work, struct virtio_balloon, report_free_page_work); + report_free_page_cmd_id(vb); + walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages); + /* + * The last few free page blocks that were added may not reach the + * batch size, but need a kick to notify the device to handle them. + */ + virtqueue_kick(vb->free_page_vq); + report_free_page_end(vb); +} + +static int init_vqs(struct virtio_balloon *vb) +{ + struct virtqueue **vqs; + vq_callback_t **callbacks; + const char **names; + struct scatterlist sg; + int i, nvqs, err = -ENOMEM; + + /* Inflateq and deflateq are used unconditionally */ + nvqs = 2; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) + nvqs++; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) + nvqs++; + + /* Allocate space for find_vqs parameters */ + vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL); + if (!vqs) + goto err_vq; + callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL); + if (!callbacks) + goto err_callback; + names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL); + if (!names) + goto err_names; + + callbacks[0] = balloon_ack; + names[0] = "inflate"; + callbacks[1] = balloon_ack; + names[1] = "deflate"; + + i = 2; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { + callbacks[i] = stats_request; + names[i] = "stats"; + i++; + } + + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { + callbacks[i] = NULL; + names[i] = "free_page_vq"; + } + + err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names, + NULL, NULL); if (err) - return err; + goto err_find; vb->inflate_vq = vqs[0]; vb->deflate_vq = vqs[1]; + i = 2; if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { - struct scatterlist sg; - unsigned int num_stats; - vb->stats_vq = vqs[2]; - + vb->stats_vq = vqs[i++]; /* * Prime this virtqueue with one buffer so the hypervisor can * use it to signal us later (it can't be broken yet!). */ - num_stats = update_balloon_stats(vb); - - sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats); + sg_init_one(&sg, vb->stats, sizeof(vb->stats)); if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL) - < 0) - BUG(); + < 0) { + dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n", + __func__); + goto err_find; + } virtqueue_kick(vb->stats_vq); } + + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) + vb->free_page_vq = vqs[i]; + + kfree(names); + kfree(callbacks); + kfree(vqs); return 0; + +err_find: + kfree(names); +err_names: + kfree(callbacks); +err_callback: + kfree(vqs); +err_vq: + return err; } #ifdef CONFIG_BALLOON_COMPACTION @@ -761,6 +942,13 @@ static int virtballoon_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) xb_init(&vb->page_xb); + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { + vb->balloon_wq = alloc_workqueue("balloon-wq", + WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0); + INIT_WORK(&vb->report_free_page_work, report_free_page); + vb->report_free_page_stop = true; + } + vb->nb.notifier_call = virtballoon_oom_notify; vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; err = register_oom_notifier(&vb->nb); @@ -825,6 +1013,7 @@ static void virtballoon_remove(struct virtio_device *vdev) spin_unlock_irq(&vb->stop_update_lock); cancel_work_sync(&vb->update_balloon_size_work); cancel_work_sync(&vb->update_balloon_stats_work); + cancel_work_sync(&vb->report_free_page_work); remove_common(vb); #ifdef CONFIG_BALLOON_COMPACTION @@ -878,6 +1067,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, VIRTIO_BALLOON_F_SG, + VIRTIO_BALLOON_F_FREE_PAGE_VQ, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 37780a7..b758484 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -35,15 +35,26 @@ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ #define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ 4 /* VQ to report free pages */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 +#define VIRTIO_BALLOON_CMD_BALLOON_MEMORY (1 << 0) +#define VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START (1 << 1) +#define VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP (1 << 2) + struct virtio_balloon_config { /* Number of pages host wants Guest to give up. */ __u32 num_pages; /* Number of pages we've actually got in balloon. */ __u32 actual; + /* Host-to-guest command, readonly by guest */ + __u32 host2guest_cmd; + /* Sequence id of the free_page report command, readonly by guest */ + __u32 free_page_cmd_id; + /* Guest-to-host command */ + __u32 guest2host_cmd; }; #define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */ -- 2.7.4
I'm commenting without understanding the logic. Wei Wang wrote:> + > +bool xb_preload(gfp_t gfp); > +Want __must_check annotation, for __radix_tree_preload() is marked with __must_check annotation. By error failing to check result of xb_preload() will lead to preemption kept disabled unexpectedly.> +int xb_set_bit(struct xb *xb, unsigned long bit) > +{ > + int err; > + unsigned long index = bit / IDA_BITMAP_BITS; > + struct radix_tree_root *root = &xb->xbrt; > + struct radix_tree_node *node; > + void **slot; > + struct ida_bitmap *bitmap; > + unsigned long ebit; > + > + bit %= IDA_BITMAP_BITS; > + ebit = bit + 2; > + > + err = __radix_tree_create(root, index, 0, &node, &slot); > + if (err) > + return err; > + bitmap = rcu_dereference_raw(*slot); > + if (radix_tree_exception(bitmap)) { > + unsigned long tmp = (unsigned long)bitmap; > + > + if (ebit < BITS_PER_LONG) { > + tmp |= 1UL << ebit; > + rcu_assign_pointer(*slot, (void *)tmp); > + return 0; > + } > + bitmap = this_cpu_xchg(ida_bitmap, NULL); > + if (!bitmap)Please write locking rules, in order to explain how memory allocated by __radix_tree_create() will not leak.> + return -EAGAIN; > + memset(bitmap, 0, sizeof(*bitmap)); > + bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT; > + rcu_assign_pointer(*slot, bitmap); > + } > + > + if (!bitmap) { > + if (ebit < BITS_PER_LONG) { > + bitmap = (void *)((1UL << ebit) | > + RADIX_TREE_EXCEPTIONAL_ENTRY); > + __radix_tree_replace(root, node, slot, bitmap, NULL, > + NULL); > + return 0; > + } > + bitmap = this_cpu_xchg(ida_bitmap, NULL); > + if (!bitmap)Same here.> + return -EAGAIN; > + memset(bitmap, 0, sizeof(*bitmap)); > + __radix_tree_replace(root, node, slot, bitmap, NULL, NULL); > + } > + > + __set_bit(bit, bitmap->bitmap); > + return 0; > +}> +void xb_clear_bit(struct xb *xb, unsigned long bit) > +{ > + unsigned long index = bit / IDA_BITMAP_BITS; > + struct radix_tree_root *root = &xb->xbrt; > + struct radix_tree_node *node; > + void **slot; > + struct ida_bitmap *bitmap; > + unsigned long ebit; > + > + bit %= IDA_BITMAP_BITS; > + ebit = bit + 2; > + > + bitmap = __radix_tree_lookup(root, index, &node, &slot); > + if (radix_tree_exception(bitmap)) { > + unsigned long tmp = (unsigned long)bitmap; > + > + if (ebit >= BITS_PER_LONG) > + return; > + tmp &= ~(1UL << ebit); > + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY) > + __radix_tree_delete(root, node, slot); > + else > + rcu_assign_pointer(*slot, (void *)tmp); > + return; > + } > + > + if (!bitmap) > + return; > + > + __clear_bit(bit, bitmap->bitmap); > + if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {Please write locking rules, in order to explain how double kfree() and/or use-after-free can be avoided.> + kfree(bitmap); > + __radix_tree_delete(root, node, slot); > + } > +}> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end) > +{ > + struct radix_tree_root *root = &xb->xbrt; > + struct radix_tree_node *node; > + void **slot; > + struct ida_bitmap *bitmap; > + unsigned int nbits; > + > + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { > + unsigned long index = start / IDA_BITMAP_BITS; > + unsigned long bit = start % IDA_BITMAP_BITS; > + > + bitmap = __radix_tree_lookup(root, index, &node, &slot); > + if (radix_tree_exception(bitmap)) { > + unsigned long ebit = bit + 2; > + unsigned long tmp = (unsigned long)bitmap; > + > + nbits = min(end - start + 1, BITS_PER_LONG - ebit); > + > + if (ebit >= BITS_PER_LONG) > + continue; > + bitmap_clear(&tmp, ebit, nbits); > + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY) > + __radix_tree_delete(root, node, slot); > + else > + rcu_assign_pointer(*slot, (void *)tmp); > + } else if (bitmap) { > + nbits = min(end - start + 1, IDA_BITMAP_BITS - bit); > + > + if (nbits != IDA_BITMAP_BITS) > + bitmap_clear(bitmap->bitmap, bit, nbits); > + > + if (nbits == IDA_BITMAP_BITS || > + bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {Same here.> + kfree(bitmap); > + __radix_tree_delete(root, node, slot); > + } > + } > + } > +}> +bool xb_test_bit(struct xb *xb, unsigned long bit) > +{ > + unsigned long index = bit / IDA_BITMAP_BITS; > + const struct radix_tree_root *root = &xb->xbrt; > + struct ida_bitmap *bitmap = radix_tree_lookup(root, index); > + > + bit %= IDA_BITMAP_BITS; > + > + if (!bitmap) > + return false; > + if (radix_tree_exception(bitmap)) { > + bit += RADIX_TREE_EXCEPTIONAL_SHIFT; > + if (bit > BITS_PER_LONG)Why not bit >= BITS_PER_LONG here?> + return false; > + return (unsigned long)bitmap & (1UL << bit); > + } > + > + return test_bit(bit, bitmap->bitmap); > +}
Tetsuo Handa
2017-Nov-03 10:59 UTC
[PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue
Wei Wang wrote:> Here's a detailed analysis of the deadlock by Tetsuo Handa: > > In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to > serialize against fill_balloon(). But in fill_balloon(), > alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is > called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE] > implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY > is specified, this allocation attempt might indirectly depend on somebody > else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect > __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via > virtballoon_oom_notify() via blocking_notifier_call_chain() callback via > out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock > mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it > will cause OOM lockup. Thus, do not wait for vb->balloon_lock mutex if > leak_balloon() is called from out_of_memory().Please drop "Thus, do not wait for vb->balloon_lock mutex if leak_balloon() is called from out_of_memory()." part. This is not what this patch will do.> > Thread1 Thread2 > fill_balloon() > takes a balloon_lock > balloon_page_enqueue() > alloc_page(GFP_HIGHUSER_MOVABLE) > direct reclaim (__GFP_FS context) takes a fs lock > waits for that fs lock alloc_page(GFP_NOFS) > __alloc_pages_may_oom() > takes the oom_lock > out_of_memory() > blocking_notifier_call_chain() > leak_balloon() > tries to take that > balloon_lock and deadlocks
Wei Wang wrote:> @@ -164,6 +284,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > break; > } > > + if (use_sg && xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)Isn't this leaking "page" ?> + break; > balloon_page_push(&pages, page); > } >> @@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > > num_allocated_pages = vb->num_pfns; > /* Did we get any? */ > - if (vb->num_pfns != 0) > - tell_host(vb, vb->inflate_vq); > + if (vb->num_pfns) { > + if (use_sg) > + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);Please describe why tell_host_sgs() can work without __GFP_DIRECT_RECLAIM allocation, for tell_host_sgs() is called with vb->balloon_lock mutex held.> + else > + tell_host(vb, vb->inflate_vq); > + } > mutex_unlock(&vb->balloon_lock); > > return num_allocated_pages;> @@ -223,7 +353,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > page = balloon_page_dequeue(vb_dev_info); > if (!page) > break; > - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + if (use_sg) { > + if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)Isn't this leaking "page" ? If this is inside vb->balloon_lock mutex (isn't this?), xb_set_page() must not use __GFP_DIRECT_RECLAIM allocation, for leak_balloon_sg_oom() will be blocked on vb->balloon_lock mutex.> + break; > + } else { > + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + } > + > list_add(&page->lru, &pages); > vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; > }
Matthew Wilcox
2017-Nov-06 17:00 UTC
[PATCH v17 2/6] radix tree test suite: add tests for xbitmap
On Fri, Nov 03, 2017 at 04:13:02PM +0800, Wei Wang wrote:> From: Matthew Wilcox <mawilcox at microsoft.com> > > Add the following tests for xbitmap: > 1) single bit test: single bit set/clear/find; > 2) bit range test: set/clear a range of bits and find a 0 or 1 bit in > the range. > > Signed-off-by: Wei Wang <wei.w.wang at intel.com> > Cc: Matthew Wilcox <mawilcox at microsoft.com> > Cc: Andrew Morton <akpm at linux-foundation.org> > Cc: Michael S. Tsirkin <mst at redhat.com> > --- > tools/include/linux/bitmap.h | 34 ++++ > tools/include/linux/kernel.h | 2 + > tools/testing/radix-tree/Makefile | 7 +- > tools/testing/radix-tree/linux/kernel.h | 2 - > tools/testing/radix-tree/main.c | 5 + > tools/testing/radix-tree/test.h | 1 + > tools/testing/radix-tree/xbitmap.c | 278 ++++++++++++++++++++++++++++++++Umm. No. You've duplicated xbitmap.c into the test framework, so now it can slowly get out of sync with the one in lib/. That's not OK. Put it back the way it was, with the patch I gave you as patch 1/n (relocating xbitmap.c from tools/testing/radix-tree to lib/). Then add your enhancements as patch 2/n. All you should need to change in your 1/n from http://git.infradead.org/users/willy/linux-dax.git/commit/727e401bee5ad7d37e0077291d90cc17475c6392 is a bit of Makefile tooling. Leave the test suite embedded in the file; that way people might remember to update the test suite when adding new functionality.
Wei Wang
2017-Nov-13 10:34 UTC
[PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
Ping for comments, thanks. On 11/03/2017 04:13 PM, Wei Wang wrote:> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the > support of reporting hints of guest free pages to the host via > virtio-balloon. The host requests the guest to report the free pages by > sending commands via the virtio-balloon configuration registers. > > When the guest starts to report, the first element added to the free page > vq is a sequence id of the start reporting command. The id is given by > the host, and it indicates whether the following free pages correspond > to the command. For example, the host may stop the report and start again > with a new command id. The obsolete pages for the previous start command > can be detected by the id dismatching on the host. The id is added to the > vq using an output buffer, and the free pages are added to the vq using > input buffer. > > Here are some explainations about the added configuration registers: > - host2guest_cmd: a register used by the host to send commands to the > guest. > - guest2host_cmd: written by the guest to ACK to the host about the > commands that have been received. The host will clear the corresponding > bits on the host2guest_cmd register. The guest also uses this register > to send commands to the host (e.g. when finish free page reporting). > - free_page_cmd_id: the sequence id of the free page report command > given by the host. > > Signed-off-by: Wei Wang <wei.w.wang at intel.com> > Signed-off-by: Liang Li <liang.z.li at intel.com> > Cc: Michael S. Tsirkin <mst at redhat.com> > Cc: Michal Hocko <mhocko at kernel.org> > --- > drivers/virtio/virtio_balloon.c | 234 ++++++++++++++++++++++++++++++++---- > include/uapi/linux/virtio_balloon.h | 11 ++ > 2 files changed, 223 insertions(+), 22 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index b31fc25..4087f04 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt; > > struct virtio_balloon { > struct virtio_device *vdev; > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq; > + > + /* Balloon's own wq for cpu-intensive work items */ > + struct workqueue_struct *balloon_wq; > + /* The free page reporting work item submitted to the balloon wq */ > + struct work_struct report_free_page_work; > > /* The balloon servicing is delegated to a freezable workqueue. */ > struct work_struct update_balloon_stats_work; > @@ -65,6 +70,10 @@ struct virtio_balloon { > spinlock_t stop_update_lock; > bool stop_update; > > + /* Stop reporting free pages */ > + bool report_free_page_stop; > + uint32_t free_page_cmd_id; > + > /* Waiting for host to ack the pages we released. */ > wait_queue_head_t acked; > > @@ -191,6 +200,30 @@ static void send_balloon_page_sg(struct virtio_balloon *vb, > kick_and_wait(vq, vb->acked); > } > > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size) > +{ > + int err = 0; > + unsigned int len; > + > + /* Detach all the used buffers from the vq */ > + while (virtqueue_get_buf(vq, &len)) > + ; > + > + /* > + * Since this is an optimization feature, losing a couple of free > + * pages to report isn't important. We simply resturn without adding > + * the page if the vq is full. > + */ > + if (vq->num_free) { > + err = add_one_sg(vq, addr, size); > + BUG_ON(err); > + } > + > + /* Batch till the vq is full */ > + if (!vq->num_free) > + virtqueue_kick(vq); > +} > + > /* > * Send balloon pages in sgs to host. The balloon pages are recorded in the > * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE. > @@ -495,9 +528,8 @@ static void stats_handle_request(struct virtio_balloon *vb) > virtqueue_kick(vq); > } > > -static void virtballoon_changed(struct virtio_device *vdev) > +static void virtballoon_cmd_balloon_memory(struct virtio_balloon *vb) > { > - struct virtio_balloon *vb = vdev->priv; > unsigned long flags; > > spin_lock_irqsave(&vb->stop_update_lock, flags); > @@ -506,6 +538,50 @@ static void virtballoon_changed(struct virtio_device *vdev) > spin_unlock_irqrestore(&vb->stop_update_lock, flags); > } > > +static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb) > +{ > + unsigned long flags; > + > + vb->report_free_page_stop = false; > + spin_lock_irqsave(&vb->stop_update_lock, flags); > + if (!vb->stop_update) > + queue_work(vb->balloon_wq, &vb->report_free_page_work); > + spin_unlock_irqrestore(&vb->stop_update_lock, flags); > +} > + > +static void virtballoon_changed(struct virtio_device *vdev) > +{ > + struct virtio_balloon *vb = vdev->priv; > + u32 host2guest_cmd, guest2host_cmd = 0; > + > + if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { > + virtballoon_cmd_balloon_memory(vb); > + return; > + } > + > + virtio_cread(vb->vdev, struct virtio_balloon_config, host2guest_cmd, > + &host2guest_cmd); > + > + if (host2guest_cmd & VIRTIO_BALLOON_CMD_BALLOON_MEMORY) { > + virtballoon_cmd_balloon_memory(vb); > + guest2host_cmd |= VIRTIO_BALLOON_CMD_BALLOON_MEMORY; > + } > + > + if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START) { > + virtballoon_cmd_report_free_page_start(vb); > + guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START; > + } > + > + if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP) { > + vb->report_free_page_stop = true; > + guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP; > + } > + > + /* Ack to the host about the commands that have been received */ > + virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd, > + &guest2host_cmd); > +} > + > static inline s64 towards_target(struct virtio_balloon *vb) > { > s64 target; > @@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work) > queue_work(system_freezable_wq, work); > } > > -static int init_vqs(struct virtio_balloon *vb) > +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn, > + unsigned long nr_pages) > { > - struct virtqueue *vqs[3]; > - vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request }; > - static const char * const names[] = { "inflate", "deflate", "stats" }; > - int err, nvqs; > + struct virtio_balloon *vb = (struct virtio_balloon *)opaque; > + void *addr = (void *)pfn_to_kaddr(pfn); > + uint32_t len = nr_pages << PAGE_SHIFT; > + > + if (vb->report_free_page_stop) > + return false; > + > + send_free_page_sg(vb->free_page_vq, addr, len); > > + return true; > +} > + > +static void report_free_page_end(struct virtio_balloon *vb) > +{ > + u32 cmd = VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP; > /* > - * We expect two virtqueues: inflate and deflate, and > - * optionally stat. > + * The host may have already requested to stop the reporting before we > + * finish, so no need to notify the host in this case. > */ > - nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; > - err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); > + if (vb->report_free_page_stop) > + return; > + vb->report_free_page_stop = true; > + > + virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd, > + &cmd); > +} > + > +static void report_free_page_cmd_id(struct virtio_balloon *vb) > +{ > + struct scatterlist sg; > + int err; > + > + virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_cmd_id, > + &vb->free_page_cmd_id); > + sg_init_one(&sg, &vb->free_page_cmd_id, sizeof(uint32_t)); > + err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, > + &vb->free_page_cmd_id, GFP_KERNEL); > + BUG_ON(err); > +} > + > +static void report_free_page(struct work_struct *work) > +{ > + struct virtio_balloon *vb; > + > + vb = container_of(work, struct virtio_balloon, report_free_page_work); > + report_free_page_cmd_id(vb); > + walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages); > + /* > + * The last few free page blocks that were added may not reach the > + * batch size, but need a kick to notify the device to handle them. > + */ > + virtqueue_kick(vb->free_page_vq); > + report_free_page_end(vb); > +} > + > +static int init_vqs(struct virtio_balloon *vb) > +{ > + struct virtqueue **vqs; > + vq_callback_t **callbacks; > + const char **names; > + struct scatterlist sg; > + int i, nvqs, err = -ENOMEM; > + > + /* Inflateq and deflateq are used unconditionally */ > + nvqs = 2; > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) > + nvqs++; > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) > + nvqs++; > + > + /* Allocate space for find_vqs parameters */ > + vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL); > + if (!vqs) > + goto err_vq; > + callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL); > + if (!callbacks) > + goto err_callback; > + names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL); > + if (!names) > + goto err_names; > + > + callbacks[0] = balloon_ack; > + names[0] = "inflate"; > + callbacks[1] = balloon_ack; > + names[1] = "deflate"; > + > + i = 2; > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > + callbacks[i] = stats_request; > + names[i] = "stats"; > + i++; > + } > + > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { > + callbacks[i] = NULL; > + names[i] = "free_page_vq"; > + } > + > + err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names, > + NULL, NULL); > if (err) > - return err; > + goto err_find; > > vb->inflate_vq = vqs[0]; > vb->deflate_vq = vqs[1]; > + i = 2; > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > - struct scatterlist sg; > - unsigned int num_stats; > - vb->stats_vq = vqs[2]; > - > + vb->stats_vq = vqs[i++]; > /* > * Prime this virtqueue with one buffer so the hypervisor can > * use it to signal us later (it can't be broken yet!). > */ > - num_stats = update_balloon_stats(vb); > - > - sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats); > + sg_init_one(&sg, vb->stats, sizeof(vb->stats)); > if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL) > - < 0) > - BUG(); > + < 0) { > + dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n", > + __func__); > + goto err_find; > + } > virtqueue_kick(vb->stats_vq); > } > + > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) > + vb->free_page_vq = vqs[i]; > + > + kfree(names); > + kfree(callbacks); > + kfree(vqs); > return 0; > + > +err_find: > + kfree(names); > +err_names: > + kfree(callbacks); > +err_callback: > + kfree(vqs); > +err_vq: > + return err; > } > > #ifdef CONFIG_BALLOON_COMPACTION > @@ -761,6 +942,13 @@ static int virtballoon_probe(struct virtio_device *vdev) > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) > xb_init(&vb->page_xb); > > + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { > + vb->balloon_wq = alloc_workqueue("balloon-wq", > + WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0); > + INIT_WORK(&vb->report_free_page_work, report_free_page); > + vb->report_free_page_stop = true; > + } > + > vb->nb.notifier_call = virtballoon_oom_notify; > vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; > err = register_oom_notifier(&vb->nb); > @@ -825,6 +1013,7 @@ static void virtballoon_remove(struct virtio_device *vdev) > spin_unlock_irq(&vb->stop_update_lock); > cancel_work_sync(&vb->update_balloon_size_work); > cancel_work_sync(&vb->update_balloon_stats_work); > + cancel_work_sync(&vb->report_free_page_work); > > remove_common(vb); > #ifdef CONFIG_BALLOON_COMPACTION > @@ -878,6 +1067,7 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > VIRTIO_BALLOON_F_SG, > + VIRTIO_BALLOON_F_FREE_PAGE_VQ, > }; > > static struct virtio_driver virtio_balloon_driver = { > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index 37780a7..b758484 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -35,15 +35,26 @@ > #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ > #define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ > +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ 4 /* VQ to report free pages */ > > /* Size of a PFN in the balloon interface. */ > #define VIRTIO_BALLOON_PFN_SHIFT 12 > > +#define VIRTIO_BALLOON_CMD_BALLOON_MEMORY (1 << 0) > +#define VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START (1 << 1) > +#define VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP (1 << 2) > + > struct virtio_balloon_config { > /* Number of pages host wants Guest to give up. */ > __u32 num_pages; > /* Number of pages we've actually got in balloon. */ > __u32 actual; > + /* Host-to-guest command, readonly by guest */ > + __u32 host2guest_cmd; > + /* Sequence id of the free_page report command, readonly by guest */ > + __u32 free_page_cmd_id; > + /* Guest-to-host command */ > + __u32 guest2host_cmd; > }; > > #define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
Michael S. Tsirkin
2017-Nov-15 20:32 UTC
[PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the > support of reporting hints of guest free pages to the host via > virtio-balloon. The host requests the guest to report the free pages by > sending commands via the virtio-balloon configuration registers. > > When the guest starts to report, the first element added to the free page > vq is a sequence id of the start reporting command. The id is given by > the host, and it indicates whether the following free pages correspond > to the command. For example, the host may stop the report and start again > with a new command id. The obsolete pages for the previous start command > can be detected by the id dismatching on the host. The id is added to the > vq using an output buffer, and the free pages are added to the vq using > input buffer. > > Here are some explainations about the added configuration registers: > - host2guest_cmd: a register used by the host to send commands to the > guest. > - guest2host_cmd: written by the guest to ACK to the host about the > commands that have been received. The host will clear the corresponding > bits on the host2guest_cmd register. The guest also uses this register > to send commands to the host (e.g. when finish free page reporting). > - free_page_cmd_id: the sequence id of the free page report command > given by the host. > > Signed-off-by: Wei Wang <wei.w.wang at intel.com> > Signed-off-by: Liang Li <liang.z.li at intel.com> > Cc: Michael S. Tsirkin <mst at redhat.com> > Cc: Michal Hocko <mhocko at kernel.org> > --- > drivers/virtio/virtio_balloon.c | 234 ++++++++++++++++++++++++++++++++---- > include/uapi/linux/virtio_balloon.h | 11 ++ > 2 files changed, 223 insertions(+), 22 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index b31fc25..4087f04 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt; > > struct virtio_balloon { > struct virtio_device *vdev; > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq; > + > + /* Balloon's own wq for cpu-intensive work items */ > + struct workqueue_struct *balloon_wq; > + /* The free page reporting work item submitted to the balloon wq */ > + struct work_struct report_free_page_work; > > /* The balloon servicing is delegated to a freezable workqueue. */ > struct work_struct update_balloon_stats_work; > @@ -65,6 +70,10 @@ struct virtio_balloon { > spinlock_t stop_update_lock; > bool stop_update; > > + /* Stop reporting free pages */ > + bool report_free_page_stop; > + uint32_t free_page_cmd_id; > + > /* Waiting for host to ack the pages we released. */ > wait_queue_head_t acked; > > @@ -191,6 +200,30 @@ static void send_balloon_page_sg(struct virtio_balloon *vb, > kick_and_wait(vq, vb->acked); > } > > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size) > +{ > + int err = 0; > + unsigned int len; > + > + /* Detach all the used buffers from the vq */ > + while (virtqueue_get_buf(vq, &len)) > + ; > + > + /* > + * Since this is an optimization feature, losing a couple of free > + * pages to report isn't important. We simply resturn without adding > + * the page if the vq is full. > + */ > + if (vq->num_free) { > + err = add_one_sg(vq, addr, size); > + BUG_ON(err); > + } > + > + /* Batch till the vq is full */ > + if (!vq->num_free) > + virtqueue_kick(vq); > +} > + > /* > * Send balloon pages in sgs to host. The balloon pages are recorded in the > * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE. > @@ -495,9 +528,8 @@ static void stats_handle_request(struct virtio_balloon *vb) > virtqueue_kick(vq); > } > > -static void virtballoon_changed(struct virtio_device *vdev) > +static void virtballoon_cmd_balloon_memory(struct virtio_balloon *vb) > { > - struct virtio_balloon *vb = vdev->priv; > unsigned long flags; > > spin_lock_irqsave(&vb->stop_update_lock, flags); > @@ -506,6 +538,50 @@ static void virtballoon_changed(struct virtio_device *vdev) > spin_unlock_irqrestore(&vb->stop_update_lock, flags); > } > > +static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb) > +{ > + unsigned long flags; > + > + vb->report_free_page_stop = false; > + spin_lock_irqsave(&vb->stop_update_lock, flags); > + if (!vb->stop_update) > + queue_work(vb->balloon_wq, &vb->report_free_page_work); > + spin_unlock_irqrestore(&vb->stop_update_lock, flags); > +} > + > +static void virtballoon_changed(struct virtio_device *vdev) > +{ > + struct virtio_balloon *vb = vdev->priv; > + u32 host2guest_cmd, guest2host_cmd = 0; > + > + if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { > + virtballoon_cmd_balloon_memory(vb); > + return; > + } > + > + virtio_cread(vb->vdev, struct virtio_balloon_config, host2guest_cmd, > + &host2guest_cmd); > + > + if (host2guest_cmd & VIRTIO_BALLOON_CMD_BALLOON_MEMORY) { > + virtballoon_cmd_balloon_memory(vb); > + guest2host_cmd |= VIRTIO_BALLOON_CMD_BALLOON_MEMORY; > + } > + > + if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START) { > + virtballoon_cmd_report_free_page_start(vb); > + guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START; > + } > + > + if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP) { > + vb->report_free_page_stop = true; > + guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP; > + } > + > + /* Ack to the host about the commands that have been received */ > + virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd, > + &guest2host_cmd); > +} > + > static inline s64 towards_target(struct virtio_balloon *vb) > { > s64 target; > @@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work) > queue_work(system_freezable_wq, work); > } > > -static int init_vqs(struct virtio_balloon *vb) > +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn, > + unsigned long nr_pages) > { > - struct virtqueue *vqs[3]; > - vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request }; > - static const char * const names[] = { "inflate", "deflate", "stats" }; > - int err, nvqs; > + struct virtio_balloon *vb = (struct virtio_balloon *)opaque; > + void *addr = (void *)pfn_to_kaddr(pfn); > + uint32_t len = nr_pages << PAGE_SHIFT; > + > + if (vb->report_free_page_stop) > + return false; > + > + send_free_page_sg(vb->free_page_vq, addr, len); > > + return true; > +} > + > +static void report_free_page_end(struct virtio_balloon *vb) > +{ > + u32 cmd = VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP; > /* > - * We expect two virtqueues: inflate and deflate, and > - * optionally stat. > + * The host may have already requested to stop the reporting before we > + * finish, so no need to notify the host in this case. > */ > - nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; > - err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); > + if (vb->report_free_page_stop) > + return; > + vb->report_free_page_stop = true; > + > + virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd, > + &cmd); > +} > + > +static void report_free_page_cmd_id(struct virtio_balloon *vb) > +{ > + struct scatterlist sg; > + int err; > + > + virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_cmd_id, > + &vb->free_page_cmd_id); > + sg_init_one(&sg, &vb->free_page_cmd_id, sizeof(uint32_t)); > + err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, > + &vb->free_page_cmd_id, GFP_KERNEL); > + BUG_ON(err); > +} > + > +static void report_free_page(struct work_struct *work) > +{ > + struct virtio_balloon *vb; > + > + vb = container_of(work, struct virtio_balloon, report_free_page_work); > + report_free_page_cmd_id(vb); > + walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages); > + /* > + * The last few free page blocks that were added may not reach the > + * batch size, but need a kick to notify the device to handle them. > + */ > + virtqueue_kick(vb->free_page_vq); > + report_free_page_end(vb); > +} > +I think there's an issue here: if pages are poisoned and hypervisor subsequently drops them, testing them after allocation will trigger a false positive. The specific configuration: PAGE_POISONING on PAGE_POISONING_NO_SANITY off PAGE_POISONING_ZERO off Solutions: 1. disable the feature in that configuration suggested as an initial step 2. pass poison value to host so it can validate page content before it drops it 3. pass poison value to host so it can init allocated pages with that value In fact one nice side effect would be that unmap becomes safe even though free list is not locked anymore. It would be interesting to see whether this last has any value performance-wise.> +static int init_vqs(struct virtio_balloon *vb) > +{ > + struct virtqueue **vqs; > + vq_callback_t **callbacks; > + const char **names; > + struct scatterlist sg; > + int i, nvqs, err = -ENOMEM; > + > + /* Inflateq and deflateq are used unconditionally */ > + nvqs = 2; > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) > + nvqs++; > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) > + nvqs++; > + > + /* Allocate space for find_vqs parameters */ > + vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL); > + if (!vqs) > + goto err_vq; > + callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL); > + if (!callbacks) > + goto err_callback; > + names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL); > + if (!names) > + goto err_names; > + > + callbacks[0] = balloon_ack; > + names[0] = "inflate"; > + callbacks[1] = balloon_ack; > + names[1] = "deflate"; > + > + i = 2; > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > + callbacks[i] = stats_request; > + names[i] = "stats"; > + i++; > + } > + > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { > + callbacks[i] = NULL; > + names[i] = "free_page_vq"; > + } > + > + err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names, > + NULL, NULL); > if (err) > - return err; > + goto err_find; > > vb->inflate_vq = vqs[0]; > vb->deflate_vq = vqs[1]; > + i = 2; > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { > - struct scatterlist sg; > - unsigned int num_stats; > - vb->stats_vq = vqs[2]; > - > + vb->stats_vq = vqs[i++]; > /* > * Prime this virtqueue with one buffer so the hypervisor can > * use it to signal us later (it can't be broken yet!). > */ > - num_stats = update_balloon_stats(vb); > - > - sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats); > + sg_init_one(&sg, vb->stats, sizeof(vb->stats)); > if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL) > - < 0) > - BUG(); > + < 0) { > + dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n", > + __func__); > + goto err_find; > + } > virtqueue_kick(vb->stats_vq); > } > + > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) > + vb->free_page_vq = vqs[i]; > + > + kfree(names); > + kfree(callbacks); > + kfree(vqs); > return 0; > + > +err_find: > + kfree(names); > +err_names: > + kfree(callbacks); > +err_callback: > + kfree(vqs); > +err_vq: > + return err; > } > > #ifdef CONFIG_BALLOON_COMPACTION > @@ -761,6 +942,13 @@ static int virtballoon_probe(struct virtio_device *vdev) > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) > xb_init(&vb->page_xb); > > + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) { > + vb->balloon_wq = alloc_workqueue("balloon-wq", > + WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0); > + INIT_WORK(&vb->report_free_page_work, report_free_page); > + vb->report_free_page_stop = true; > + } > + > vb->nb.notifier_call = virtballoon_oom_notify; > vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; > err = register_oom_notifier(&vb->nb); > @@ -825,6 +1013,7 @@ static void virtballoon_remove(struct virtio_device *vdev) > spin_unlock_irq(&vb->stop_update_lock); > cancel_work_sync(&vb->update_balloon_size_work); > cancel_work_sync(&vb->update_balloon_stats_work); > + cancel_work_sync(&vb->report_free_page_work); > > remove_common(vb); > #ifdef CONFIG_BALLOON_COMPACTION > @@ -878,6 +1067,7 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > VIRTIO_BALLOON_F_SG, > + VIRTIO_BALLOON_F_FREE_PAGE_VQ, > }; > > static struct virtio_driver virtio_balloon_driver = { > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index 37780a7..b758484 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -35,15 +35,26 @@ > #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ > #define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ > +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ 4 /* VQ to report free pages */ > > /* Size of a PFN in the balloon interface. */ > #define VIRTIO_BALLOON_PFN_SHIFT 12 > > +#define VIRTIO_BALLOON_CMD_BALLOON_MEMORY (1 << 0) > +#define VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START (1 << 1) > +#define VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP (1 << 2) > + > struct virtio_balloon_config { > /* Number of pages host wants Guest to give up. */ > __u32 num_pages; > /* Number of pages we've actually got in balloon. */ > __u32 actual; > + /* Host-to-guest command, readonly by guest */ > + __u32 host2guest_cmd; > + /* Sequence id of the free_page report command, readonly by guest */ > + __u32 free_page_cmd_id; > + /* Guest-to-host command */ > + __u32 guest2host_cmd; > }; > > #define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */ > -- > 2.7.4
Reasonably Related Threads
- [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
- [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
- [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
- [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
- [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG