thr3ads.net - Linux Virtualization - [PATCH v20 0/7] Virtio-balloon Enhancement [Dec 2017]

If this information is useful, please help other people find it:
Share via:

Wei Wang

2017-Dec-19 12:17 UTC

[PATCH v20 0/7] Virtio-balloon Enhancement

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one array each time; and
2) free page block reporting: a new virtqueue to report guest free pages
to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

The second feature enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It is not concerned that the memory pages are used after they
are given to the hypervisor as a hint of the free pages, because they will
be tracked by the hypervisor and transferred in the next round if they are
used and written.

ChangeLog:
v19->v20:
1) patch 1: xbitmap
	- add __rcu to "void **slot";
	- remove the exceptional path.
2) patch 3: xbitmap
	- DeveloperNotes: add an item to comment that the current bit range
	  related APIs operating on extremely large ranges (e.g.
          [0, ULONG_MAX)) will take too long time. This can be optimized in
	  the future.
	- remove the exceptional path;
	- remove xb_preload_and_set();
	- reimplement xb_clear_bit_range to make its usage close to
	  bitmap_clear;
	- rename xb_find_next_set_bit to xb_find_set, and re-implement it
	  in a style close to find_next_bit;
	- rename xb_find_next_zero_bit to xb_find_clear, and re-implement
	  it in a stytle close to find_next_zero_bit;
	- separate the implementation of xb_find_set and xb_find_clear for
	  the convenience of future updates.
3) patch 4: virtio-balloon
	- xb_set_page: change the way to call xb_ related APIs

v18->v19:
1) patch 3:
	- xb_clear_bit_range and xb_find_next_bit will deal with range [start,
	  end), where end is changed to be exclusive of the range.
	- add overflow checks at the end of xb_clear_bit_range and
	  xb_find_next_bit 
	- add overflow related test cases
2) patch 4:
	- change back to the previous add_one_sg methond, which is based on the
	  scatterlist struct
        - tell_host_sgs: use "uint64_t len" to avoid overflow
	- batch_balloon_page_sg: a simpler function to implement the batching of
	  sgs
3) patch 6: batch_free_page_sg: batch sgs using the previous scatterlist struct
4) patch 7: add a config field, poison_val, to tell the host about the poison
	    value
v17->v18:
1) patch 1-2: new to solve some tools related compilation issues
2) patch 3: revert to the original xbitmap implementation from Matthew
Wilcox with some minor changes (e.g. comments added to the exported
functions)
3) patch 4: summarize the changes we want to make to patch 3
4) patch 5: add the developer notes as a reminder for users to avoid
concurrent accesses to the ida bitmap
5) patch 6: a new vring API to allow users to directly pass in a physical
address to a vring desc
6) patch 7: ballooning time is reduced from ~490ms to ~440ms with the new
implementation
	- use the new API from patch 6 to send balloon pages
	- xb_preload with "GFP_NOWAIT | __GFP_NOWARN" flag;
	- handle the case when xb_set_page() fails to avoid memory leak;
	- put xb_set_page() under the balloon lock
7) patch 9: simper implementation
	- start free page reporting by sending a new cmd id from the host
	- guest acks the start or stop via adding a cmd id to the free page vq
	- use vb->report_free_page, instead of vb->report_free_page_stop
	- use WRITE_ONCE/READ_ONCE to access vb->report_free_page
	- use the new API from patch 6 to send free pages to avoid the
	  unnecessary use of kaddr.
8) patch 10: new patch to solve the page posioning issue reported by
Michael S. Tsirkin 

v16->v17:
1) patch 1: please check the commit log there;
2) patch 3: included Michael S. Tsirkin patch to fix the potential
deadlock issue;
3) patch 4: use BUG_ON if virtqueue_add_ returns error, which is
expected never to happen;
4) patch 4: add leak_balloon_sg_oom, which is used in the oom case when
VIRTIO_BALLOON_F_SG is in use;
5) patch 6: use config registers, instead of a vq, as the command channel
between the host and guest;
6) patch 6: add the command sequence id support.

v15->v16:
1) mm: stop reporting the free pfn range if the callback returns false;
2) mm: move some implementaion of walk_free_mem_block into a function to
make the code layout looks better;
3) xbitmap: added some optimizations suggested by Matthew, please refer to
the ChangLog in the xbitmap patch for details.
4) xbitmap: added a test suite
5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf returns
an error
6) virtio-balloon: some small code re-arrangement, e.g. detachinf used buf
from the vq before adding a new buf

v14->v15:
1) mm: make the report callback return a bool value - returning 1 to stop
walking through the free page list.
2) virtio-balloon: batching sgs of balloon pages till the vq is full
3) virtio-balloon: create a new workqueue, rather than using the default
system_wq, to queue the free page reporting work item.
4) virtio-balloon: add a ctrl_vq to be a central control plane which will
handle all the future control related commands between the host and guest.
Add free page report as the first feature controlled under ctrl_vq, and
the free_page_vq is a data plane vq dedicated to the transmission of free
page blocks.

v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page blcok to the device, if the
vq is full (less likey to happen in practice), just skip reporting this
block, instead of busywaiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the the free page blocks from the
report function. This avoids exposing the zone internal to a kernel
module.
2) virtio-balloon: send balloon pages or a free page block using a single
sg each time. This has the benefits of simpler implementation with no new
APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input
sgs, and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring
desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE()
macro to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages
blocks directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3)  virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (1):
  xbitmap: Introduce xbitmap

Wei Wang (6):
  xbitmap: potential improvement
  xbitmap: add more operations
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  virtio-balloon: don't report free pages when page poisoning is enabled

 drivers/virtio/virtio_balloon.c          | 444 +++++++++++++++++++++++++++----
 include/linux/mm.h                       |   6 +
 include/linux/radix-tree.h               |   2 +
 include/linux/xbitmap.h                  |  55 ++++
 include/uapi/linux/virtio_balloon.h      |   7 +
 lib/Makefile                             |   2 +-
 lib/radix-tree.c                         |  40 ++-
 lib/xbitmap.c                            | 330 +++++++++++++++++++++++
 mm/page_alloc.c                          |  91 +++++++
 tools/include/linux/bitmap.h             |  34 +++
 tools/include/linux/kernel.h             |   2 +
 tools/testing/radix-tree/Makefile        |  12 +-
 tools/testing/radix-tree/linux/xbitmap.h |   1 +
 tools/testing/radix-tree/main.c          |   4 +
 tools/testing/radix-tree/test.h          |   1 +
 15 files changed, 976 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c
 create mode 100644 tools/testing/radix-tree/linux/xbitmap.h

-- 
2.7.4

Wei Wang

2017-Dec-19 12:17 UTC

head link

[PATCH v20 1/7] xbitmap: Introduce xbitmap

From: Matthew Wilcox <mawilcox at microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

Signed-off-by: Wei Wang <wei.w.wang at intel.com>
Cc: Matthew Wilcox <mawilcox at microsoft.com>
Cc: Andrew Morton <akpm at linux-foundation.org>
Cc: Michal Hocko <mhocko at kernel.org>
Cc: Michael S. Tsirkin <mst at redhat.com>
Cc: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp>
---
 include/linux/radix-tree.h               |   2 +
 include/linux/xbitmap.h                  |  49 ++++++++++++
 lib/Makefile                             |   2 +-
 lib/radix-tree.c                         |  25 +++++-
 lib/xbitmap.c                            | 130 +++++++++++++++++++++++++++++++
 tools/testing/radix-tree/Makefile        |  12 ++-
 tools/testing/radix-tree/linux/xbitmap.h |   1 +
 tools/testing/radix-tree/main.c          |   4 +
 tools/testing/radix-tree/test.h          |   1 +
 9 files changed, 221 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c
 create mode 100644 tools/testing/radix-tree/linux/xbitmap.h

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 23a9c89..5c16179a 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -315,6 +315,8 @@ void radix_tree_iter_delete(struct radix_tree_root *,
 			struct radix_tree_iter *iter, void __rcu **slot);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+bool __radix_tree_delete(struct radix_tree_root *r, struct radix_tree_node *n,
+				void __rcu **slot);
 void radix_tree_clear_tags(struct radix_tree_root *, struct radix_tree_node *,
 			   void __rcu **slot);
 unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..4ac2b8d
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,49 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox at microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+
+static inline bool xb_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t);
+
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
diff --git a/lib/Makefile b/lib/Makefile
index d11c48e..08a8183 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -19,7 +19,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c8d5556..2650e9e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -37,7 +37,7 @@
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
 #include <linux/string.h>
-
+#include <linux/xbitmap.h>
 
 /* Number of nodes in fully populated tree of given height */
 static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
@@ -77,6 +77,11 @@ static struct kmem_cache *radix_tree_node_cachep;
 						RADIX_TREE_MAP_SHIFT))
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+						RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
 /*
  * Per-cpu pool of preloaded nodes
  */
@@ -839,6 +844,8 @@ int __radix_tree_create(struct radix_tree_root *root,
unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1982,7 +1989,7 @@ void __radix_tree_delete_node(struct radix_tree_root
*root,
 	delete_node(root, node, update_node);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
+bool __radix_tree_delete(struct radix_tree_root *root,
 				struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
@@ -2135,6 +2142,20 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
 }
 EXPORT_SYMBOL(ida_pre_get);
 
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
 void __rcu **idr_get_free_cmn(struct radix_tree_root *root,
 			      struct radix_tree_iter *iter, gfp_t gfp,
 			      unsigned long max)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..236afa9
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,130 @@
+#include <linux/export.h>
+#include <linux/xbitmap.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, the per-cpu ida_bitmap will be taken.
+ *
+ * Returns: 0 on success. -EAGAIN or -ENOMEM indicates that @bit is not set.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+
+	bit %= IDA_BITMAP_BITS;
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (!bitmap) {
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+
+	bit %= IDA_BITMAP_BITS;
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ *
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+#ifndef __KERNEL__
+
+static DEFINE_XB(xb1);
+
+void xbitmap_check_bit(unsigned long bit)
+{
+	xb_preload(GFP_KERNEL);
+	assert(!xb_test_bit(&xb1, bit));
+	assert(xb_set_bit(&xb1, bit) == 0);
+	assert(xb_test_bit(&xb1, bit));
+	assert(xb_clear_bit(&xb1, bit) == 0);
+	assert(xb_empty(&xb1));
+	assert(xb_clear_bit(&xb1, bit) == 0);
+	assert(xb_empty(&xb1));
+	xb_preload_end();
+}
+
+void xbitmap_checks(void)
+{
+	xb_init(&xb1);
+	xbitmap_check_bit(0);
+	xbitmap_check_bit(30);
+	xbitmap_check_bit(31);
+	xbitmap_check_bit(1023);
+	xbitmap_check_bit(1024);
+	xbitmap_check_bit(1025);
+	xbitmap_check_bit((1UL << 63) | (1UL << 24));
+	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+}
+
+int __weak main(void)
+{
+	radix_tree_init();
+	xbitmap_checks();
+}
+#endif
diff --git a/tools/testing/radix-tree/Makefile
b/tools/testing/radix-tree/Makefile
index fa7ee36..34ece78 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -6,7 +6,8 @@ LDLIBS+= -lpthread -lurcu
 TARGETS = main idr-test multiorder
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
-	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+	 xbitmap.o
 
 ifndef SHIFT
 	SHIFT=3
@@ -25,8 +26,11 @@ idr-test: idr-test.o $(CORE_OFILES)
 
 multiorder: multiorder.o $(CORE_OFILES)
 
+xbitmap: xbitmap.o $(CORE_OFILES)
+	$(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
+
 clean:
-	$(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h
+	$(RM) $(TARGETS) *.o radix-tree.c idr.c xbitmap.c generated/map-shift.h
 
 vpath %.c ../../lib
 
@@ -34,6 +38,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
 	../../include/linux/*.h \
 	../../include/asm/*.h \
 	../../../include/linux/radix-tree.h \
+	../../../include/linux/xbitmap.h \
 	../../../include/linux/idr.h
 
 radix-tree.c: ../../../lib/radix-tree.c
@@ -42,6 +47,9 @@ radix-tree.c: ../../../lib/radix-tree.c
 idr.c: ../../../lib/idr.c
 	sed -e 's/^static //' -e 's/__always_inline //' -e
's/inline //' < $< > $@
 
+xbitmap.c: ../../../lib/xbitmap.c
+	sed -e 's/^static //' -e 's/__always_inline //' -e
's/inline //' < $< > $@
+
 .PHONY: mapshift
 
 mapshift:
diff --git a/tools/testing/radix-tree/linux/xbitmap.h
b/tools/testing/radix-tree/linux/xbitmap.h
new file mode 100644
index 0000000..61de214
--- /dev/null
+++ b/tools/testing/radix-tree/linux/xbitmap.h
@@ -0,0 +1 @@
+#include "../../../../include/linux/xbitmap.h"
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index 257f3f8..d112363 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -326,6 +326,10 @@ static void single_thread_tests(bool long_run)
 	rcu_barrier();
 	printv(2, "after idr_checks: %d allocated, preempt %d\n",
 		nr_allocated, preempt_count);
+	xbitmap_checks();
+	rcu_barrier();
+	printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+			nr_allocated, preempt_count);
 	big_gang_check(long_run);
 	rcu_barrier();
 	printv(2, "after big_gang_check: %d allocated, preempt %d\n",
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index d9c031d..8175d6b 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -38,6 +38,7 @@ void benchmark(void);
 void idr_checks(void);
 void ida_checks(void);
 void ida_thread_tests(void);
+void xbitmap_checks(void);
 
 struct item *
 item_tag_set(struct radix_tree_root *root, unsigned long index, int tag);
-- 
2.7.4

Wei Wang

2017-Dec-19 12:17 UTC

head link

[PATCH v20 2/7] xbitmap: potential improvement

This patch made some changes to the original xbitmap implementation from
the linux-dax tree:

- xb_set_bit: delete the new inserted radix_tree_node when failing to
  get the per cpu ida bitmap, this avoids the kind of memory leak of the
  unused radix tree node left in the tree.

- xb_preload: with the original implementation, the CPU that successfully
  do __radix_tree_preload() may get into sleep by kmalloc(), which has a
  risk of getting the caller of xb_preload() scheduled to another CPU
  after waken up, and the new CPU may not have radix_tree_node
  pre-allocated there, this will be a problem when inserting a node to
  the tree later. This patch moves __radix_tree_preload() after kmalloc()
  and returns a boolean to indicate the success or failure. Also, add the
  __must_check annotation to xb_preload for prudence purpose.

Signed-off-by: Wei Wang <wei.w.wang at intel.com>
Cc: Matthew Wilcox <mawilcox at microsoft.com>
Cc: Andrew Morton <akpm at linux-foundation.org>
Cc: Michal Hocko <mhocko at kernel.org>
Cc: Michael S. Tsirkin <mst at redhat.com>
Cc: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp>
---
 include/linux/xbitmap.h |  2 +-
 lib/radix-tree.c        | 21 ++++++++++++++++++---
 lib/xbitmap.c           |  4 +++-
 3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 4ac2b8d..108f929 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -41,7 +41,7 @@ static inline bool xb_empty(const struct xb *xb)
 	return radix_tree_empty(&xb->xbrt);
 }
 
-void xb_preload(gfp_t);
+int xb_preload(gfp_t);
 
 static inline void xb_preload_end(void)
 {
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 2650e9e..f30347a 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2142,17 +2142,32 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
 }
 EXPORT_SYMBOL(ida_pre_get);
 
-void xb_preload(gfp_t gfp)
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). On success,
+ * return zero, with preemption disabled. On error, return -ENOMEM with
+ * preemption not disabled.
+ */
+__must_check int xb_preload(gfp_t gfp)
 {
-	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
 	if (!this_cpu_read(ida_bitmap)) {
 		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
 
 		if (!bitmap)
-			return;
+			return -ENOMEM;
+		/*
+		 * The per-CPU variable is updated with preemption enabled.
+		 * If the calling task is unlucky to be scheduled to another
+		 * CPU which has no ida_bitmap allocation, it will be detected
+		 * when setting a bit (i.e. xb_set_bit()).
+		 */
 		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
 		kfree(bitmap);
 	}
+
+	return __radix_tree_preload(gfp, XB_PRELOAD_SIZE);
 }
 EXPORT_SYMBOL(xb_preload);
 
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 236afa9..2dcfad5 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -29,8 +29,10 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
 	bitmap = rcu_dereference_raw(*slot);
 	if (!bitmap) {
 		bitmap = this_cpu_xchg(ida_bitmap, NULL);
-		if (!bitmap)
+		if (!bitmap) {
+			__radix_tree_delete(root, node, slot);
 			return -EAGAIN;
+		}
 		memset(bitmap, 0, sizeof(*bitmap));
 		__radix_tree_replace(root, node, slot, bitmap, NULL);
 	}
-- 
2.7.4

Wei Wang

2017-Dec-19 12:17 UTC

head link

[PATCH v20 3/7] xbitmap: add more operations

This patch adds support to find next 1 or 0 bit in a xbmitmap range and
clear a range of bits.

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits.
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add tags to help searching.

Signed-off-by: Wei Wang <wei.w.wang at intel.com>
Cc: Matthew Wilcox <mawilcox at microsoft.com>
Cc: Andrew Morton <akpm at linux-foundation.org>
Cc: Michal Hocko <mhocko at kernel.org>
Cc: Michael S. Tsirkin <mst at redhat.com>
Cc: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp>
Suggested-by: Matthew Wilcox <mawilcox at microsoft.com>
---
 include/linux/xbitmap.h      |   6 ++
 lib/xbitmap.c                | 198 +++++++++++++++++++++++++++++++++++++++++++
 tools/include/linux/bitmap.h |  34 ++++++++
 tools/include/linux/kernel.h |   2 +
 4 files changed, 240 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 108f929..ede1029 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -35,6 +35,12 @@ static inline void xb_init(struct xb *xb)
 int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit_range(struct xb *xb, unsigned long start,
+			unsigned long nbits);
+unsigned long xb_find_set(struct xb *xb, unsigned long size,
+			  unsigned long offset);
+unsigned long xb_find_zero(struct xb *xb, unsigned long size,
+			   unsigned long offset);
 
 static inline bool xb_empty(const struct xb *xb)
 {
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 2dcfad5..bb0a5b2 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -3,6 +3,16 @@
 #include <linux/bitmap.h>
 #include <linux/slab.h>
 
+/*
+ * Developer notes:
+ * - locks are required to gurantee there is no concurrent
+ *   calls of xb_set_bit, xb_clear_bit, xb_clear_bit_range, xb_test_bit,
+ *   xb_find_set, or xb_find_clear to operate on the same ida bitmap.
+ * - The current implementation of xb_clear_bit_range, xb_find_set and
+ *   xb_find_clear may cause long latency when the bit range to operate
+ *   on is super large (e.g. [0, ULONG_MAX)).
+ */
+
 /**
  *  xb_set_bit - set a bit in the xbitmap
  *  @xb: the xbitmap tree used to record the bit
@@ -72,6 +82,49 @@ void xb_clear_bit(struct xb *xb, unsigned long bit)
 EXPORT_SYMBOL(xb_clear_bit);
 
 /**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @start: the start of the bit range, inclusive
+ * @nbits: number of bits to clear
+ *
+ * This function is used to clear a range of bits in the xbitmap. If all the
+ * bits in the bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start,
+			unsigned long nbits)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long index = start / IDA_BITMAP_BITS;
+	unsigned long bit = start % IDA_BITMAP_BITS;
+
+	if (nbits > ULONG_MAX - start)
+		nbits = ULONG_MAX - start;
+
+	while (nbits) {
+		unsigned int __nbits = min(nbits,
+					(unsigned long)IDA_BITMAP_BITS - bit);
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (bitmap) {
+			if (__nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, __nbits);
+
+			if (__nbits == IDA_BITMAP_BITS ||
+			    bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+		bit = 0;
+		index++;
+		nbits -= __nbits;
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
  * xb_test_bit - test a bit in the xbitmap
  * @xb: the xbitmap tree used to record the bit
  * @bit: index of the bit to test
@@ -94,6 +147,99 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit)
 }
 EXPORT_SYMBOL(xb_test_bit);
 
+/**
+ * xb_find_set - find the next set bit in a range of bits
+ * @xb: the xbitmap to search from
+ * @offset: the offset in the range to start searching
+ * @size: the size of the range
+ *
+ * Returns: the found bit or, @size if no set bit is found.
+ */
+unsigned long xb_find_set(struct xb *xb, unsigned long size,
+			  unsigned long offset)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long index = offset / IDA_BITMAP_BITS;
+	unsigned long index_end = size / IDA_BITMAP_BITS;
+	unsigned long bit = offset % IDA_BITMAP_BITS;
+
+	if (unlikely(offset >= size))
+		return size;
+
+	while (index <= index_end) {
+		unsigned long ret;
+		unsigned int nbits = size - index * IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (!node) {
+			index = (index | RADIX_TREE_MAP_MASK) + 1;
+			continue;
+		}
+
+		if (bitmap) {
+			if (nbits > IDA_BITMAP_BITS)
+				nbits = IDA_BITMAP_BITS;
+
+			ret = find_next_bit(bitmap->bitmap, nbits, bit);
+			if (ret != nbits)
+				return ret + index * IDA_BITMAP_BITS;
+		}
+		bit = 0;
+		index++;
+	}
+
+	return size;
+}
+EXPORT_SYMBOL(xb_find_set);
+
+/**
+ * xb_find_zero - find the next zero bit in a range of bits
+ * @xb: the xbitmap to search from
+ * @offset: the offset in the range to start searching
+ * @size: the size of the range
+ *
+ * Returns: the found bit or, @size if no zero bit is found.
+ */
+unsigned long xb_find_zero(struct xb *xb, unsigned long size,
+			   unsigned long offset)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void __rcu **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long index = offset / IDA_BITMAP_BITS;
+	unsigned long index_end = size / IDA_BITMAP_BITS;
+	unsigned long bit = offset % IDA_BITMAP_BITS;
+
+	if (unlikely(offset >= size))
+		return size;
+
+	while (index <= index_end) {
+		unsigned long ret;
+		unsigned int nbits = size - index * IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (bitmap) {
+			if (nbits > IDA_BITMAP_BITS)
+				nbits = IDA_BITMAP_BITS;
+
+			ret = find_next_zero_bit(bitmap->bitmap, nbits, bit);
+			if (ret != nbits)
+				return ret + index * IDA_BITMAP_BITS;
+		} else {
+			return bit + index * IDA_BITMAP_BITS;
+		}
+		bit = 0;
+		index++;
+	}
+
+	return size;
+}
+EXPORT_SYMBOL(xb_find_zero);
+
 #ifndef __KERNEL__
 
 static DEFINE_XB(xb1);
@@ -111,6 +257,56 @@ void xbitmap_check_bit(unsigned long bit)
 	xb_preload_end();
 }
 
+static void xbitmap_check_bit_range(void)
+{
+	/*
+	 * Regular tests
+	 * set bit 2000, 2001, 2040
+	 * Next 1 in [0, 2048)		--> 2000
+	 * Next 1 in [2000, 2002)	--> 2000
+	 * Next 1 in [2002, 2041)	--> 2040
+	 * Next 1 in [2002, 2040)	--> none
+	 * Next 0 in [2000, 2048)	--> 2002
+	 * Next 0 in [2048, 2060)	--> 2048
+	 */
+	xb_preload(GFP_KERNEL);
+	assert(!xb_set_bit(&xb1, 2000));
+	assert(!xb_set_bit(&xb1, 2001));
+	assert(!xb_set_bit(&xb1, 2040));
+	assert(xb_find_set(&xb1, 2048, 0) == 2000);
+	assert(xb_find_set(&xb1, 2002, 2000) == 2000);
+	assert(xb_find_set(&xb1, 2041, 2002) == 2040);
+	assert(xb_find_set(&xb1, 2040, 2002) == 2040);
+	assert(xb_find_zero(&xb1, 2048, 2000) == 2002);
+	assert(xb_find_zero(&xb1, 2060, 2048) == 2048);
+	xb_clear_bit_range(&xb1, 0, 2048);
+	assert(xb_find_set(&xb1, 2048, 0) == 2048);
+	xb_preload_end();
+
+	/*
+	 * Overflow tests:
+	 * Set bit 1 and ULONG_MAX - 4
+	 * Next 1 in [ULONG_MAX - 4, ULONG_MAX)		--> ULONG_MAX - 4
+	 * Next 1 [ULONG_MAX - 3, ULONG_MAX + 4)	--> none
+	 * Next 0 [ULONG_MAX - 4, ULONG_MAX + 4)	--> none
+	 */
+	xb_preload(GFP_KERNEL);
+	assert(!xb_set_bit(&xb1, 1));
+	xb_preload_end();
+	xb_preload(GFP_KERNEL);
+	assert(!xb_set_bit(&xb1, ULONG_MAX - 4));
+	assert(xb_find_set(&xb1, ULONG_MAX, ULONG_MAX - 4) == ULONG_MAX - 4);
+	assert(xb_find_set(&xb1, ULONG_MAX + 4, ULONG_MAX - 3) =+	       ULONG_MAX
+ 4);
+	assert(xb_find_zero(&xb1, ULONG_MAX + 4, ULONG_MAX - 4) =+	      
ULONG_MAX + 4);
+	xb_clear_bit_range(&xb1, ULONG_MAX - 4, 4);
+	assert(xb_find_set(&xb1, ULONG_MAX, ULONG_MAX - 10) == ULONG_MAX);
+	xb_clear_bit_range(&xb1, 0, 2);
+	assert(xb_find_set(&xb1, 2, 0) == 2);
+	xb_preload_end();
+}
+
 void xbitmap_checks(void)
 {
 	xb_init(&xb1);
@@ -122,6 +318,8 @@ void xbitmap_checks(void)
 	xbitmap_check_bit(1025);
 	xbitmap_check_bit((1UL << 63) | (1UL << 24));
 	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+	xbitmap_check_bit_range();
 }
 
 int __weak main(void)
diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index ca16027..8d0bc1b 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -37,6 +37,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
 	}
 }
 
+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+				  int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
+
+static inline __always_inline void bitmap_clear(unsigned long *map,
+						unsigned int start,
+						unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8)
&&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 {
 	unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 0ad8844..3c992ae 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -13,6 +13,8 @@
 #define UINT_MAX	(~0U)
 #endif
 
+#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 
 #define PERF_ALIGN(x, a)	__PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
-- 
2.7.4

Wei Wang

2017-Dec-19 12:17 UTC

head link

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
host.

The implementation of the previous virtio-balloon is not very efficient,
because the balloon pages are transferred to the host by one array each
time. Here is the breakdown of the time in percentage spent on each step
of the balloon inflating process (inflating 7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are stage 2) and stage 4).

This patch optimizes step 2) by transferring pages to host in sgs. An sg
describes a chunk of guest physically continuous pages. With this
mechanism, step 4) can also be optimized by doing address translation and
madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~460ms resulting
in an improvement of ~89%.

TODO:
- optimize stage 1) by allocating/freeing a chunk of pages instead of a
single page each time.
- sort the internal balloon page queue.

Signed-off-by: Wei Wang <wei.w.wang at intel.com>
Signed-off-by: Liang Li <liang.z.li at intel.com>
Suggested-by: Michael S. Tsirkin <mst at redhat.com>
Cc: Tetsuo Handa <penguin-kernel at I-love.SAKURA.ne.jp>
---
 drivers/virtio/virtio_balloon.c     | 234 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 217 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a1fb52c..fff0a3f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,129 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
+{
+	struct scatterlist sg;
+	unsigned int unused;
+	int err;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	err = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the sg will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
+}
+
+static void batch_balloon_page_sg(struct virtio_balloon *vb,
+				  struct virtqueue *vq,
+				  unsigned long pfn,
+				  uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which
correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long pfn_start, pfn_end;
+	uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
+	uint64_t len;
+
+	pfn_start = page_xb_start;
+	while (pfn_start < page_xb_end) {
+		pfn_start = xb_find_set(&vb->page_xb, page_xb_end + 1,
+					pfn_start);
+		if (pfn_start == page_xb_end + 1)
+			break;
+		pfn_end = xb_find_zero(&vb->page_xb, page_xb_end + 1,
+				       pfn_start);
+		len = (pfn_end - pfn_start) << PAGE_SHIFT;
+		while (len > max_len) {
+			batch_balloon_page_sg(vb, vq, pfn_start, max_len);
+			pfn_start += max_len >> PAGE_SHIFT;
+			len -= max_len;
+		}
+		batch_balloon_page_sg(vb, vq, pfn_start, (uint32_t)len);
+		pfn_start = pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
+static inline int xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int ret;
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+
+	do {
+		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
+			return -ENOMEM;
+
+		ret = xb_set_bit(&vb->page_xb, pfn);
+		xb_preload_end();
+	} while (unlikely(ret == -EAGAIN));
+
+	return ret;
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	unsigned num_allocated_pages;
 	unsigned num_pfns;
 	struct page *page;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (num_pfns = 0; num_pfns < num;
 	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
@@ -173,8 +292,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb,
size_t num)
 
 	while ((page = balloon_page_pop(&pages))) {
 		balloon_page_enqueue(&vb->vb_dev_info, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				__free_page(page);
+				continue;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -184,8 +310,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb,
size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -211,9 +341,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb,
size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -223,7 +356,14 @@ static unsigned leak_balloon(struct virtio_balloon *vb,
size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				balloon_page_enqueue(&vb->vb_dev_info, page);
+				break;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -234,13 +374,55 @@ static unsigned leak_balloon(struct virtio_balloon *vb,
size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
 
+/*
+ * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
+ * for xbitmap, which is not suitable for the oom case. This function does not
+ * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
+ * pages when VIRTIO_BALLOON_F_SG is negotiated.
+ */
+static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
+{
+	unsigned int n;
+	struct page *page;
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	struct virtqueue *vq = vb->deflate_vq;
+	LIST_HEAD(pages);
+
+	mutex_lock(&vb->balloon_lock);
+	for (n = 0; n < oom_pages; n++) {
+		page = balloon_page_dequeue(vb_dev_info);
+		if (!page)
+			break;
+
+		list_add(&page->lru, &pages);
+		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+		batch_balloon_page_sg(vb, vb->deflate_vq, page_to_pfn(page),
+				      PAGE_SIZE);
+		release_pages_balloon(vb, &pages);
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+	mutex_unlock(&vb->balloon_lock);
+
+	return n;
+}
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
@@ -380,7 +562,10 @@ static int virtballoon_oom_notify(struct notifier_block
*self,
 		return NOTIFY_OK;
 
 	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
+		num_freed_pages = leak_balloon_sg_oom(vb);
+	else
+		num_freed_pages = leak_balloon(vb, oom_pages);
 	update_balloon_size(vb);
 	*freed += num_freed_pages;
 
@@ -477,6 +662,7 @@ static int virtballoon_migratepage(struct balloon_dev_info
*vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -498,16 +684,24 @@ static int virtballoon_migratepage(struct balloon_dev_info
*vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->inflate_vq, page_to_pfn(newpage), PAGE_SIZE);
+		kick_and_wait(vb->inflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->deflate_vq, page_to_pfn(page), PAGE_SIZE);
+		kick_and_wait(vb->deflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -566,6 +760,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -682,6 +879,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

Wei Wang

2017-Dec-19 12:17 UTC

head link

[PATCH v20 5/7] mm: support reporting free page blocks

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.

Signed-off-by: Wei Wang <wei.w.wang at intel.com>
Signed-off-by: Liang Li <liang.z.li at intel.com>
Cc: Michal Hocko <mhocko at kernel.org>
Cc: Michael S. Tsirkin <mst at redhat.com>
Acked-by: Michal Hocko <mhocko at kernel.org>
---
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea818ff..b3077dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1938,6 +1938,12 @@ extern void free_area_init_node(int nid, unsigned long *
zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque,
+				int min_order,
+				bool (*report_pfn_range)(void *opaque,
+							 unsigned long pfn,
+							 unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e5e775..f074503 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4899,6 +4899,97 @@ void show_free_areas(unsigned int filter, nodemask_t
*nodemask)
 	show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+				struct zone *zone,
+				int order,
+				enum migratetype mt,
+				bool (*report_pfn_range)(void *,
+							 unsigned long,
+							 unsigned long))
+{
+	struct page *page;
+	struct list_head *list;
+	unsigned long pfn, flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list = &zone->free_area[order].free_list[mt];
+	list_for_each_entry(page, list, lru) {
+		pfn = page_to_pfn(page);
+		ret = report_pfn_range(opaque, pfn, 1 << order);
+		if (!ret)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+			 int min_order,
+			 bool (*report_pfn_range)(void *opaque,
+						  unsigned long pfn,
+						  unsigned long num))
+{
+	struct zone *zone;
+	int order;
+	enum migratetype mt;
+	bool ret;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				ret = walk_free_page_list(opaque, zone,
+							  order, mt,
+							  report_pfn_range);
+				if (!ret)
+					return;
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

Wei Wang

2017-Dec-19 12:17 UTC

head link

[PATCH v20 6/7] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.

Host requests the guest to report free pages by sending a new cmd
id to the guest via the free_page_report_cmd_id configuration register.

When the guest starts to report, the first element added to the free page
vq is the cmd id given by host. When the guest finishes the reporting
of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added
to the vq to tell host that the reporting is done. Host may also requests
the guest to stop the reporting in advance by sending the stop cmd id to
the guest via the configuration register.

Signed-off-by: Wei Wang <wei.w.wang at intel.com>
Signed-off-by: Liang Li <liang.z.li at intel.com>
Cc: Michael S. Tsirkin <mst at redhat.com>
Cc: Michal Hocko <mhocko at kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 202 ++++++++++++++++++++++++++++++------
 include/uapi/linux/virtio_balloon.h |   4 +
 2 files changed, 174 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fff0a3f..eae65c1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The free page reporting work item submitted to the balloon wq */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +70,13 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Start to report free pages */
+	bool report_free_page;
+	/* Stores the cmd id given by host to start the free page reporting */
+	uint32_t start_cmd_id;
+	/* Stores STOP_ID as a sign to tell host that the reporting is done */
+	uint32_t stop_cmd_id;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -189,6 +201,28 @@ static void batch_balloon_page_sg(struct virtio_balloon
*vb,
 		kick_and_wait(vq, vb->acked);
 }
 
+static void batch_free_page_sg(struct virtqueue *vq,
+			       unsigned long pfn,
+			       uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		virtqueue_kick(vq);
+}
+
+static void send_cmd_id(struct virtio_balloon *vb, void *addr)
+{
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, addr, sizeof(uint32_t));
+	err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb, GFP_KERNEL);
+	BUG_ON(err);
+	virtqueue_kick(vb->free_page_vq);
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -498,17 +532,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-	struct virtio_balloon *vb = vdev->priv;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vb->stop_update_lock, flags);
-	if (!vb->stop_update)
-		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	s64 target;
@@ -525,6 +548,36 @@ static inline s64 towards_target(struct virtio_balloon *vb)
 	return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+	unsigned long flags;
+	__u32 cmd_id;
+	s64 diff = towards_target(vb);
+
+	if (diff) {
+		spin_lock_irqsave(&vb->stop_update_lock, flags);
+		if (!vb->stop_update)
+			queue_work(system_freezable_wq,
+				   &vb->update_balloon_size_work);
+		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+	}
+
+	virtio_cread(vb->vdev, struct virtio_balloon_config,
+		     free_page_report_cmd_id, &cmd_id);
+	if (cmd_id == VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) {
+		WRITE_ONCE(vb->report_free_page, false);
+	} else if (cmd_id != vb->start_cmd_id) {
+		/*
+		 * Host requests to start the reporting by sending a new cmd
+		 * id.
+		 */
+		WRITE_ONCE(vb->report_free_page, true);
+		vb->start_cmd_id = cmd_id;
+		queue_work(vb->balloon_wq, &vb->report_free_page_work);
+	}
+}
+
 static void update_balloon_size(struct virtio_balloon *vb)
 {
 	u32 actual = vb->num_pages;
@@ -602,40 +655,116 @@ static void update_balloon_size_func(struct work_struct
*work)
 
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate",
"deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		vb->free_page_vq = vqs[i];
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
+}
+
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (!READ_ONCE(vb->report_free_page))
+		return false;
+
+	batch_free_page_sg(vb->free_page_vq, pfn, len);
+
+	return true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	/* Start by sending the obtained cmd id to the host with an outbuf */
+	send_cmd_id(vb, &vb->start_cmd_id);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	/*
+	 * End by sending the stop id to the host with an outbuf. Use the
+	 * non-batching mode here to trigger a kick after adding the stop id.
+	 */
+	send_cmd_id(vb, &vb->stop_cmd_id);
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -763,6 +892,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->stop_cmd_id = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -827,6 +963,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -880,6 +1017,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h
b/include/uapi/linux/virtio_balloon.h
index 37780a7..58f1274 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,15 +35,19 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* VQ to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+#define VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID		0
 struct virtio_balloon_config {
 	/* Number of pages host wants Guest to give up. */
 	__u32 num_pages;
 	/* Number of pages we've actually got in balloon. */
 	__u32 actual;
+	/* Free page report command id, readonly by guest */
+	__u32 free_page_report_cmd_id;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

Wei Wang

2017-Dec-19 12:17 UTC

head link

[PATCH v20 7/7] virtio-balloon: don't report free pages when page poisoning is enabled

The guest free pages should not be discarded by the live migration thread
when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
skipping the transfer of such poisoned free pages will trigger false
positive when new pages are allocated and checked on the destination.
This patch adds a config field, poison_val. Guest writes to the config
field to tell the host about the poisoning value. The value will be 0 in
the following cases:
1) PAGE_POISONING_NO_SANITY is enabled;
2) page poisoning is disabled; or
3) PAGE_POISONING_ZERO is enabled.

Signed-off-by: Wei Wang <wei.w.wang at intel.com>
Suggested-by: Michael S. Tsirkin <mst at redhat.com>
Cc: Michal Hocko <mhocko at suse.com>
---
 drivers/virtio/virtio_balloon.c     | 8 ++++++++
 include/uapi/linux/virtio_balloon.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index eae65c1..1fa8598 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -860,6 +860,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
+	__u32 poison_val;
 	int err;
 
 	if (!vdev->config->get) {
@@ -897,6 +898,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
 		INIT_WORK(&vb->report_free_page_work, report_free_page);
 		vb->stop_cmd_id = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+		if (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
+		    !page_poisoning_enabled())
+			poison_val = 0;
+		else
+			poison_val = PAGE_POISON;
+		virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+			      poison_val, &poison_val);
 	}
 
 	vb->nb.notifier_call = virtballoon_oom_notify;
diff --git a/include/uapi/linux/virtio_balloon.h
b/include/uapi/linux/virtio_balloon.h
index 58f1274..f270e9e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -48,6 +48,8 @@ struct virtio_balloon_config {
 	__u32 actual;
 	/* Free page report command id, readonly by guest */
 	__u32 free_page_report_cmd_id;
+	/* Stores PAGE_POISON if page poisoning with sanity check is in use */
+	__u32 poison_val;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

Philippe Ombredanne

2017-Dec-19 15:58 UTC

head link

[PATCH v20 1/7] xbitmap: Introduce xbitmap

Matthew,

On Tue, Dec 19, 2017 at 1:17 PM, Wei Wang <wei.w.wang at intel.com>
wrote:> From: Matthew Wilcox <mawilcox at microsoft.com>
>
> The eXtensible Bitmap is a sparse bitmap representation which is
> efficient for set bits which tend to cluster.  It supports up to
> 'unsigned long' worth of bits, and this commit adds the bare bones
--
> xb_set_bit(), xb_clear_bit() and xb_test_bit().
<snip>
> --- /dev/null
> +++ b/include/linux/xbitmap.h
> @@ -0,0 +1,49 @@
> +/*
> + * eXtensible Bitmaps
> + * Copyright (c) 2017 Microsoft Corporation <mawilcox at
microsoft.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 of the
> + * License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
> + * All bits are initially zero.
> + */
Would you mind using the new SPDX tags documented in Thomas patch set
[1] rather than this fine but longer legalese?

And if you could spread the word to others in your team this would be very nice.

Thank you!

[1] https://lkml.org/lkml/2017/12/4/934

-- 
Cordially
Philippe Ombredanne

Matthew Wilcox

2017-Dec-24 03:21 UTC

head link

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

On Tue, Dec 19, 2017 at 08:17:56PM +0800, Wei Wang
wrote:> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in
the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of
PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which
correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap
that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
I'm not crazy about the naming here.  I'd use pfn_min and pfn_max like
you use in the caller.
> +{
> +	unsigned long pfn_start, pfn_end;
> +	uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
> +	uint64_t len;
> +
> +	pfn_start = page_xb_start;
And I think pfn_start is actually just 'pfn'.

'pfn_end' is perhaps just 'end'.  Or 'gap'.
> +	while (pfn_start < page_xb_end) {
> +		pfn_start = xb_find_set(&vb->page_xb, page_xb_end + 1,
> +					pfn_start);
> +		if (pfn_start == page_xb_end + 1)
> +			break;
> +		pfn_end = xb_find_zero(&vb->page_xb, page_xb_end + 1,
> +				       pfn_start);
> +		len = (pfn_end - pfn_start) << PAGE_SHIFT;
> +static inline int xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
I really don't like it that you're naming things after the 'xb'.
Things should be named by something that makes sense to the user, not
after the particular implementation.  If you changed the underlying
representation from an xbitmap to, say, a BTree, you wouldn't want to
rename this function to 'btree_set_page'.  Maybe this function is really
"vb_set_page".  Or "record_page".  Or something.  Someone
who understands
this driver better than I do can probably weigh in with a better name.
> +	unsigned long pfn = page_to_pfn(page);
> +	int ret;
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +
> +	do {
> +		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
> +			return -ENOMEM;
> +
> +		ret = xb_set_bit(&vb->page_xb, pfn);
> +		xb_preload_end();
> +	} while (unlikely(ret == -EAGAIN));
OK, so you don't need a spinlock because you're under a mutex?  But you
can't allocate memory because you're in the balloon driver, and so a
GFP_KERNEL allocation might recurse into your driver?  Would GFP_NOIO
do the job?  I'm a little hazy on exactly how the balloon driver works.

If you can't preload with anything better than that, I think that
xb_set_bit() should attempt an allocation with GFP_NOWAIT | __GFP_NOWARN,
and then you can skip the preload; it has no value for you.
> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct virtio_balloon
*vb, size_t num)
>  
>  	while ((page = balloon_page_pop(&pages))) {
>  		balloon_page_enqueue(&vb->vb_dev_info, page);
> +		if (use_sg) {
> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> +				__free_page(page);
> +				continue;
> +			}
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}
Is this the right behaviour?  If we can't record the page in the xb,
wouldn't we rather send it across as a single page?

Wei Wang

2017-Dec-24 07:42 UTC

head link

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

On 12/24/2017 12:45 PM, Tetsuo Handa wrote:> Matthew Wilcox wrote:
>>> +	unsigned long pfn = page_to_pfn(page);
>>> +	int ret;
>>> +
>>> +	*pfn_min = min(pfn, *pfn_min);
>>> +	*pfn_max = max(pfn, *pfn_max);
>>> +
>>> +	do {
>>> +		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
>>> +			return -ENOMEM;
>>> +
>>> +		ret = xb_set_bit(&vb->page_xb, pfn);
>>> +		xb_preload_end();
>>> +	} while (unlikely(ret == -EAGAIN));
>> OK, so you don't need a spinlock because you're under a mutex? 
But you
>> can't allocate memory because you're in the balloon driver, and
so a
>> GFP_KERNEL allocation might recurse into your driver?
> Right. We can't (directly or indirectly) depend on __GFP_DIRECT_RECLAIM
&& !__GFP_NORETRY
> allocations because the balloon driver needs to handle OOM notifier
callback.
>
>>                                                         Would GFP_NOIO
>> do the job?  I'm a little hazy on exactly how the balloon driver
works.
> GFP_NOIO implies __GFP_DIRECT_RECLAIM. In the worst case, it can lockup due
to
> the too small to fail memory allocation rule. GFP_NOIO | __GFP_NORETRY
would work
> if there is really a guarantee that GFP_NOIO | __GFP_NORETRY never depend
on
> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations, which is too
subtle for me to
> validate. The direct reclaim dependency is too complicated to validate.
> I consider that !__GFP_DIRECT_RECLAIM is the future-safe choice.
What's the problem with (or how is it better than) the "GFP_NOWAIT | 
__GFP_NOWARN" we are using here?

>> If you can't preload with anything better than that, I think that
>> xb_set_bit() should attempt an allocation with GFP_NOWAIT |
__GFP_NOWARN,
>> and then you can skip the preload; it has no value for you.
> Yes, that's why I suggest directly using kzalloc() at xb_set_bit().
It has some possibilities to remove that preload if we also do the 
bitmap allocation in the xb_set_bit():

bitmap = rcu_dereference_raw(*slot);
if (!bitmap) {
     bitmap = this_cpu_xchg(ida_bitmap, NULL);
     if (!bitmap) {
         bitmap = kmalloc(sizeof(*bitmap), gfp);
         if (!bitmap)
             return -ENOMEM;
     }
}

But why not just follow the radix tree implementation style that puts 
the allocation in preload, which would be invoked with a more relaxed 
gfp in other use cases?
Its usage in virtio_balloon is just a little special that we need to put 
the allocation within the balloon_lock, which doesn't give us the 
benefit of using a relaxed gfp in preload, but it doesn't prevent us 
from living with the current APIs (i.e. the preload + xb_set pattern).
On the other side, if we do it as above, we have more things that need 
to consider. For example, what if the a use case just want the radix 
tree implementation style, which means it doesn't want allocation within 
xb_set(), then would we be troubled with how to avoid the allocation 
path in that case?

So, I think it is better to stick with the convention by putting the 
allocation in preload. Breaking the convention should show obvious 
advantages, IMHO.

>
>>> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct
virtio_balloon *vb, size_t num)
>>>   
>>>   	while ((page = balloon_page_pop(&pages))) {
>>>   		balloon_page_enqueue(&vb->vb_dev_info, page);
>>> +		if (use_sg) {
>>> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
>>> +				__free_page(page);
>>> +				continue;
>>> +			}
>>> +		} else {
>>> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>>> +		}
>> Is this the right behaviour?
> I don't think so. In the worst case, we can set no bit using
xb_set_page().
>
>>                                If we can't record the page in the
xb,
>> wouldn't we rather send it across as a single page?
>>
> I think that we need to be able to fallback to !use_sg path when OOM.
I also have different thoughts:

1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with 
fill_balloon), which does not use xbitmap to record pages, thus no 
memory allocation.

2) If the memory is already under pressure, it is pointless to continue 
inflating memory to the host. We need to give thanks to the memory 
allocation failure reported by xbitmap, which gets us a chance to 
release the inflated pages that have been demonstrated to cause the 
memory pressure of the guest.

Best,
Wei

Wei Wang

2017-Dec-24 08:16 UTC

head link

[virtio-dev] Re: [PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

On 12/24/2017 03:42 PM, Wei Wang wrote:> On 12/24/2017 12:45 PM, Tetsuo Handa wrote:
>> Matthew Wilcox wrote:
>>>> +    unsigned long pfn = page_to_pfn(page);
>>>> +    int ret;
>>>> +
>>>> +    *pfn_min = min(pfn, *pfn_min);
>>>> +    *pfn_max = max(pfn, *pfn_max);
>>>> +
>>>> +    do {
>>>> +        if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
>>>> +            return -ENOMEM;
>>>> +
>>>> +        ret = xb_set_bit(&vb->page_xb, pfn);
>>>> +        xb_preload_end();
>>>> +    } while (unlikely(ret == -EAGAIN));
>>> OK, so you don't need a spinlock because you're under a
mutex?  But you
>>> can't allocate memory because you're in the balloon driver,
and so a
>>> GFP_KERNEL allocation might recurse into your driver?
>> Right. We can't (directly or indirectly) depend on 
>> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
>> allocations because the balloon driver needs to handle OOM notifier 
>> callback.
>>
>>> Would GFP_NOIO
>>> do the job?  I'm a little hazy on exactly how the balloon
driver works.
>> GFP_NOIO implies __GFP_DIRECT_RECLAIM. In the worst case, it can 
>> lockup due to
>> the too small to fail memory allocation rule. GFP_NOIO | 
>> __GFP_NORETRY would work
>> if there is really a guarantee that GFP_NOIO | __GFP_NORETRY never 
>> depend on
>> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations, which is
too
>> subtle for me to
>> validate. The direct reclaim dependency is too complicated to validate.
>> I consider that !__GFP_DIRECT_RECLAIM is the future-safe choice.
>
> What's the problem with (or how is it better than) the "GFP_NOWAIT
|
> __GFP_NOWARN" we are using here?
>
>
>>> If you can't preload with anything better than that, I think
that
>>> xb_set_bit() should attempt an allocation with GFP_NOWAIT | 
>>> __GFP_NOWARN,
>>> and then you can skip the preload; it has no value for you.
>> Yes, that's why I suggest directly using kzalloc() at xb_set_bit().
>
> It has some possibilities to remove that preload if we also do the 
> bitmap allocation in the xb_set_bit():
>
> bitmap = rcu_dereference_raw(*slot);
> if (!bitmap) {
>     bitmap = this_cpu_xchg(ida_bitmap, NULL);
>     if (!bitmap) {
>         bitmap = kmalloc(sizeof(*bitmap), gfp);
>         if (!bitmap)
>             return -ENOMEM;
>     }
> }
>
> But why not just follow the radix tree implementation style that puts 
> the allocation in preload, which would be invoked with a more relaxed 
> gfp in other use cases?
> Its usage in virtio_balloon is just a little special that we need to 
> put the allocation within the balloon_lock, which doesn't give us the 
> benefit of using a relaxed gfp in preload, but it doesn't prevent us 
> from living with the current APIs (i.e. the preload + xb_set pattern).
> On the other side, if we do it as above, we have more things that need 
> to consider. For example, what if the a use case just want the radix 
> tree implementation style, which means it doesn't want allocation 
> within xb_set(), then would we be troubled with how to avoid the 
> allocation path in that case?
>
> So, I think it is better to stick with the convention by putting the 
> allocation in preload. Breaking the convention should show obvious 
> advantages, IMHO.
>
>
>>
>>>> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct 
>>>> virtio_balloon *vb, size_t num)
>>>>         while ((page = balloon_page_pop(&pages))) {
>>>>           balloon_page_enqueue(&vb->vb_dev_info, page);
>>>> +        if (use_sg) {
>>>> +            if (xb_set_page(vb, page, &pfn_min,
&pfn_max) < 0) {
>>>> +                __free_page(page);
>>>> +                continue;
>>>> +            }
>>>> +        } else {
>>>> +            set_page_pfns(vb, vb->pfns + vb->num_pfns,
page);
>>>> +        }
>>> Is this the right behaviour?
>> I don't think so. In the worst case, we can set no bit using 
>> xb_set_page().
>
>>
>>>                                If we can't record the page in
the xb,
>>> wouldn't we rather send it across as a single page?
>>>
>> I think that we need to be able to fallback to !use_sg path when OOM.
>
> I also have different thoughts:
>
> 1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with 
> fill_balloon), which does not use xbitmap to record pages, thus no 
> memory allocation.
>
> 2) If the memory is already under pressure, it is pointless to 
> continue inflating memory to the host. We need to give thanks to the 
> memory allocation failure reported by xbitmap, which gets us a chance 
> to release the inflated pages that have been demonstrated to cause the 
> memory pressure of the guest.
>
Forgot to add my conclusion: I think the above behavior is correct.

Best,
Wei

Wei Wang

2017-Dec-26 03:06 UTC

head link

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

On 12/25/2017 10:51 PM, Tetsuo Handa wrote:> Wei Wang wrote:
>>>>>> @@ -173,8 +292,15 @@ static unsigned
fill_balloon(struct
>>>>>> virtio_balloon *vb, size_t num)
>>>>>>          while ((page = balloon_page_pop(&pages)))
{
>>>>>>           
balloon_page_enqueue(&vb->vb_dev_info, page);
>>>>>> +        if (use_sg) {
>>>>>> +            if (xb_set_page(vb, page, &pfn_min,
&pfn_max) < 0) {
>>>>>> +                __free_page(page);
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +        } else {
>>>>>> +            set_page_pfns(vb, vb->pfns +
vb->num_pfns, page);
>>>>>> +        }
>>>>> Is this the right behaviour?
>>>> I don't think so. In the worst case, we can set no bit
using
>>>> xb_set_page().
>>>>>                                 If we can't record the
page in the xb,
>>>>> wouldn't we rather send it across as a single page?
>>>>>
>>>> I think that we need to be able to fallback to !use_sg path
when OOM.
>>> I also have different thoughts:
>>>
>>> 1) For OOM, we have leak_balloon_sg_oom (oom has nothing to do with
>>> fill_balloon), which does not use xbitmap to record pages, thus no
>>> memory allocation.
>>>
>>> 2) If the memory is already under pressure, it is pointless to
>>> continue inflating memory to the host. We need to give thanks to
the
>>> memory allocation failure reported by xbitmap, which gets us a
chance
>>> to release the inflated pages that have been demonstrated to cause
the
>>> memory pressure of the guest.
>>>
>> Forgot to add my conclusion: I think the above behavior is correct.
>>
> What is the desired behavior when hitting OOM path during inflate/deflate?
> Once inflation started, the inflation logic is called again and again
> until the balloon inflates to the requested size.
The above is true, but I can't agree with the following. Please see below.
> Such situation will
> continue wasting CPU resource between inflate-due-to-host's-request
versus
> deflate-due-to-guest's-OOM. It is pointless but cannot stop doing
pointless
> thing.
What we are doing here is to free the pages that were just allocated in 
this round of inflating. Next round will be sometime later when the 
balloon work item gets its turn to run. Yes, it will then continue to 
inflate.
Here are the two cases that will happen then:
1) the guest is still under memory pressure, the inflate will fail at 
memory allocation, which results in a msleep(200), and then it exists 
for another time to run.
2) the guest isn't under memory pressure any more (e.g. the task which 
consumes the huge amount of memory is gone), it will continue to inflate 
as normal till the requested size.

I think what we are doing is a quite sensible behavior, except a small 
change I plan to make:

         while ((page = balloon_page_pop(&pages))) {
-               balloon_page_enqueue(&vb->vb_dev_info, page);
                 if (use_sg) {
                         if (xb_set_page(vb, page, &pfn_min, &pfn_max)
<
0) {
                                 __free_page(page);
                                 continue;
                         }
                 } else {
                         set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
                 }
+             balloon_page_enqueue(&vb->vb_dev_info, page);
>
> Also, as of Linux 4.15, only up to VIRTIO_BALLOON_ARRAY_PFNS_MAX pages
(i.e.
> 1MB) are invisible from deflate request. That amount would be an acceptable
> error. But your patch makes more pages being invisible, for pages allocated
> by balloon_page_alloc() without holding balloon_lock are stored into a
local
> variable "LIST_HEAD(pages)" (which means that
balloon_page_dequeue() with
> balloon_lock held won't be able to find pages not yet queued by
> balloon_page_enqueue()), doesn't it? What if all memory pages were held
in
> "LIST_HEAD(pages)" and balloon_page_dequeue() was called before
> balloon_page_enqueue() is called?
>
If we think of the balloon driver just as a regular driver or 
application, that will be a pretty nature thing. A regular driver can 
eat a huge amount of memory for its own usages, would this amount of 
memory be treated as an error as they are invisible to the 
balloon_page_enqueue?

Best,
Wei

Wei Wang

2017-Dec-26 11:36 UTC

head link

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

On 12/26/2017 06:38 PM, Tetsuo Handa wrote:> Wei Wang wrote:
>> On 12/25/2017 10:51 PM, Tetsuo Handa wrote:
>>> Wei Wang wrote:
>>>
>> What we are doing here is to free the pages that were just allocated in
>> this round of inflating. Next round will be sometime later when the
>> balloon work item gets its turn to run. Yes, it will then continue to
>> inflate.
>> Here are the two cases that will happen then:
>> 1) the guest is still under memory pressure, the inflate will fail at
>> memory allocation, which results in a msleep(200), and then it exists
>> for another time to run.
>> 2) the guest isn't under memory pressure any more (e.g. the task
which
>> consumes the huge amount of memory is gone), it will continue to
inflate
>> as normal till the requested size.
>>
> How likely does 2) occur? It is not so likely. msleep(200) is enough to
spam
> the guest with puff messages. Next round is starting too quickly.
I meant one of the two cases, 1) or 2), would happen, rather than 2) 
happens after 1).

If 2) doesn't happen, then 1) happens. It will continue to try to 
inflate round by round. But the memory allocation won't succeed, so 
there will be no pages to inflate to the host. That is, the inflating is 
simply a code path to the msleep(200) as long as the guest is under 
memory pressure.

Back to our code change, it doesn't result in incorrect behavior as 
explained above.
>> I think what we are doing is a quite sensible behavior, except a small
>> change I plan to make:
>>
>>           while ((page = balloon_page_pop(&pages))) {
>> -               balloon_page_enqueue(&vb->vb_dev_info, page);
>>                   if (use_sg) {
>>                           if (xb_set_page(vb, page, &pfn_min,
&pfn_max) <
>> 0) {
>>                                   __free_page(page);
>>                                   continue;
>>                           }
>>                   } else {
>>                           set_page_pfns(vb, vb->pfns +
vb->num_pfns, page);
>>                   }
>> +             balloon_page_enqueue(&vb->vb_dev_info, page);
>>
>>> Also, as of Linux 4.15, only up to VIRTIO_BALLOON_ARRAY_PFNS_MAX
pages (i.e.
>>> 1MB) are invisible from deflate request. That amount would be an
acceptable
>>> error. But your patch makes more pages being invisible, for pages
allocated
>>> by balloon_page_alloc() without holding balloon_lock are stored
into a local
>>> variable "LIST_HEAD(pages)" (which means that
balloon_page_dequeue() with
>>> balloon_lock held won't be able to find pages not yet queued by
>>> balloon_page_enqueue()), doesn't it? What if all memory pages
were held in
>>> "LIST_HEAD(pages)" and balloon_page_dequeue() was called
before
>>> balloon_page_enqueue() is called?
>>>
>> If we think of the balloon driver just as a regular driver or
>> application, that will be a pretty nature thing. A regular driver can
>> eat a huge amount of memory for its own usages, would this amount of
>> memory be treated as an error as they are invisible to the
>> balloon_page_enqueue?
>>
> No. Memory used by applications which consumed a lot of memory in their
> mm_struct is reclaimed by the OOM killer/reaper. Drivers try to avoid
> allocating more memory than they need. If drivers allocate more memory
> than they need, they have a hook for releasing unused memory (i.e.
> register_shrinker() or OOM notifier). What I'm saying here is that
> the hook for releasing unused memory does not work unless memory held in
> LIST_HEAD(pages) becomes visible to balloon_page_dequeue().
>
> If a system has 128GB of memory, and 127GB of memory was stored into
> LIST_HEAD(pages) upon first fill_balloon() request, and somebody held
> balloon_lock from OOM notifier path from out_of_memory() before
> fill_balloon() holds balloon_lock, leak_balloon_sg_oom() finds that
> no memory can be freed because balloon_page_enqueue() was never called,
> and allows the caller of out_of_memory() to invoke the OOM killer despite
> there is 127GB of memory which can be freed if fill_balloon() was able
> to hold balloon_lock before leak_balloon_sg_oom() holds balloon_lock.
> I don't think that that amount is an acceptable error.
I understand you are worried that OOM couldn't get balloon pages while 
there are some in the local list. This is a debatable issue, and it may 
lead to a long discussion. If this is considered to be a big issue, we 
can make the local list to be global in vb, and accessed by oom 
notifier, this won't affect this patch, and can be achieved with an 
add-on patch. How about leaving this discussion as a second step outside 
this series? Balloon has something more that can be improved, and this 
patch series is already big.

Best,
Wei

Matthew Wilcox

2018-Jan-02 13:24 UTC

head link

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

On Sun, Dec 24, 2017 at 03:42:02PM +0800, Wei Wang
wrote:> On 12/24/2017 12:45 PM, Tetsuo Handa wrote:
> > Matthew Wilcox wrote:
> > > If you can't preload with anything better than that, I think
that
> > > xb_set_bit() should attempt an allocation with GFP_NOWAIT |
__GFP_NOWARN,
> > > and then you can skip the preload; it has no value for you.
> > Yes, that's why I suggest directly using kzalloc() at
xb_set_bit().
> 
> It has some possibilities to remove that preload if we also do the bitmap
> allocation in the xb_set_bit():
> 
> bitmap = rcu_dereference_raw(*slot);
> if (!bitmap) {
>     bitmap = this_cpu_xchg(ida_bitmap, NULL);
>     if (!bitmap) {
>         bitmap = kmalloc(sizeof(*bitmap), gfp);
>         if (!bitmap)
>             return -ENOMEM;
>     }
> }
> 
> But why not just follow the radix tree implementation style that puts the
> allocation in preload, which would be invoked with a more relaxed gfp in
> other use cases?
Actually, the radix tree does attempt allocation, and falls back to the
preload.  The IDA was the odd one out that doesn't attempt allocation,
and that was where I copied the code from.  For other users, the preload
API is beneficial, so I will leave it in.
> Its usage in virtio_balloon is just a little special that we need to put
the
> allocation within the balloon_lock, which doesn't give us the benefit
of
> using a relaxed gfp in preload, but it doesn't prevent us from living
with
> the current APIs (i.e. the preload + xb_set pattern).
> On the other side, if we do it as above, we have more things that need to
> consider. For example, what if the a use case just want the radix tree
> implementation style, which means it doesn't want allocation within
> xb_set(), then would we be troubled with how to avoid the allocation path
in
> that case?
> 
> So, I think it is better to stick with the convention by putting the
> allocation in preload. Breaking the convention should show obvious
> advantages, IMHO.
The radix tree convention is objectively awful, which is why I'm working
to change it.  Specifying the GFP flags at radix tree initialisation time
rather than allocation time leads to all kinds of confusion.  The preload
API is a pretty awful workaround, and it will go away once the XArray
is working correctly.  That said, there's no alternative to it without
making XBitmap depend on XArray, and I don't want to hold you up there.
So there's an xb_preload for the moment.

Wei Wang

2018-Jan-03 09:00 UTC

head link

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

On 01/03/2018 10:29 AM, Tetsuo Handa wrote:> Matthew Wilcox wrote:
>> The radix tree convention is objectively awful, which is why I'm
working
>> to change it.  Specifying the GFP flags at radix tree initialisation
time
>> rather than allocation time leads to all kinds of confusion.  The
preload
>> API is a pretty awful workaround, and it will go away once the XArray
>> is working correctly.  That said, there's no alternative to it
without
>> making XBitmap depend on XArray, and I don't want to hold you up
there.
>> So there's an xb_preload for the moment.
> I'm ready to propose cvbmp shown below as an alternative to xbitmap
(but
> specialized for virtio-balloon case). Wei, can you do some benchmarking
> between xbitmap and cvbmp?
> ----------------------------------------
> cvbmp: clustered values bitmap
I don't think we need to replace xbitmap, at least at this stage. The 
new implementation doesn't look simpler at all, and virtio-balloon has 
worked well with xbitmap.

I would suggest you to send out the new implementation for discussion 
after this series ends, and justify with better performance results if 
you could get.

Best,
Wei

Apparently Analagous Threads

Search for more maybe matching threads

Linux Virtualization - Dec 2017 - [PATCH v20 0/7] Virtio-balloon Enhancement

[PATCH v20 0/7] Virtio-balloon Enhancement

[PATCH v20 1/7] xbitmap: Introduce xbitmap

[PATCH v20 2/7] xbitmap: potential improvement

[PATCH v20 3/7] xbitmap: add more operations

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

[PATCH v20 5/7] mm: support reporting free page blocks

[PATCH v20 6/7] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

[PATCH v20 7/7] virtio-balloon: don't report free pages when page poisoning is enabled

[PATCH v20 1/7] xbitmap: Introduce xbitmap

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

[virtio-dev] Re: [PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

[PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

Apparently Analagous Threads