This patch implements persistent grants for the xen-blk{front,back} mechanism. The effect of this change is to reduce the number of unmap operations performed, since they cause a (costly) TLB shootdown. This allows the I/O performance to scale better when a large number of VMs are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request queue. This was granted to dom0; dom0 performed the I/O, wrote directly into the grant-mapped memory, and unmapped it; blkfront then removed foreign access for that grant. The cost of unmapping scales badly with the number of CPUs in Dom0. An experiment showed that when Dom0 has 24 VCPUs, and guests are performing parallel I/O to a ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests (at which point 650,000 IOPS are being performed in total). If more than 5 guests are used, the performance declines. By 10 guests, only 400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection between blkfront and back is broken.

On startup, blk{front,back} use xenbus to communicate their ability to perform 'feature-persistent'. Iff both ends have this ability, persistent mode is used.

To perform a read in persistent mode, blkfront uses a separate pool of pages that it maps to dom0. When a request comes in, blkfront transmutes the request so that blkback will write into one of these free pages. Blkback keeps note of which grefs it has already mapped. When a new ring request comes to blkback, it looks to see if it has already mapped that page. If so, it will not map it again. If the page hasn't been previously mapped, it is mapped now, and a record is kept of this mapping. Blkback proceeds as usual. When blkfront is notified that blkback has completed a request, it memcpys from the shared memory into the supplied bvec. A record is kept that the {gref, page} tuple is mapped and not in flight.

Writes are similar, except that the memcpy is performed from the supplied bvecs into the shared pages, before the request is put onto the ring.

Blkback has to store a mapping of grefs => {page mapped to by gref} in an array. As the grefs are not known a priori, and provide no guarantees on their ordering, we have to perform a linear search through this array to find the page, for every gref we receive. The overhead of this is low; however, future work might want to use a more efficient data structure to reduce this O(n) operation.

We (ijc and myself) have introduced a new constant, BLKIF_MAX_PERS_REQUESTS_PER_DEV. This is to prevent a malicious guest from attempting a DoS by supplying fresh grefs, causing the Dom0 kernel to map excessively. This is currently set to 256, the maximum number of entries in the ring. As the number of in-flight requests is <= the size of the ring, 256 is also the maximum sensible size. This introduces a maximum overhead of 11MB of mapped memory per block device. In practice, we typically use only about 60 of these. If the guest exceeds the 256 limit, it is either buggy or malicious. We treat this in one of two ways:

1) If we have mapped fewer than BLKIF_MAX_SEGMENTS_PER_REQUEST * BLKIF_MAX_PERS_REQUESTS_PER_DEV pages, we will persistently map the grefs. This can occur if previous requests have not used all BLKIF_MAX_SEGMENTS_PER_REQUEST segments.

2) Otherwise, we revert to non-persistent grants for all future grefs.
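For reference, the 11MB worst case quoted above is just the product of the two limits, assuming 4KB pages:

    BLKIF_MAX_PERS_REQUESTS_PER_DEV * BLKIF_MAX_SEGMENTS_PER_REQUEST * PAGE_SIZE
        = 256 * 11 * 4096 bytes
        = 11,534,336 bytes, i.e. 11 MiB of mapped memory per block device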
In writing this patch, the question arises as to whether the additional cost of performing memcpys in the guest (to/from the pool of granted pages) outweighs the gains of not performing TLB shootdowns. The answer to that question is `no'. There appears to be very little, if any, additional cost to the guest of using persistent grants. There is perhaps a small saving, from the reduced number of hypercalls performed in granting, and ending foreign access.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
---
 drivers/block/xen-blkback/blkback.c |  149 +++++++++++++++++++++++++---
 drivers/block/xen-blkback/common.h  |   16 +++
 drivers/block/xen-blkback/xenbus.c  |   21 +++-
 drivers/block/xen-blkfront.c        |  186 ++++++++++++++++++++++++++++++-----
 include/xen/interface/io/blkif.h    |    9 ++
 5 files changed, 338 insertions(+), 43 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 73f196c..f95dee9 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -78,6 +78,7 @@ struct pending_req {
 	unsigned short	operation;
 	int		status;
 	struct list_head free_list;
+	u8		is_pers;
 };
 
 #define BLKBACK_INVALID_HANDLE (~0)
@@ -128,6 +129,24 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 static void make_response(struct xen_blkif *blkif, u64 id,
 			  unsigned short op, int st);
 
+static void add_pers_gnt(struct pers_gnt *pers_gnt, struct xen_blkif *blkif)
+{
+	BUG_ON(blkif->pers_gnt_c >= BLKIF_MAX_PERS_REQUESTS_PER_DEV *
+	       BLKIF_MAX_SEGMENTS_PER_REQUEST);
+	blkif->pers_gnts[blkif->pers_gnt_c++] = pers_gnt;
+}
+
+static struct pers_gnt *get_pers_gnt(struct xen_blkif *blkif,
+				     grant_ref_t gref)
+{
+	int i;
+
+	for (i = 0; i < blkif->pers_gnt_c; i++)
+		if (gref == blkif->pers_gnts[i]->gnt)
+			return blkif->pers_gnts[i];
+	return NULL;
+}
+
 /*
  * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
  */
@@ -274,6 +293,12 @@ int xen_blkif_schedule(void *arg)
 {
 	struct xen_blkif *blkif = arg;
 	struct xen_vbd *vbd = &blkif->vbd;
+	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct pers_gnt *pers_gnt;
+	int i;
+	int ret = 0;
+	int segs_to_unmap;
 
 	xen_blkif_get(blkif);
 
@@ -301,6 +326,29 @@ int xen_blkif_schedule(void *arg)
 			print_stats(blkif);
 	}
 
+	/* Free all persistent grant pages */
+
+	while (segs_to_unmap = min(BLKIF_MAX_SEGMENTS_PER_REQUEST,
+				   blkif->pers_gnt_c)) {
+
+		for (i = 0; i < segs_to_unmap; i++) {
+			pers_gnt = blkif->pers_gnts[blkif->pers_gnt_c - i - 1];
+
+			gnttab_set_unmap_op(&unmap[i],
+					    pfn_to_kaddr(page_to_pfn(
+							 pers_gnt->page)),
+					    GNTMAP_host_map,
+					    pers_gnt->handle);
+
+			pages[i] = pers_gnt->page;
+		}
+
+		ret = gnttab_unmap_refs(unmap, pages, segs_to_unmap, false);
+		BUG_ON(ret);
+
+		blkif->pers_gnt_c -= segs_to_unmap;
+
+	}
+
 	if (log_stats)
 		print_stats(blkif);
 
@@ -343,13 +391,28 @@ static void xen_blkbk_unmap(struct pending_req *req)
 
 static int xen_blkbk_map(struct blkif_request *req,
 			 struct pending_req *pending_req,
-			 struct seg_buf seg[])
+			 struct seg_buf seg[],
+			 struct page *pages[])
 {
 	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct pers_gnt *new_pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct pers_gnt *pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct pers_gnt *pers_gnt;
+	phys_addr_t addr;
 	int i;
+	int new_map;
 	int nseg = req->u.rw.nr_segments;
+	int segs_to_init = 0;
 	int ret = 0;
+	int use_pers_gnts;
 
+	use_pers_gnts = (pending_req->blkif->can_grant_persist &&
+			 pending_req->blkif->pers_gnt_c <
+			 BLKIF_MAX_SEGMENTS_PER_REQUEST *
+			 BLKIF_MAX_PERS_REQUESTS_PER_DEV);
+
+	pending_req->is_pers = use_pers_gnts;
 	/*
 	 * Fill out preq.nr_sects with proper amount of sectors, and setup
 	 * assign map[..] with the PFN of the page in our domain with the
@@ -358,36 +421,87 @@ static int xen_blkbk_map(struct blkif_request *req,
 	for (i = 0; i < nseg; i++) {
 		uint32_t flags;
 
+		if (use_pers_gnts) {
+			pers_gnt = get_pers_gnt(pending_req->blkif,
+						req->u.rw.seg[i].gref);
+			if (!pers_gnt) {
+				new_map = 1;
+				pers_gnt = kmalloc(sizeof(struct pers_gnt),
+						   GFP_KERNEL);
+				if (!pers_gnt)
+					return -ENOMEM;
+				pers_gnt->page = alloc_page(GFP_KERNEL);
+				if (!pers_gnt->page)
+					return -ENOMEM;
+				pers_gnt->gnt = req->u.rw.seg[i].gref;
+
+				pages_to_gnt[segs_to_init] = pers_gnt->page;
+				new_pers_gnts[segs_to_init] = pers_gnt;
+
+				add_pers_gnt(pers_gnt, pending_req->blkif);
+
+			} else {
+				new_map = 0;
+			}
+			pages[i] = pers_gnt->page;
+			addr = (unsigned long) pfn_to_kaddr(
+				page_to_pfn(pers_gnt->page));
+			pers_gnts[i] = pers_gnt;
+		} else {
+			new_map = 1;
+			pages[i] = blkbk->pending_page(pending_req, i);
+			addr = vaddr(pending_req, i);
+			pages_to_gnt[i] = blkbk->pending_page(pending_req, i);
+		}
+
 		flags = GNTMAP_host_map;
-		if (pending_req->operation != BLKIF_OP_READ)
+		if (!use_pers_gnts && (pending_req->operation != BLKIF_OP_READ))
 			flags |= GNTMAP_readonly;
-		gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
-				  req->u.rw.seg[i].gref,
-				  pending_req->blkif->domid);
+		if (new_map) {
+			gnttab_set_map_op(&map[segs_to_init++], addr,
+					  flags, req->u.rw.seg[i].gref,
+					  pending_req->blkif->domid);
+		}
 	}
 
-	ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
-	BUG_ON(ret);
+	if (segs_to_init) {
+		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_init);
+		BUG_ON(ret);
+	}
 
 	/*
 	 * Now swizzle the MFN in our domain with the MFN from the other domain
 	 * so that when we access vaddr(pending_req,i) it has the contents of
 	 * the page from the other domain.
 	 */
-	for (i = 0; i < nseg; i++) {
+	for (i = 0; i < segs_to_init; i++) {
 		if (unlikely(map[i].status != 0)) {
 			pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
 			map[i].handle = BLKBACK_INVALID_HANDLE;
 			ret |= 1;
 		}
 
-		pending_handle(pending_req, i) = map[i].handle;
+		if (use_pers_gnts) {
+			/* store the `out' values from map */
+			pending_req->blkif->pers_gnts
+				[pending_req->blkif->pers_gnt_c - segs_to_init +
+				 i]->handle = map[i].handle;
+			new_pers_gnts[i]->dev_bus_addr = map[i].dev_bus_addr;
+		}
 
 		if (ret)
 			continue;
-
-		seg[i].buf  = map[i].dev_bus_addr |
-			(req->u.rw.seg[i].first_sect << 9);
+	}
+	for (i = 0; i < nseg; i++) {
+		if (use_pers_gnts) {
+			pending_handle(pending_req, i) = pers_gnts[i]->handle;
+			seg[i].buf = pers_gnts[i]->dev_bus_addr |
+				(req->u.rw.seg[i].first_sect << 9);
+		} else {
+			pending_handle(pending_req, i) = map[i].handle;
+			seg[i].buf = map[i].dev_bus_addr |
+				(req->u.rw.seg[i].first_sect << 9);
+		}
 	}
 	return ret;
 }
@@ -468,7 +582,8 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
 	 * the proper response on the ring.
 	 */
 	if (atomic_dec_and_test(&pending_req->pendcnt)) {
-		xen_blkbk_unmap(pending_req);
+		if (!pending_req->is_pers)
+			xen_blkbk_unmap(pending_req);
 		make_response(pending_req->blkif, pending_req->id,
 			      pending_req->operation, pending_req->status);
 		xen_blkif_put(pending_req->blkif);
@@ -590,6 +705,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	int operation;
 	struct blk_plug plug;
 	bool drain = false;
+	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 
 	switch (req->operation) {
 	case BLKIF_OP_READ:
@@ -676,7 +792,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	 * the hypercall to unmap the grants - that is all done in
 	 * xen_blkbk_unmap.
 	 */
-	if (xen_blkbk_map(req, pending_req, seg))
+	if (xen_blkbk_map(req, pending_req, seg, pages))
 		goto fail_flush;
 
 	/*
@@ -688,7 +804,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	for (i = 0; i < nseg; i++) {
 		while ((bio == NULL) ||
 		       (bio_add_page(bio,
-				     blkbk->pending_page(pending_req, i),
+				     pages[i],
 				     seg[i].nsec << 9,
 				     seg[i].buf & ~PAGE_MASK) == 0)) {
 
@@ -743,7 +859,8 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	return 0;
 
 fail_flush:
-	xen_blkbk_unmap(pending_req);
+	if (!blkif->can_grant_persist)
+		xen_blkbk_unmap(pending_req);
 fail_response:
 	/* Haven't submitted any bio's yet. */
 	make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 9ad3b5e..1ecb709 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -47,6 +47,8 @@
 #define DPRINTK(fmt, args...)				\
 	pr_debug(DRV_PFX "(%s:%d) " fmt ".\n",		\
 		 __func__, __LINE__, ##args)
+#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
+
 
 
 /* Not a real protocol. Used to generate ring structs which contain
@@ -164,6 +166,14 @@ struct xen_vbd {
 
 struct backend_info;
 
+
+struct pers_gnt {
+	struct page *page;
+	grant_ref_t gnt;
+	uint32_t handle;
+	uint64_t dev_bus_addr;
+};
+
 struct xen_blkif {
 	/* Unique identifier for this interface. */
 	domid_t			domid;
@@ -190,6 +200,12 @@ struct xen_blkif {
 	struct task_struct	*xenblkd;
 	unsigned int		waiting_reqs;
 
+	/* frontend feature information */
+	u8 can_grant_persist:1;
+	struct pers_gnt *pers_gnts[BLKIF_MAX_PERS_REQUESTS_PER_DEV *
+				   BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	unsigned int pers_gnt_c;
+
 	/* statistics */
 	unsigned long		st_print;
 	int			st_rd_req;
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 4f66171..dc58cc4 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -681,6 +681,13 @@ again:
 		goto abort;
 	}
 
+	err = xenbus_printf(xbt, dev->nodename, "feature-persistent-grants",
+			    "%d", 1);
+	if (err) {
+		xenbus_dev_fatal(dev, err, "writing persistent capability");
+		goto abort;
+	}
+
 	/* FIXME: use a typename instead */
 	err = xenbus_printf(xbt, dev->nodename, "info", "%u",
 			    be->blkif->vbd.type |
@@ -721,6 +728,7 @@ static int connect_ring(struct backend_info *be)
 	struct xenbus_device *dev = be->dev;
 	unsigned long ring_ref;
 	unsigned int evtchn;
+	u8 pers_grants;
 	char protocol[64] = "";
 	int err;
 
@@ -750,8 +758,17 @@ static int connect_ring(struct backend_info *be)
 		xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
 		return -1;
 	}
-	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
-		ring_ref, evtchn, be->blkif->blk_protocol, protocol);
+	err = xenbus_gather(XBT_NIL, dev->otherend,
+			    "feature-persistent-grants", "%d",
+			    &pers_grants, NULL);
+	if (err)
+		pers_grants = 0;
+
+	be->blkif->can_grant_persist = pers_grants;
+
+	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
+		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
+		pers_grants);
 
 	/* Map the shared frame, irq etc.
 	 */
 	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index e4fb337..c1cc5fe 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -68,6 +68,13 @@ struct blk_shadow {
 	struct blkif_request req;
 	struct request *request;
 	unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct gnt_list *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+};
+
+struct gnt_list {
+	grant_ref_t gref;
+	unsigned long frame;
+	struct gnt_list *tail;
 };
 
 static DEFINE_MUTEX(blkfront_mutex);
@@ -97,11 +104,14 @@ struct blkfront_info
 	struct work_struct work;
 	struct gnttab_free_callback callback;
 	struct blk_shadow shadow[BLK_RING_SIZE];
+	struct gnt_list *persistent_grants_front;
+	unsigned int persistent_grants_c;
 	unsigned long shadow_free;
 	unsigned int feature_flush;
 	unsigned int flush_op;
 	unsigned int feature_discard:1;
 	unsigned int feature_secdiscard:1;
+	unsigned int feature_persistent:1;
 	unsigned int discard_granularity;
 	unsigned int discard_alignment;
 	int is_ready;
@@ -286,22 +296,37 @@ static int blkif_queue_request(struct request *req)
 	struct blkif_request *ring_req;
 	unsigned long id;
 	unsigned int fsect, lsect;
-	int i, ref;
+	int i, ref, use_pers_gnts, new_pers_gnts;
 	grant_ref_t gref_head;
+	struct bio_vec *bvecs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+	struct bio_vec *bvec;
+	struct req_iterator iter;
+	char *bvec_data;
+	void *shared_data;
+	unsigned long flags;
+	struct page *shared_page;
+	struct gnt_list *gnt_list_entry;
 	struct scatterlist *sg;
 
 	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
 		return 1;
 
-	if (gnttab_alloc_grant_references(
-		BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
-		gnttab_request_free_callback(
-			&info->callback,
-			blkif_restart_queue_callback,
-			info,
-			BLKIF_MAX_SEGMENTS_PER_REQUEST);
-		return 1;
+	use_pers_gnts = info->feature_persistent;
+
+	if (info->persistent_grants_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+		new_pers_gnts = 1;
+		if (gnttab_alloc_grant_references(
+			BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
+			gnttab_request_free_callback(
+				&info->callback,
+				blkif_restart_queue_callback,
+				info,
+				BLKIF_MAX_SEGMENTS_PER_REQUEST);
+			return 1;
+		}
 	} else
+		new_pers_gnts = 0;
 
 	/* Fill out a communications ring structure. */
 	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
@@ -341,20 +366,66 @@ static int blkif_queue_request(struct request *req)
 					       BLKIF_MAX_SEGMENTS_PER_REQUEST);
 
 	for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
-			buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
 			fsect = sg->offset >> 9;
 			lsect = fsect + (sg->length >> 9) - 1;
-			/* install a grant reference. */
-			ref = gnttab_claim_grant_reference(&gref_head);
-			BUG_ON(ref == -ENOSPC);
 
-			gnttab_grant_foreign_access_ref(
+			if (use_pers_gnts && info->persistent_grants_c) {
+				gnt_list_entry = info->persistent_grants_front;
+
+				info->persistent_grants_front =
+					info->persistent_grants_front->tail;
+				ref = gnt_list_entry->gref;
+				buffer_mfn = pfn_to_mfn(gnt_list_entry->frame);
+				info->persistent_grants_c--;
+			} else {
+				ref = gnttab_claim_grant_reference(&gref_head);
+				BUG_ON(ref == -ENOSPC);
+
+				if (use_pers_gnts) {
+					gnt_list_entry =
+						kmalloc(sizeof(struct gnt_list),
+							GFP_ATOMIC);
+					if (!gnt_list_entry)
+						return -ENOMEM;
+
+					shared_page = alloc_page(GFP_ATOMIC);
+					if (!shared_page)
+						return -ENOMEM;
+
+					gnt_list_entry->frame =
+						page_to_pfn(shared_page);
+					gnt_list_entry->gref = ref;
+				} else
+					shared_page = sg_page(sg);
+
+				buffer_mfn = pfn_to_mfn(page_to_pfn(
+							shared_page));
+				gnttab_grant_foreign_access_ref(
 				ref,
 				info->xbdev->otherend_id,
 				buffer_mfn,
-				rq_data_dir(req));
+				!use_pers_gnts && rq_data_dir(req));
+			}
+
+			if (use_pers_gnts)
+				info->shadow[id].grants_used[i] =
+					gnt_list_entry;
+
+			if (use_pers_gnts && rq_data_dir(req)) {
+				shared_data = kmap_atomic(
+					pfn_to_page(gnt_list_entry->frame));
+				bvec_data = kmap_atomic(sg_page(sg));
+
+				memcpy(shared_data + sg->offset,
+				       bvec_data + sg->offset,
+				       sg->length);
+
+				kunmap_atomic(bvec_data);
+				kunmap_atomic(shared_data);
+			}
 
 			info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
+
 			ring_req->u.rw.seg[i] =
 					(struct blkif_request_segment) {
 							.gref       = ref,
@@ -368,7 +439,8 @@ static int blkif_queue_request(struct request *req)
 	/* Keep a private copy so we can reissue requests when recovering. */
 	info->shadow[id].req = *ring_req;
 
-	gnttab_free_grant_references(gref_head);
+	if (new_pers_gnts)
+		gnttab_free_grant_references(gref_head);
 
 	return 0;
 }
@@ -480,12 +552,13 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
 static void xlvbd_flush(struct blkfront_info *info)
 {
 	blk_queue_flush(info->rq, info->feature_flush);
-	printk(KERN_INFO "blkfront: %s: %s: %s\n",
+	printk(KERN_INFO "blkfront: %s: %s: %s. Persistent=%d\n",
 	       info->gd->disk_name,
 	       info->flush_op == BLKIF_OP_WRITE_BARRIER ?
 		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
 		"flush diskcache" : "barrier or flush"),
-	       info->feature_flush ? "enabled" : "disabled");
+	       info->feature_flush ? "enabled" : "disabled",
+	       info->feature_persistent);
 }
 
 static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
@@ -707,6 +780,8 @@ static void blkif_restart_queue(struct work_struct *work)
 
 static void blkif_free(struct blkfront_info *info, int suspend)
 {
+	struct gnt_list *pers_gnt;
+
 	/* Prevent new requests being issued until we fix things up. */
 	spin_lock_irq(&info->io_lock);
 	info->connected = suspend ?
@@ -714,6 +789,15 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 	/* No more blkif_request(). */
 	if (info->rq)
 		blk_stop_queue(info->rq);
+
+	/* Remove all persistent grants */
+	while (info->persistent_grants_front) {
+		pers_gnt = info->persistent_grants_front;
+		info->persistent_grants_front = pers_gnt->tail;
+		gnttab_end_foreign_access(pers_gnt->gref, 0, 0UL);
+		kfree(pers_gnt);
+	}
+
 	/* No more gnttab callback work. */
 	gnttab_cancel_free_callback(&info->callback);
 	spin_unlock_irq(&info->io_lock);
@@ -734,13 +818,44 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 
 }
 
-static void blkif_completion(struct blk_shadow *s)
+static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
+			     struct blkif_response *bret)
 {
 	int i;
-	/* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
-	 * flag. */
-	for (i = 0; i < s->req.u.rw.nr_segments; i++)
-		gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
+	struct gnt_list *new_gnt_list_entry;
+	struct bio_vec *bvec;
+	struct req_iterator iter;
+	unsigned long flags;
+	char *bvec_data;
+	void *shared_data;
+
+	i = 0;
+	if (info->feature_persistent == 0) {
+		/* Do not let BLKIF_OP_DISCARD as nr_segment is in the same
+		 * place flag. */
+		for (i = 0; i < s->req.u.rw.nr_segments; i++)
+			gnttab_end_foreign_access(s->req.u.rw.seg[i].gref,
+						  0, 0UL);
+		return;
+	}
+
+	if (bret->operation == BLKIF_OP_READ)
+		rq_for_each_segment(bvec, s->request, iter) {
+			shared_data = kmap_atomic
+				(pfn_to_page(s->grants_used[i++]->frame));
+			bvec_data = bvec_kmap_irq(bvec, &flags);
+			memcpy(bvec_data, shared_data + bvec->bv_offset,
+			       bvec->bv_len);
+			bvec_kunmap_irq(bvec_data, &flags);
+			kunmap_atomic(shared_data);
+		}
+	/* Add the persistent grant into the list of free grants */
+	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
+		new_gnt_list_entry = s->grants_used[i];
+		new_gnt_list_entry->tail = info->persistent_grants_front;
+		info->persistent_grants_front = new_gnt_list_entry;
+		info->persistent_grants_c++;
+	}
 }
 
 static irqreturn_t blkif_interrupt(int irq, void *dev_id)
@@ -783,7 +898,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 		req  = info->shadow[id].request;
 
 		if (bret->operation != BLKIF_OP_DISCARD)
-			blkif_completion(&info->shadow[id]);
+			blkif_completion(&info->shadow[id], info, bret);
 
 		if (add_id_to_freelist(info, id)) {
 			WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
@@ -943,6 +1058,12 @@ again:
 		message = "writing protocol";
 		goto abort_transaction;
 	}
+	err = xenbus_printf(xbt, dev->nodename,
+			    "feature-persistent-grants", "%d", 1);
+	if (err) {
+		message = "Writing persistent grants front feature";
+		goto abort_transaction;
+	}
 
 	err = xenbus_transaction_end(xbt, 0);
 	if (err) {
@@ -1030,6 +1151,7 @@ static int blkfront_probe(struct xenbus_device *dev,
 	spin_lock_init(&info->io_lock);
 	info->xbdev = dev;
 	info->vdevice = vdevice;
+	info->persistent_grants_c = 0;
 	info->connected = BLKIF_STATE_DISCONNECTED;
 	INIT_WORK(&info->work, blkif_restart_queue);
 
@@ -1094,6 +1216,7 @@ static int blkif_recover(struct blkfront_info *info)
 				req->u.rw.seg[j].gref,
 				info->xbdev->otherend_id,
 				pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
+				!info->feature_persistent &&
 				rq_data_dir(info->shadow[req->u.rw.id].request));
 		}
 		info->shadow[req->u.rw.id].req = *req;
@@ -1226,7 +1349,7 @@ static void blkfront_connect(struct blkfront_info *info)
 	unsigned long sector_size;
 	unsigned int binfo;
 	int err;
-	int barrier, flush, discard;
+	int barrier, flush, discard, persistent;
 
 	switch (info->connected) {
 	case BLKIF_STATE_CONNECTED:
@@ -1297,6 +1420,19 @@ static void blkfront_connect(struct blkfront_info *info)
 		info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
 	}
 
+	/*
+	 * Are we dealing with an old blkback that will unmap
+	 * all grefs?
+	 */
+	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+			    "feature-persistent-grants", "%d", &persistent,
+			    NULL);
+
+	if (err)
+		info->feature_persistent = 0;
+	else
+		info->feature_persistent = persistent;
+
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
 			    "feature-discard", "%d", &discard,
 			    NULL);
diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
index ee338bf..cd267a9b 100644
--- a/include/xen/interface/io/blkif.h
+++ b/include/xen/interface/io/blkif.h
@@ -109,6 +109,15 @@ typedef uint64_t blkif_sector_t;
  */
 #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
 
+/*
+ * Maximum number of persistent grants that can be mapped by Dom0 for each
+ * interface. This is set to be the size of the ring, as this is a limit on
+ * the number of requests that can be inflight at any one time. 256 imposes
+ * an overhead of 11MB of mapped kernel space per interface.
+ */
+#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
+
+
 struct blkif_request_rw {
 	uint8_t        nr_segments;  /* number of segments */
 	blkif_vdev_t   handle;       /* only for read/write requests */
-- 
1.7.9.5
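The commit message leaves the gref lookup as a linear scan and flags a better data structure as future work. A minimal sketch of one option follows: an rbtree keyed on gref, using the kernel's rbtree API. This is illustrative only and not part of the patch above; it assumes struct pers_gnt gains a struct rb_node member (called node here) and that the per-interface state holds a struct rb_root instead of the pers_gnts array.

	#include <linux/rbtree.h>

	/* Hypothetical O(log n) replacement for get_pers_gnt()'s O(n) scan. */
	static struct pers_gnt *get_pers_gnt_rb(struct rb_root *root,
						grant_ref_t gref)
	{
		struct rb_node *n = root->rb_node;

		while (n) {
			struct pers_gnt *p = rb_entry(n, struct pers_gnt, node);

			if (gref < p->gnt)
				n = n->rb_left;
			else if (gref > p->gnt)
				n = n->rb_right;
			else
				return p;
		}
		return NULL;
	}

	/* Hypothetical replacement for add_pers_gnt(): walk to the insertion
	 * point ordered by gref, then link and rebalance. */
	static void add_pers_gnt_rb(struct rb_root *root, struct pers_gnt *new)
	{
		struct rb_node **link = &root->rb_node, *parent = NULL;

		while (*link) {
			struct pers_gnt *p = rb_entry(*link, struct pers_gnt,
						      node);

			parent = *link;
			link = new->gnt < p->gnt ? &parent->rb_left
						 : &parent->rb_right;
		}
		rb_link_node(&new->node, parent, link);
		rb_insert_color(&new->node, root);
	}

With at most 256 * 11 = 2816 persistent grants per device, this bounds each lookup at roughly 12 comparisons instead of a worst case of 2816.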
Konrad Rzeszutek Wilk
2012-Sep-19 11:17 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Wed, Sep 19, 2012 at 11:51:27AM +0100, Oliver Chick wrote:
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This
> allows the I/O performance to scale better when a large number of VMs
> are performing I/O.
> [...]
> In writing this patch, the question arises as to whether the additional
> cost of performing memcpys in the guest (to/from the pool of granted
> pages) outweighs the gains of not performing TLB shootdowns. The answer
> to that question is `no'. There appears to be very little, if any,
> additional cost to the guest of using persistent grants. There is
> perhaps a small saving, from the reduced number of hypercalls
> performed in granting, and ending foreign access.

I can't apply the patch to test it out?

patch -p1 < /tmp/blk
patching file drivers/block/xen-blkback/blkback.c
patch: **** malformed patch at line 213

> [...]
>>> On 19.09.12 at 12:51, Oliver Chick <oliver.chick@citrix.com> wrote:
> +/*
> + * Maximum number of persistent grants that can be mapped by Dom0 for each
> + * interface. This is set to be the size of the ring, as this is a limit on
> + * the number of requests that can be inflight at any one time. 256 imposes
> + * an overhead of 11MB of mapped kernel space per interface.
> + */
> +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256

How can this be a fixed number, especially in the context of ring size extensions? Iirc, single page rings can accommodate 64 requests, so I'd guess the number above was observed with a 4-page (order 2) ring...

Jan
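A minimal sketch of the alternative Jan is pointing at, sizing the limit from the actual ring geometry rather than hard-coding it. This is illustrative only; __CONST_RING_SIZE is the existing macro from xen's ring.h that blkfront already uses to derive BLK_RING_SIZE.

	/* Hypothetical: derive the persistent-grant limit from the real
	 * number of ring entries, so it stays correct if multi-page
	 * (order > 0) rings are ever negotiated. */
	#define BLK_RING_SIZE	__CONST_RING_SIZE(blkif, PAGE_SIZE)
	#define BLKIF_MAX_PERS_REQUESTS_PER_DEV	BLK_RING_SIZE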
Pasi Kärkkäinen
2012-Sep-19 13:16 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Wed, Sep 19, 2012 at 11:51:27AM +0100, Oliver Chick wrote:
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This
> allows the I/O performance to scale better when a large number of VMs
> are performing I/O.
> 
> Previously, the blkfront driver was supplied a bvec[] from the request
> queue. This was granted to dom0; dom0 performed the I/O, wrote
> directly into the grant-mapped memory, and unmapped it; blkfront then
> removed foreign access for that grant. The cost of unmapping scales
> badly with the number of CPUs in Dom0. An experiment showed that when
> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
> (at which point 650,000 IOPS are being performed in total). If more
> than 5 guests are used, the performance declines. By 10 guests, only
> 400,000 IOPS are being performed.
> 
> This patch improves performance by only unmapping when the connection
> between blkfront and back is broken.
> 

So how many IOPS can you get with this patch / persistent grants ?

-- 
Pasi
Konrad Rzeszutek Wilk
2012-Sep-19 14:11 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Wed, Sep 19, 2012 at 11:51:27AM +0100, Oliver Chick wrote:> This patch implements persistent grants for the xen-blk{front,back} > mechanism. The effect of this change is to reduce the number of unmap > operations performed, since they cause a (costly) TLB shootdown. This > allows the I/O performance to scale better when a large number of VMs > are performing I/O. > > Previously, the blkfront driver was supplied a bvec[] from the request > queue. This was granted to dom0; dom0 performed the I/O and wrote > directly into the grant-mapped memory and unmapped it; blkfront then > removed foreign access for that grant. The cost of unmapping scales > badly with the number of CPUs in Dom0. An experiment showed that when > Dom0 has 24 VCPUs, and guests are performing parallel I/O to a > ramdisk, the IPIs from performing unmap''s is a bottleneck at 5 guests > (at which point 650,000 IOPS are being performed in total). If more > than 5 guests are used, the performance declines. By 10 guests, only > 400,000 IOPS are being performed. > > This patch improves performance by only unmapping when the connection > between blkfront and back is broken. > > On startup, blk{front,back} use xenbus to communicate their ability to > perform ''feature-persistent''. Iff both ends have this ability, > persistent mode is used. > > To perform a read, in persistent mode, blkfront uses a separate pool > of pages that it maps to dom0. When a request comes in, blkfront > transmutes the request so that blkback will write into one of these > free pages. Blkback keeps note of which grefs it has already > mapped. When a new ring request comes to blkback, it looks to see if > it has already mapped that page. If so, it will not map it again. If > the page hasn''t been previously mapped, it is mapped now, and a record > is kept of this mapping. Blkback proceeds as usual. When blkfront is > notified that blkback has completed a request, it memcpy''s from the > shared memory, into the bvec supplied. A record that the {gref, page} > tuple is mapped, and not inflight is kept. > > Writes are similar, except that the memcpy is peformed from the > supplied bvecs, into the shared pages, before the request is put onto > the ring. > > Blkback has to store a mapping of grefs=>{page mapped to by gref} in > an array. As the grefs are not known apriori, and provide no > guarantees on their ordering, we have to perform a linear search > through this array to find the page, for every gref we receive. The > overhead of this is low, however future work might want to use a more > efficient data structure to reduce this O(n) operation. > > We (ijc, and myself) have introduced a new constant, > BLKIF_MAX_PERS_REQUESTS_PER_DEV. This is to prevent a malicious guest > from attempting a DoS, by supplying fresh grefs, causing the Dom0 > kernel from to map excessively. This is currently set to 256---the > maximum number of entires in the ring. As the number of inflight > requests <= size of ring, 256 is also the maximum sensible size. This > introduces a maximum overhead of 11MB of mapped memory, per block > device. In practice, we don''t typically use about 60 of these. If the > guest exceeds the 256 limit, it is either buggy or malicious. We treat > this in one of two ways: > 1) If we have mapped < > BLKIF_MAX_SEGMENTS_PER_REQUEST * BLKIF_MAX_PERS_REQUESTS_PER_DEV > pages, we will persistently map the grefs. This can occur is previous > requests have not used all BLKIF_MAX_SEGMENTS_PER_REQUEST segments. 
> 2) Otherwise, we revert to non-persistent grants for all future grefs. > > In writing this patch, the question arrises as to if the additional > cost of performing memcpys in the guest (to/from the pool of granted > pages) outweigh the gains of not performing TLB shootdowns. The answer > to that question is `no''. There appears to be very little, if any > additional cost to the guest of using persistent grants. There is > perhaps a small saving, from the reduced number of hypercalls > performed in granting, and ending foreign access. > > > Signed-off-by: Oliver Chick <oliver.chick@citrix.com> > --- > drivers/block/xen-blkback/blkback.c | 149 +++++++++++++++++++++++++--- > drivers/block/xen-blkback/common.h | 16 +++ > drivers/block/xen-blkback/xenbus.c | 21 +++- > drivers/block/xen-blkfront.c | 186 ++++++++++++++++++++++++++++++----- > include/xen/interface/io/blkif.h | 9 ++ > 5 files changed, 338 insertions(+), 43 deletions(-) > > diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c > index 73f196c..f95dee9 100644 > --- a/drivers/block/xen-blkback/blkback.c > +++ b/drivers/block/xen-blkback/blkback.c > @@ -78,6 +78,7 @@ struct pending_req { > unsigned short operation; > int status; > struct list_head free_list; > + u8 is_pers;Just do what you did in the blkfront.c: unsigned int feature_persistent:1 or: unsigned int flag_persistent:1 or: unsigned int flag_pers_gnt:1> }; > > #define BLKBACK_INVALID_HANDLE (~0) > @@ -128,6 +129,24 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > static void make_response(struct xen_blkif *blkif, u64 id, > unsigned short op, int st); > > +static void add_pers_gnt(struct pers_gnt *pers_gnt, struct xen_blkif *blkif) > +{ > + BUG_ON(blkif->pers_gnt_c >= BLKIF_MAX_PERS_REQUESTS_PER_DEV * > + BLKIF_MAX_SEGMENTS_PER_REQUEST); > + blkif->pers_gnts[blkif->pers_gnt_c++] = pers_gnt; > +} > + > +static struct pers_gnt *get_pers_gnt(struct xen_blkif *blkif, > + grant_ref_t gref) > +{ > + int i; > + > + for (i = 0; i < blkif->pers_gnt_c; i++) > + if (gref == blkif->pers_gnts[i]->gnt) > + return blkif->pers_gnts[i]; > + return NULL; > +} > + > /* > * Retrieve from the ''pending_reqs'' a free pending_req structure to be used. 
> +
> /*
>  * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>  */
> @@ -274,6 +293,12 @@ int xen_blkif_schedule(void *arg)
>  {
>         struct xen_blkif *blkif = arg;
>         struct xen_vbd *vbd = &blkif->vbd;
> +       struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +       struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +       struct pers_gnt *pers_gnt;
> +       int i;
> +       int ret = 0;
> +       int segs_to_unmap;
>
>         xen_blkif_get(blkif);
>
> @@ -301,6 +326,29 @@ int xen_blkif_schedule(void *arg)
>                         print_stats(blkif);
>         }
>
> +       /* Free all persistent grant pages */
> +
> +       while (segs_to_unmap = min(BLKIF_MAX_SEGMENTS_PER_REQUEST,
> +                                  blkif->pers_gnt_c)) {
> +
> +               for (i = 0; i < segs_to_unmap; i++) {
> +                       pers_gnt = blkif->pers_gnts[blkif->pers_gnt_c - i - 1];
> +
> +                       gnttab_set_unmap_op(&unmap[i],
> +                                           pfn_to_kaddr(page_to_pfn(
> +                                                   pers_gnt->page)),
> +                                           GNTMAP_host_map,
> +                                           pers_gnt->handle);
> +
> +                       pages[i] = pers_gnt->page;
> +               }
> +
> +               ret = gnttab_unmap_refs(unmap, pages, segs_to_unmap, false);
> +               BUG_ON(ret);
> +
> +               blkif->pers_gnt_c -= segs_to_unmap;
> +
> +       }
> +
>         if (log_stats)
>                 print_stats(blkif);
>
> @@ -343,13 +391,28 @@ static void xen_blkbk_unmap(struct pending_req *req)
>
>  static int xen_blkbk_map(struct blkif_request *req,
>                          struct pending_req *pending_req,
> -                        struct seg_buf seg[])
> +                        struct seg_buf seg[],
> +                        struct page *pages[])
>  {
>         struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +       struct pers_gnt *new_pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +       struct pers_gnt *pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +       struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +       struct pers_gnt *pers_gnt;
> +       phys_addr_t addr;
>         int i;
> +       int new_map;
>         int nseg = req->u.rw.nr_segments;
> +       int segs_to_init = 0;

segs_to_map is probably a better name.

>         int ret = 0;
> +       int use_pers_gnts;
>
> +       use_pers_gnts = (pending_req->blkif->can_grant_persist &&
> +                        pending_req->blkif->pers_gnt_c <
> +                        BLKIF_MAX_SEGMENTS_PER_REQUEST *
> +                        BLKIF_MAX_PERS_REQUESTS_PER_DEV);
> +
> +       pending_req->is_pers = use_pers_gnts;
>         /*
>          * Fill out preq.nr_sects with proper amount of sectors, and setup
>          * assign map[..] with the PFN of the page in our domain with the
> @@ -358,36 +421,87 @@ static int xen_blkbk_map(struct blkif_request *req,
>         for (i = 0; i < nseg; i++) {
>                 uint32_t flags;
>
> +               if (use_pers_gnts) {
> +                       pers_gnt = get_pers_gnt(pending_req->blkif,
> +                                               req->u.rw.seg[i].gref);
> +                       if (!pers_gnt) {
> +                               new_map = 1;
> +                               pers_gnt = kmalloc(sizeof(struct pers_gnt),
> +                                                  GFP_KERNEL);
> +                               if (!pers_gnt)
> +                                       return -ENOMEM;
> +                               pers_gnt->page = alloc_page(GFP_KERNEL);
> +                               if (!pers_gnt->page)
> +                                       return -ENOMEM;

You want to kfree pers_gnt here.

> +                               pers_gnt->gnt = req->u.rw.seg[i].gref;
> +
> +                               pages_to_gnt[segs_to_init] = pers_gnt->page;
> +                               new_pers_gnts[segs_to_init] = pers_gnt;
> +
> +                               add_pers_gnt(pers_gnt, pending_req->blkif);
> +
> +                       } else {
> +                               new_map = 0;
> +                       }
> +                       pages[i] = pers_gnt->page;
> +                       addr = (unsigned long) pfn_to_kaddr(
> +                               page_to_pfn(pers_gnt->page));

Would it make sense to cache this in the 'pers_gnt' structure?
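(If it were cached, struct pers_gnt could carry the kernel address next to
the page. A sketch only, which also uses grant_handle_t, a point that comes
up again further down:

        struct pers_gnt {
                struct page     *page;
                grant_ref_t     gnt;
                grant_handle_t  handle;
                uint64_t        dev_bus_addr;
                void            *kaddr; /* cached pfn_to_kaddr(page_to_pfn(page)) */
        };

Oliver notes later in the thread that the value is only needed once per
mapping, so the cache may not pay for itself.)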
>          */
>         if (atomic_dec_and_test(&pending_req->pendcnt)) {
> -               xen_blkbk_unmap(pending_req);
> +               if (!pending_req->is_pers)
> +                       xen_blkbk_unmap(pending_req);
>                 make_response(pending_req->blkif, pending_req->id,
>                               pending_req->operation, pending_req->status);
>                 xen_blkif_put(pending_req->blkif);
> @@ -590,6 +705,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>         int operation;
>         struct blk_plug plug;
>         bool drain = false;
> +       struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>
>         switch (req->operation) {
>         case BLKIF_OP_READ:
> @@ -676,7 +792,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>          * the hypercall to unmap the grants - that is all done in
>          * xen_blkbk_unmap.
>          */
> -       if (xen_blkbk_map(req, pending_req, seg))
> +       if (xen_blkbk_map(req, pending_req, seg, pages))
>                 goto fail_flush;
>
>         /*
> @@ -688,7 +804,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>         for (i = 0; i < nseg; i++) {
>                 while ((bio == NULL) ||
>                        (bio_add_page(bio,
> -                                    blkbk->pending_page(pending_req, i),

Can we get rid of pending_page macro?

> +                                    pages[i],
>                                      seg[i].nsec << 9,
>                                      seg[i].buf & ~PAGE_MASK) == 0)) {
>
> @@ -743,7 +859,8 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>         return 0;
>
>  fail_flush:
> -       xen_blkbk_unmap(pending_req);
> +       if (!blkif->can_grant_persist)
> +               xen_blkbk_unmap(pending_req);
>  fail_response:
>         /* Haven't submitted any bio's yet. */
>         make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index 9ad3b5e..1ecb709 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -47,6 +47,8 @@
>  #define DPRINTK(fmt, args...)                          \
>         pr_debug(DRV_PFX "(%s:%d) " fmt ".\n",          \
>                  __func__, __LINE__, ##args)
> +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256

You need a comment explaining why this number.

> +
>
>
>  /* Not a real protocol. Used to generate ring structs which contain
> @@ -164,6 +166,14 @@ struct xen_vbd {
>
>  struct backend_info;
>
> +
> +struct pers_gnt {
> +       struct page *page;
> +       grant_ref_t gnt;
> +       uint32_t handle;

Not grant_handle_t ?

> +       uint64_t dev_bus_addr;
> +};
> +
>  struct xen_blkif {
>         /* Unique identifier for this interface. */
>         domid_t                 domid;
> @@ -190,6 +200,12 @@ struct xen_blkif {
>         struct task_struct      *xenblkd;
>         unsigned int            waiting_reqs;
>
> +       /* frontend feature information */
> +       u8 can_grant_persist:1;

Pls move that to the vbd structure, and if you feel
especially adventurous, you could modify the

        bool flush_support
        bool discard_secure

to be 'unsigned int feature_flush:1', etc.. as its own
separate patch and then introduce the

        unsigned int feature_grnt_pers:1

flag.
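(For concreteness, a sketch of the layout being suggested; only
feature_flush and feature_grnt_pers are named in the comment above, the
discard name is a guess:

        struct xen_vbd {
                /* ... existing members unchanged ... */
                unsigned int    feature_flush:1;
                unsigned int    feature_discard_secure:1;
                unsigned int    feature_grnt_pers:1;
        };
)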
> +       struct pers_gnt *pers_gnts[BLKIF_MAX_PERS_REQUESTS_PER_DEV *
> +                                  BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +       unsigned int pers_gnt_c;
> +
>         /* statistics */
>         unsigned long           st_print;
>         int                     st_rd_req;
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 4f66171..dc58cc4 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -681,6 +681,13 @@ again:
>                 goto abort;
>         }
>
> +       err = xenbus_printf(xbt, dev->nodename, "feature-persistent-grants",
> +                           "%d", 1);
> +       if (err) {
> +               xenbus_dev_fatal(dev, err, "writing persistent capability");

It is not fatal. Just do dev_warn, like the xen_blkbk_discard does.
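(i.e., something along the lines of the feature-discard handling; a
sketch:

        err = xenbus_printf(xbt, dev->nodename, "feature-persistent-grants",
                            "%d", 1);
        if (err)
                dev_warn(&dev->dev, "writing feature-persistent-grants (%d)",
                         err);

and no goto abort, so the rest of the transaction still goes through.)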
Perhaps tmp_page ?

> +       struct gnt_list *gnt_list_entry;
>         struct scatterlist *sg;
>
>         if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>                 return 1;
>
> -       if (gnttab_alloc_grant_references(
> -               BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> -               gnttab_request_free_callback(
> -                       &info->callback,
> -                       blkif_restart_queue_callback,
> -                       info,
> -                       BLKIF_MAX_SEGMENTS_PER_REQUEST);
> -               return 1;
> +       use_pers_gnts = info->feature_persistent;
> +
> +       if (info->persistent_grants_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> +               new_pers_gnts = 1;
> +               if (gnttab_alloc_grant_references(
> +                       BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> +                       gnttab_request_free_callback(
> +                               &info->callback,
> +                               blkif_restart_queue_callback,
> +                               info,
> +                               BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +                       return 1;
> +               }
>         } else
> +               new_pers_gnts = 0;
>
>         /* Fill out a communications ring structure. */
>         ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> @@ -341,20 +366,66 @@ static int blkif_queue_request(struct request *req)
>                         BLKIF_MAX_SEGMENTS_PER_REQUEST);
>
>         for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> -               buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
>                 fsect = sg->offset >> 9;
>                 lsect = fsect + (sg->length >> 9) - 1;
> -               /* install a grant reference. */
> -               ref = gnttab_claim_grant_reference(&gref_head);
> -               BUG_ON(ref == -ENOSPC);
>
> -               gnttab_grant_foreign_access_ref(
> +               if (use_pers_gnts && info->persistent_grants_c) {
> +                       gnt_list_entry = info->persistent_grants_front;
> +
> +                       info->persistent_grants_front =
> +                               info->persistent_grants_front->tail;
> +                       ref = gnt_list_entry->gref;
> +                       buffer_mfn = pfn_to_mfn(gnt_list_entry->frame);

Ah, so frame is a PFN! why don't you just call it that?

> +                       info->persistent_grants_c--;
> +               } else {
> +                       ref = gnttab_claim_grant_reference(&gref_head);
> +                       BUG_ON(ref == -ENOSPC);
> +
> +                       if (use_pers_gnts) {
> +                               gnt_list_entry =
> +                                       kmalloc(sizeof(struct gnt_list),
> +                                               GFP_ATOMIC);
> +                               if (!gnt_list_entry)
> +                                       return -ENOMEM;
> +
> +                               shared_page = alloc_page(GFP_ATOMIC);
> +                               if (!shared_page)
> +                                       return -ENOMEM;

Make sure to kfree gnt_list_entry.

> +
> +                               gnt_list_entry->frame =
> +                                       page_to_pfn(shared_page);
> +                               gnt_list_entry->gref = ref;
> +                       } else
> +                               shared_page = sg_page(sg);
> +
> +                       buffer_mfn = pfn_to_mfn(page_to_pfn(
> +                               shared_page));
> +                       gnttab_grant_foreign_access_ref(
>                                 ref,
>                                 info->xbdev->otherend_id,
>                                 buffer_mfn,
> -                               rq_data_dir(req));
> +                               !use_pers_gnts && rq_data_dir(req));
> +               }
> +
> +               if (use_pers_gnts)
> +                       info->shadow[id].grants_used[i] =
> +                               gnt_list_entry;
> +
> +               if (use_pers_gnts && rq_data_dir(req)) {

You can declare the 'shared_data' here:

        void *shared_data;

and also bvec_data.

> +                       shared_data = kmap_atomic(
> +                               pfn_to_page(gnt_list_entry->frame));
> +                       bvec_data = kmap_atomic(sg_page(sg));
> +
> +                       memcpy(shared_data + sg->offset,
> +                              bvec_data + sg->offset,
> +                              sg->length);

Do we need to worry about it spilling over a page? Should we check that
sg->offset + sg->length < PAGE_SIZE ?

Also this would imply that based on the offset (so say it is 3999) that
the old data (0->3998) might be still there - don't know how important
that is?
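(The check being suggested would be a one-liner ahead of the copy; a
sketch:

        BUG_ON(sg->offset + sg->length > PAGE_SIZE);
        shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->frame));
        bvec_data = kmap_atomic(sg_page(sg));
        memcpy(shared_data + sg->offset, bvec_data + sg->offset, sg->length);

since a scatterlist segment that crossed the page boundary would make the
memcpy run off the end of the shared page.)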
> +
> +                       kunmap_atomic(bvec_data);
> +                       kunmap_atomic(shared_data);
> +               }
>
>                 info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
> +
>                 ring_req->u.rw.seg[i] =
>                         (struct blkif_request_segment) {
>                                 .gref       = ref,
> @@ -368,7 +439,8 @@ static int blkif_queue_request(struct request *req)
>         /* Keep a private copy so we can reissue requests when recovering.
>          */
>         info->shadow[id].req = *ring_req;
>
> -       gnttab_free_grant_references(gref_head);
> +       if (new_pers_gnts)
> +               gnttab_free_grant_references(gref_head);
>
>         return 0;
>  }
> @@ -480,12 +552,13 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>  static void xlvbd_flush(struct blkfront_info *info)
>  {
>         blk_queue_flush(info->rq, info->feature_flush);
> -       printk(KERN_INFO "blkfront: %s: %s: %s\n",
> +       printk(KERN_INFO "blkfront: %s: %s: %s. Persistent=%d\n",

Just use %s like the rest

>                info->gd->disk_name,
>                info->flush_op == BLKIF_OP_WRITE_BARRIER ?
>                "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
>                "flush diskcache" : "barrier or flush"),
> -              info->feature_flush ? "enabled" : "disabled");
> +              info->feature_flush ? "enabled" : "disabled",
> +              info->feature_persistent);

and this can be 'info->feature_persistent ? "persistent" : ""'.
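(Putting both suggestions together, the message could read, sketch:

        printk(KERN_INFO "blkfront: %s: %s: %s %s\n",
               info->gd->disk_name,
               info->flush_op == BLKIF_OP_WRITE_BARRIER ?
               "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
               "flush diskcache" : "barrier or flush"),
               info->feature_flush ? "enabled" : "disabled",
               info->feature_persistent ? "persistent" : "");
)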
>                  */
> +               for (i = 0; i < s->req.u.rw.nr_segments; i++)
> +                       gnttab_end_foreign_access(s->req.u.rw.seg[i].gref,
> +                                                 0, 0UL);
> +               return;
> +       }
> +

Move the 'i = 0' to here.

> +       if (bret->operation == BLKIF_OP_READ)
> +               rq_for_each_segment(bvec, s->request, iter) {
> +                       shared_data = kmap_atomic
> +                               (pfn_to_page(s->grants_used[i++]->frame));
> +                       bvec_data = bvec_kmap_irq(bvec, &flags);
> +                       memcpy(bvec_data, shared_data + bvec->bv_offset,
> +                              bvec->bv_len);
> +                       bvec_kunmap_irq(bvec_data, &flags);
> +                       kunmap_atomic(shared_data);
> +               }
> +       /* Add the persistent grant into the list of free grants */
> +       for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> +               new_gnt_list_entry = s->grants_used[i];
> +               new_gnt_list_entry->tail = info->persistent_grants_front;
> +               info->persistent_grants_front = new_gnt_list_entry;
> +               info->persistent_grants_c++;
> +       }
>  }
>
>  static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> @@ -783,7 +898,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>                 req  = info->shadow[id].request;
>
>                 if (bret->operation != BLKIF_OP_DISCARD)
> -                       blkif_completion(&info->shadow[id]);
> +                       blkif_completion(&info->shadow[id], info, bret);
>
>                 if (add_id_to_freelist(info, id)) {
>                         WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> @@ -943,6 +1058,12 @@ again:
>                 message = "writing protocol";
>                 goto abort_transaction;
>         }
> +       err = xenbus_printf(xbt, dev->nodename,
> +                           "feature-persistent-grants", "%d", 1);
> +       if (err) {
> +               message = "Writing persistent grants front feature";
> +               goto abort_transaction;

I wouldn't abort. Just continue on using the non-grant feature.

> +       }
>
>         err = xenbus_transaction_end(xbt, 0);
>         if (err) {
> @@ -1030,6 +1151,7 @@ static int blkfront_probe(struct xenbus_device *dev,
>         spin_lock_init(&info->io_lock);
>         info->xbdev = dev;
>         info->vdevice = vdevice;
> +       info->persistent_grants_c = 0;
>         info->connected = BLKIF_STATE_DISCONNECTED;
>         INIT_WORK(&info->work, blkif_restart_queue);
>
> @@ -1094,6 +1216,7 @@ static int blkif_recover(struct blkfront_info *info)
>                                 req->u.rw.seg[j].gref,
>                                 info->xbdev->otherend_id,
>                                 pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> +                               !info->feature_persistent &&
>                                 rq_data_dir(info->shadow[req->u.rw.id].request));
>                 }
>                 info->shadow[req->u.rw.id].req = *req;
> @@ -1226,7 +1349,7 @@ static void blkfront_connect(struct blkfront_info *info)
>         unsigned long sector_size;
>         unsigned int binfo;
>         int err;
> -       int barrier, flush, discard;
> +       int barrier, flush, discard, persistent;
>
>         switch (info->connected) {
>         case BLKIF_STATE_CONNECTED:
> @@ -1297,6 +1420,19 @@ static void blkfront_connect(struct blkfront_info *info)
>                 info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
>         }
>
> +       /*
> +        * Are we dealing with an old blkback that will unmap
> +        * all grefs?
> +        */
> +       err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> +                           "feature-persistent-grants", "%d", &persistent,
> +                           NULL);
> +
> +       if (err)
> +               info->feature_persistent = 0;
> +       else
> +               info->feature_persistent = persistent;
> +
>         err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
>                             "feature-discard", "%d", &discard,
>                             NULL);
> diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
> index ee338bf..cd267a9b 100644
> --- a/include/xen/interface/io/blkif.h
> +++ b/include/xen/interface/io/blkif.h
> @@ -109,6 +109,15 @@ typedef uint64_t blkif_sector_t;
>   */
>  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
>
> +/*
> + * Maximum number of persistent grants that can be mapped by Dom0 for each

by blkback

> + * interface. This is set to be the size of the ring, as this is a limit on

Size of the ring? You sure?

> + * the number of requests that can be inflight at any one time. 256 imposes
> + * an overhead of 11MB of mapped kernel space per interface.
> + */
> +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
> +
> +
>  struct blkif_request_rw {
>         uint8_t        nr_segments;  /* number of segments */
>         blkif_vdev_t   handle;       /* only for read/write requests */
> --
> 1.7.9.5
Ah yes, you're right - rings can accommodate 64 requests. ijc and I got
muddled. It also explains the comment in my summary that we only use the
first 60-ish persistent grants!

With multipage rings, this will probably become a function, rather than
a constant. Does that sound sensible?

On Wed, 2012-09-19 at 13:03 +0100, Jan Beulich wrote:
> >>> On 19.09.12 at 12:51, Oliver Chick <oliver.chick@citrix.com> wrote:
> > +/*
> > + * Maximum number of persistent grants that can be mapped by Dom0 for each
> > + * interface. This is set to be the size of the ring, as this is a limit on
> > + * the number of requests that can be inflight at any one time. 256 imposes
> > + * an overhead of 11MB of mapped kernel space per interface.
> > + */
> > +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
>
> How can this be a fixed number, especially in the context of ring
> size extensions? Iirc, single page rings can accommodate 64
> requests, so I'd guess the number above was observed with a
> 4-page (order 2) ring...
>
> Jan
>>> On 20.09.12 at 10:51, Oliver Chick <oliver.chick@citrix.com> wrote:
> Ah yes, you're right - rings can accommodate 64 requests. ijc and I got
> muddled. It also explains the comment in my summary that we only use
> the first 60-ish persistent grants!
>
> With multipage rings, this will probably become a function, rather than
> a constant. Does that sound sensible?

Yes, that's what I was expecting. Whether a separate function/
macro is necessary is unclear, as it's simply __RING_SIZE().

Jan
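(For concreteness, one possible shape for such a function, derived from
the existing ring macros; a sketch only, not part of the posted patch,
and blkif_max_persistent_reqs is a made-up name:

        #include <xen/interface/io/ring.h>

        static inline unsigned int blkif_max_persistent_reqs(unsigned int ring_pages)
        {
                /* A single-page ring gives 64 requests; multi-page rings scale it. */
                return __CONST_RING_SIZE(blkif, PAGE_SIZE * ring_pages);
        }

This is the same macro blkfront already uses to size BLK_RING_SIZE for a
single-page ring.)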
I have attached a graph that shows the results of my benchmarking.

The setup is:
-Xeon X5650
-32GB of Ram
-Xen 4.2
-Linux 3.5.0 for dom0 and domus
-Dom0 has 24 CPUs.
-Each guest has a (separate) xvdb backed by a 1GB ramdisk (/dev/ramX) in
dom0. No LVM on these.
-The setup is that initially 1 guest does an fio sequential read of the
ramdisk, then 2, then 3 etc.
-The y axis is the sum of the iops, as reported by the guests' fio
output.

On Wed, 2012-09-19 at 14:16 +0100, Pasi Kärkkäinen wrote:
> On Wed, Sep 19, 2012 at 11:51:27AM +0100, Oliver Chick wrote:
> > This patch implements persistent grants for the xen-blk{front,back}
> > mechanism. The effect of this change is to reduce the number of unmap
> > operations performed, since they cause a (costly) TLB shootdown. This
> > allows the I/O performance to scale better when a large number of VMs
> > are performing I/O.
> [...]
>
> So how many IOPS can you get with this patch / persistent grants ?
>
> -- Pasi

[Attachment: pers-vs-nonpers.pdf]
kV0idrwaASzJJtkyA1zN9XNgPKW2BgTa3q+pacbNA8S1NdiazoLVaHMPJMReJfpMdQ91sdhx88gm YhiFg0dnHg2DhNx3AVUoblO55LedXDF27B6gFwewbseje3Q8mqj3gYY/TiCtjWUyTo1MHWfcPLRl LkWxXRaMX/NQTFYmhAauo0/FsYl3x6uxSaNr2WVumryv7aRGTcEaqH9xD+SFdMbugTzgI+FFF5+9 HataetwDeeH2h1byc1Rv8h03j9HUjVHJvkvH1l0bp2xyW3xfdT7UpWTHzSPbS7NzTM5Hx+oXcA/k pbY1Jl9jcj3ZsXtM7e9RbOaZWzDqVrqQyFHwQrXGbGNJyEVlxyupQcomIRatYRxzWU+u1mh1sqd+ lsUeODrKxQDVgW8yatRfAk6HlHRo4bigTTqu7R10iCLvz+A7Wg7Lo+Qzia33tBL8OaWDi9Z2Pm5D 6eb6/RF6o4fTMlhvL+8OHw/3l5/ul7Y8/AtaU3BbZW5kc3RyZWFtCmVuZG9iagozIDAgb2JqCjw8 IC9UeXBlIC9QYWdlcyAvS2lkcyBbIDcgMCBSIF0gL0NvdW50IDEgL01lZGlhQm94IFswIDAgNTA0 IDUwNF0gPj4KZW5kb2JqCjQgMCBvYmoKPDwKL1Byb2NTZXQgWy9QREYgL1RleHRdCi9Gb250IDw8 IC9GMSAxMCAwIFIgL0YyIDExIDAgUiAvRjMgMTIgMCBSID4+Ci9FeHRHU3RhdGUgPDwgPj4KL0Nv bG9yU3BhY2UgPDwgL3NSR0IgNSAwIFIgPj4KPj4KZW5kb2JqCjUgMCBvYmoKWy9JQ0NCYXNlZCA2 IDAgUl0KZW5kb2JqCjYgMCBvYmoKPDwgL0FsdGVybmF0ZSAvRGV2aWNlUkdCIC9OIDMgL0xlbmd0 aCAyNTk2IC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4nJ2Wd1RT2RaHz703vVCSEIqU 0GtoUgJIDb1IkS4qMQkQSsCQACI2RFRwRFGRpggyKOCAo0ORsSKKhQFRsesEGUTUcXAUG5ZJZK0Z 37x5782b3x/3fmufvc/dZ+991roAkPyDBcJMWAmADKFYFOHnxYiNi2dgBwEM8AADbADgcLOzQhb4 RgKZAnzYjGyZE/gXvboOIPn7KtM/jMEA/5+UuVkiMQBQmIzn8vjZXBkXyTg9V5wlt0/JmLY0Tc4w Ss4iWYIyVpNz8ixbfPaZZQ858zKEPBnLc87iZfDk3CfjjTkSvoyRYBkX5wj4uTK+JmODdEmGQMZv 5LEZfE42ACiS3C7mc1NkbC1jkigygi3jeQDgSMlf8NIvWMzPE8sPxc7MWi4SJKeIGSZcU4aNkxOL 4c/PTeeLxcwwDjeNI+Ix2JkZWRzhcgBmz/xZFHltGbIiO9g4OTgwbS1tvijUf138m5L3dpZehH/u GUQf+MP2V36ZDQCwpmW12fqHbWkVAF3rAVC7/YfNYC8AirK+dQ59cR66fF5SxOIsZyur3NxcSwGf aykv6O/6nw5/Q198z1K+3e/lYXjzkziSdDFDXjduZnqmRMTIzuJw+Qzmn4f4Hwf+dR4WEfwkvogv lEVEy6ZMIEyWtVvIE4gFmUKGQPifmvgPw/6k2bmWidr4EdCWWAKlIRpAfh4AKCoRIAl7ZCvQ730L xkcD+c2L0ZmYnfvPgv59V7hM/sgWJH+OY0dEMrgSUc7smvxaAjQgAEVAA+pAG+gDE8AEtsARuAAP 4AMCQSiIBHFgMeCCFJABRCAXFIC1oBiUgq1gJ6gGdaARNIM2cBh0gWPgNDgHLoHLYATcAVIwDp6A KfAKzEAQhIXIEBVSh3QgQ8gcsoVYkBvkAwVDEVAclAglQ0JIAhVA66BSqByqhuqhZuhb6Ch0GroA DUO3oFFoEvoVegcjMAmmwVqwEWwFs2BPOAiOhBfByfAyOB8ugrfAlXADfBDuhE/Dl+ARWAo/gacR gBAROqKLMBEWwkZCkXgkCREhq5ASpAJpQNqQHqQfuYpIkafIWxQGRUUxUEyUC8ofFYXiopahVqE2 o6pRB1CdqD7UVdQoagr1EU1Ga6LN0c7oAHQsOhmdiy5GV6Cb0B3os+gR9Dj6FQaDoWOMMY4Yf0wc JhWzArMZsxvTjjmFGcaMYaaxWKw61hzrig3FcrBibDG2CnsQexJ7BTuOfYMj4nRwtjhfXDxOiCvE VeBacCdwV3ATuBm8Et4Q74wPxfPwy/Fl+EZ8D34IP46fISgTjAmuhEhCKmEtoZLQRjhLuEt4QSQS 9YhOxHCigLiGWEk8RDxPHCW+JVFIZiQ2KYEkIW0h7SedIt0ivSCTyUZkD3I8WUzeQm4mnyHfJ79R oCpYKgQo8BRWK9QodCpcUXimiFc0VPRUXKyYr1iheERxSPGpEl7JSImtxFFapVSjdFTphtK0MlXZ RjlUOUN5s3KL8gXlRxQsxYjiQ+FRiij7KGcoY1SEqk9lU7nUddRG6lnqOA1DM6YF0FJppbRvaIO0 KRWKip1KtEqeSo3KcRUpHaEb0QPo6fQy+mH6dfo7VS1VT1W+6ibVNtUrqq/V5qh5qPHVStTa1UbU 3qkz1H3U09S3qXep39NAaZhphGvkauzROKvxdA5tjssc7pySOYfn3NaENc00IzRXaO7THNCc1tLW 8tPK0qrSOqP1VJuu7aGdqr1D+4T2pA5Vx01HoLND56TOY4YKw5ORzqhk9DGmdDV1/XUluvW6g7oz esZ6UXqFeu169/QJ+iz9JP0d+r36UwY6BiEGBQatBrcN8YYswxTDXYb9hq+NjI1ijDYYdRk9MlYz DjDON241vmtCNnE3WWbSYHLNFGPKMk0z3W162Qw2szdLMasxGzKHzR3MBea7zYct0BZOFkKLBosb TBLTk5nDbGWOWtItgy0LLbssn1kZWMVbbbPqt/pobW+dbt1ofceGYhNoU2jTY/OrrZkt17bG9tpc 8lzfuavnds99bmdux7fbY3fTnmofYr/Bvtf+g4Ojg8ihzWHS0cAx0bHW8QaLxgpjbWadd0I7eTmt djrm9NbZwVnsfNj5FxemS5pLi8ujecbz+PMa54256rlyXOtdpW4Mt0S3vW5Sd113jnuD+wMPfQ+e R5PHhKepZ6rnQc9nXtZeIq8Or9dsZ/ZK9ilvxNvPu8R70IfiE+VT7XPfV8832bfVd8rP3m+F3yl/ tH+Q/zb/GwFaAdyA5oCpQMfAlYF9QaSgBUHVQQ+CzYJFwT0hcEhgyPaQu/MN5wvnd4WC0IDQ7aH3 wozDloV9H44JDwuvCX8YYRNRENG/gLpgyYKWBa8ivSLLIu9EmURJonqjFaMTopujX8d4x5THSGOt YlfGXorTiBPEdcdj46Pjm+KnF/os3LlwPME+oTjh+iLjRXmLLizWWJy++PgSxSWcJUcS0YkxiS2J 7zmhnAbO9NKApbVLp7hs7i7uE54Hbwdvku/KL+dPJLkmlSc9SnZN3p48meKeUpHyVMAWVAuep/qn 
1qW+TgtN25/2KT0mvT0Dl5GYcVRIEaYJ+zK1M/Myh7PMs4qzpMucl+1cNiUKEjVlQ9mLsrvFNNnP 1IDERLJeMprjllOT8yY3OvdInnKeMG9gudnyTcsn8n3zv16BWsFd0VugW7C2YHSl58r6VdCqpat6 V+uvLlo9vsZvzYG1hLVpa38otC4sL3y5LmZdT5FW0ZqisfV+61uLFYpFxTc2uGyo24jaKNg4uGnu pqpNH0t4JRdLrUsrSt9v5m6++JXNV5VffdqStGWwzKFsz1bMVuHW69vctx0oVy7PLx/bHrK9cwdj R8mOlzuX7LxQYVdRt4uwS7JLWhlc2V1lULW16n11SvVIjVdNe61m7aba17t5u6/s8djTVqdVV1r3 bq9g7816v/rOBqOGin2YfTn7HjZGN/Z/zfq6uUmjqbTpw37hfumBiAN9zY7NzS2aLWWtcKukdfJg wsHL33h/093GbKtvp7eXHgKHJIcef5v47fXDQYd7j7COtH1n+F1tB7WjpBPqXN451ZXSJe2O6x4+ Gni0t8elp+N7y+/3H9M9VnNc5XjZCcKJohOfTuafnD6Vderp6eTTY71Leu+ciT1zrS+8b/Bs0Nnz 53zPnen37D953vX8sQvOF45eZF3suuRwqXPAfqDjB/sfOgYdBjuHHIe6Lztd7hmeN3ziivuV01e9 r567FnDt0sj8keHrUddv3ki4Ib3Ju/noVvqt57dzbs/cWXMXfbfkntK9ivua9xt+NP2xXeogPT7q PTrwYMGDO2PcsSc/Zf/0frzoIflhxYTORPMj20fHJn0nLz9e+Hj8SdaTmafFPyv/XPvM5Nl3v3j8 MjAVOzX+XPT806+bX6i/2P/S7mXvdNj0/VcZr2Zel7xRf3PgLett/7uYdxMzue+x7ys/mH7o+Rj0 8e6njE+ffgP3hPP7ZW5kc3RyZWFtCmVuZG9iago5IDAgb2JqCjw8Ci9UeXBlIC9FbmNvZGluZyAv QmFzZUVuY29kaW5nIC9XaW5BbnNpRW5jb2RpbmcKL0RpZmZlcmVuY2VzIFsgNDUvbWludXMgOTYv cXVvdGVsZWZ0CjE0NC9kb3RsZXNzaSAvZ3JhdmUgL2FjdXRlIC9jaXJjdW1mbGV4IC90aWxkZSAv bWFjcm9uIC9icmV2ZSAvZG90YWNjZW50Ci9kaWVyZXNpcyAvLm5vdGRlZiAvcmluZyAvY2VkaWxs YSAvLm5vdGRlZiAvaHVuZ2FydW1sYXV0IC9vZ29uZWsgL2Nhcm9uIC9zcGFjZV0KPj4KZW5kb2Jq CjEwIDAgb2JqCjw8IC9UeXBlIC9Gb250IC9TdWJ0eXBlIC9UeXBlMSAvTmFtZSAvRjEgL0Jhc2VG b250IC9aYXBmRGluZ2JhdHMgPj4KZW5kb2JqCjExIDAgb2JqCjw8IC9UeXBlIC9Gb250IC9TdWJ0 eXBlIC9UeXBlMSAvTmFtZSAvRjIgL0Jhc2VGb250IC9IZWx2ZXRpY2EKL0VuY29kaW5nIDkgMCBS ID4+CmVuZG9iagoxMiAwIG9iago8PCAvVHlwZSAvRm9udCAvU3VidHlwZSAvVHlwZTEgL05hbWUg L0YzIC9CYXNlRm9udCAvSGVsdmV0aWNhLUJvbGQKL0VuY29kaW5nIDkgMCBSID4+CmVuZG9iagp4 cmVmCjAgMTMKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDAwMDIxIDAwMDAwIG4gCjAwMDAwMDAx NjQgMDAwMDAgbiAKMDAwMDAwMTUyOSAwMDAwMCBuIAowMDAwMDAxNjEyIDAwMDAwIG4gCjAwMDAw MDE3NDcgMDAwMDAgbiAKMDAwMDAwMTc4MCAwMDAwMCBuIAowMDAwMDAwMjEzIDAwMDAwIG4gCjAw MDAwMDAyOTMgMDAwMDAgbiAKMDAwMDAwNDQ3NSAwMDAwMCBuIAowMDAwMDA0NzMyIDAwMDAwIG4g CjAwMDAwMDQ4MTYgMDAwMDAgbiAKMDAwMDAwNDkxMyAwMDAwMCBuIAp0cmFpbGVyCjw8IC9TaXpl IDEzIC9JbmZvIDEgMCBSIC9Sb290IDIgMCBSID4+CnN0YXJ0eHJlZgo1MDE1CiUlRU9GCg= --=-WTNocMG3YNTG0rOLTsJV Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --=-WTNocMG3YNTG0rOLTsJV--
Pasi Kärkkäinen
2012-Sep-20 10:09 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Thu, Sep 20, 2012 at 10:35:33AM +0100, Oliver Chick wrote:
> I have attached a graph that shows the results of my benchmarking.
>
> The setup is:
> -Xeon X5650
> -32GB of Ram
> -Xen 4.2
> -Linux 3.5.0 for dom0 and domus
> -Dom0 has 24 CPUs.
> -Each guest has a (separate) xvdb backed by a 1GB ramdisk (/dev/ramX) in
> dom0. No LVM on these.
> -The setup is that initially 1 guest does an fio sequential read of the
> ramdisk, then 2, then 3 etc.
> -The y axis is the sum of the iops, as reported by the guests' fio
> output.

Thanks! The graph looks pretty nice :)

-- Pasi

> [...]
On 19/09/12 11:51, Oliver Chick wrote:
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism.
[...]
> We (ijc, and myself) have introduced a new constant,
> BLKIF_MAX_PERS_REQUESTS_PER_DEV. This is to prevent a malicious guest
> from attempting a DoS, by supplying fresh grefs, causing the Dom0
> kernel from to map excessively.
[...]
> 2) Otherwise, we revert to non-persistent grants for all future grefs.

Why fallback instead of immediately failing the request?

> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 73f196c..f95dee9 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -78,6 +78,7 @@ struct pending_req {
>         unsigned short          operation;
>         int                     status;
>         struct list_head        free_list;
> +       u8                      is_pers;

Using "pers" as an abbreviation for "persistent" isn't obvious. For
readability it may be better to spell it in full.

> +/*
> + * Maximum number of persistent grants that can be mapped by Dom0 for each
> + * interface. This is set to be the size of the ring, as this is a limit on
> + * the number of requests that can be inflight at any one time. 256 imposes
> + * an overhead of 11MB of mapped kernel space per interface.
> + */
> +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256

This 11MB per VBD seems like a lot. With 150 VMs each with 2 VBDs this
requires > 3 GB. Is this a scalability problem?

Does there need to be a mechanism to expire old maps in blkback?

David
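(For illustration, such an expiry pass could reuse the unmap pattern from
the teardown loop in xen_blkif_schedule. This is only a sketch: 'last_use'
is a hypothetical jiffies stamp that struct pers_gnt does not currently
have, the 60-second threshold is arbitrary, and serialization against
in-flight requests is ignored:

        #include <linux/jiffies.h>

        static void expire_idle_pers_gnts(struct xen_blkif *blkif)
        {
                struct gnttab_unmap_grant_ref unmap;
                struct page *page;
                int i;

                for (i = 0; i < blkif->pers_gnt_c; i++) {
                        struct pers_gnt *g = blkif->pers_gnts[i];

                        if (time_before(jiffies, g->last_use + 60 * HZ))
                                continue;       /* used recently; keep it mapped */
                        gnttab_set_unmap_op(&unmap,
                                            pfn_to_kaddr(page_to_pfn(g->page)),
                                            GNTMAP_host_map, g->handle);
                        page = g->page;
                        BUG_ON(gnttab_unmap_refs(&unmap, &page, 1, false));
                        /* compact the array: move the last entry into this slot */
                        blkif->pers_gnts[i--] = blkif->pers_gnts[--blkif->pers_gnt_c];
                        __free_page(page);
                        kfree(g);
                }
        }
)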
The memory overhead, and fallback mode points are related:

-Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
per device. I made a mistake (pointed out by Jan) as the maximum number
of requests that can fit into a single-page ring is 64, not 256.
-Clearly, this still scales linearly. So the problem of memory footprint
will occur with more VMs, or block devices.
-Whilst 2.75MB per device is probably acceptable (?), if we start using
multipage rings, then we might not want to have
BLKIF_MAX_PERS_REQUESTS_PER_DEV==__RING_SIZE, as this will cause the
memory overhead to increase. This is why I have implemented the
'fallback' mode. With a multipage ring, it seems reasonable to want the
first $x$ grefs seen by blkback to be treated as persistent, and any
later ones to be non-persistent. Does that seem sensible?

On Thu, 2012-09-20 at 11:34 +0100, David Vrabel wrote:
> On 19/09/12 11:51, Oliver Chick wrote:
> > This patch implements persistent grants for the xen-blk{front,back}
> > mechanism.
> [...]
> > We (ijc, and myself) have introduced a new constant,
> > BLKIF_MAX_PERS_REQUESTS_PER_DEV. This is to prevent a malicious guest
> > from attempting a DoS, by supplying fresh grefs, causing the Dom0
> > kernel from to map excessively.
> [...]
> > 2) Otherwise, we revert to non-persistent grants for all future grefs.
>
> Why fallback instead of immediately failing the request?
>
> [...]
>
> Using "pers" as an abbreviation for "persistent" isn't obvious. For
> readability it may be better to spell it in full.

Good point

> > +/*
> > + * Maximum number of persistent grants that can be mapped by Dom0 for each
> > + * interface. This is set to be the size of the ring, as this is a limit on
> > + * the number of requests that can be inflight at any one time. 256 imposes
> > + * an overhead of 11MB of mapped kernel space per interface.
> > + */
> > +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
>
> This 11MB per VBD seems like a lot. With 150 VMs each with 2 VBDs this
> requires > 3 GB. Is this a scalability problem?
>
> Does there need to be a mechanism to expire old maps in blkback?

When blkback closes, I unmap. Or do you mean that I could unmap if there
has been a spike in block-device activity, after which the mapped pages
are not getting used?

> David
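(For reference, the corrected arithmetic: 64 requests per single-page ring
x 11 segments per request (BLKIF_MAX_SEGMENTS_PER_REQUEST) x 4 KB per page
= 2816 KB, i.e. roughly 2.75MB per device. The 11MB figure in the patch
description corresponds to 256 x 11 x 4 KB.)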
>>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> The memory overhead, and fallback mode points are related:
> -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> per device. I made a mistake (pointed out by Jan) as the maximum number
> of requests that can fit into a single-page ring is 64, not 256.
> -Clearly, this still scales linearly. So the problem of memory footprint
> will occur with more VMs, or block devices.
> -Whilst 2.75MB per device is probably acceptable (?), if we start using
> multipage rings, then we might not want to have
> BLKIF_MAX_PERS_REQUESTS_PER_DEV==__RING_SIZE, as this will cause the
> memory overhead to increase. This is why I have implemented the
> 'fallback' mode. With a multipage ring, it seems reasonable to want the
> first $x$ grefs seen by blkback to be treated as persistent, and any
> later ones to be non-persistent. Does that seem sensible?

From a resource usage pov, perhaps. But this will get the guest
entirely unpredictable performance. Plus I don't think 11Mb of
_virtual_ space is unacceptable overhead in a 64-bit kernel. If you
really want/need this in a 32-bit one, then perhaps some other
alternatives would be needed (and persistent grants may not be the
right approach there in the first place).

Jan
Thank you for all your useful feedback, Konrad. I have made most of the
changes you suggest. I will test them and repost. I have made a few
points below.

On Wed, 2012-09-19 at 15:11 +0100, Konrad Rzeszutek Wilk wrote:
> On Wed, Sep 19, 2012 at 11:51:27AM +0100, Oliver Chick wrote:
> > This patch implements persistent grants for the xen-blk{front,back}
> > mechanism. The effect of this change is to reduce the number of unmap
> > operations performed, since they cause a (costly) TLB shootdown. This
> > allows the I/O performance to scale better when a large number of VMs
> > are performing I/O.
[...]
> > +                       pages[i] = pers_gnt->page;
> > +                       addr = (unsigned long) pfn_to_kaddr(
> > +                               page_to_pfn(pers_gnt->page));
>
> Would it make sense to cache this in the 'pers_gnt' structure?

As far as I can tell, we only need this value when mapping, and
unmapping. So if we cache it, we will use it a maximum of one time. I
think it's cheap to calculate. Am I right?

[...]
> >         for (i = 0; i < nseg; i++) {
> >                 while ((bio == NULL) ||
> >                        (bio_add_page(bio,
> > -                                    blkbk->pending_page(pending_req, i),
>
> Can we get rid of pending_page macro?

Unfortunately not, it is still used in the non-persistent mode to
populate the pages[].

[...]
> > +                       memcpy(shared_data + sg->offset,
> > +                              bvec_data + sg->offset,
> > +                              sg->length);
>
> Do we need to worry about it spilling over a page? Should we check that
> sg->offset + sg->length < PAGE_SIZE ?

I agree, this is probably a worthwhile thing to check.

> Also this would imply that based on the offset (so say it is 3999) that
> the old data (0->3998) might be still there - don't know how important
> that is?

This is true. I spoke with IanC about this, and we *think* that this is
ok. Any old data that is there will have already been given to blkback,
so we're not really leaking data that we shouldn't be.

[...]
> > +       i = 0;
> > +       if (info->feature_persistent == 0) {
> > +               /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same
> > +                * place flag.
> > +                */
> > +               for (i = 0; i < s->req.u.rw.nr_segments; i++)
> > +                       gnttab_end_foreign_access(s->req.u.rw.seg[i].gref,
> > +                                                 0, 0UL);
> > +               return;
> > +       }
> > +
>
> Move the 'i = 0' to here.
> > > + if (bret->operation == BLKIF_OP_READ) > > + rq_for_each_segment(bvec, s->request, iter) { > > + shared_data = kmap_atomic > > + (pfn_to_page(s->grants_used[i++]->frame)); > > + bvec_data = bvec_kmap_irq(bvec, &flags); > > + memcpy(bvec_data, shared_data + bvec->bv_offset, > > + bvec->bv_len); > > + bvec_kunmap_irq(bvec_data, &flags); > > + kunmap_atomic(shared_data); > > + } > > + /* Add the persistent grant into the list of free grants */ > > + for (i = 0; i < s->req.u.rw.nr_segments; i++) { > > + new_gnt_list_entry = s->grants_used[i]; > > + new_gnt_list_entry->tail = info->persistent_grants_front; > > + info->persistent_grants_front = new_gnt_list_entry; > > + info->persistent_grants_c++; > > + } > > } > > > > static irqreturn_t blkif_interrupt(int irq, void *dev_id) > > @@ -783,7 +898,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) > > req = info->shadow[id].request; > > > > if (bret->operation != BLKIF_OP_DISCARD) > > - blkif_completion(&info->shadow[id]); > > + blkif_completion(&info->shadow[id], info, bret); > > > > if (add_id_to_freelist(info, id)) { > > WARN(1, "%s: response to %s (id %ld) couldn''t be recycled!\n", > > @@ -943,6 +1058,12 @@ again: > > message = "writing protocol"; > > goto abort_transaction; > > } > > + err = xenbus_printf(xbt, dev->nodename, > > + "feature-persistent-grants", "%d", 1); > > + if (err) { > > + message = "Writing persistent grants front feature"; > > + goto abort_transaction; > > I wouldn''t abort. Just continue on using the non-grant feature. > > > + } > > > > err = xenbus_transaction_end(xbt, 0); > > if (err) { > > @@ -1030,6 +1151,7 @@ static int blkfront_probe(struct xenbus_device *dev, > > spin_lock_init(&info->io_lock); > > info->xbdev = dev; > > info->vdevice = vdevice; > > + info->persistent_grants_c = 0; > > info->connected = BLKIF_STATE_DISCONNECTED; > > INIT_WORK(&info->work, blkif_restart_queue); > > > > @@ -1094,6 +1216,7 @@ static int blkif_recover(struct blkfront_info *info) > > req->u.rw.seg[j].gref, > > info->xbdev->otherend_id, > > pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]), > > + !info->feature_persistent && > > rq_data_dir(info->shadow[req->u.rw.id].request)); > > } > > info->shadow[req->u.rw.id].req = *req; > > @@ -1226,7 +1349,7 @@ static void blkfront_connect(struct blkfront_info *info) > > unsigned long sector_size; > > unsigned int binfo; > > int err; > > - int barrier, flush, discard; > > + int barrier, flush, discard, persistent; > > > > switch (info->connected) { > > case BLKIF_STATE_CONNECTED: > > @@ -1297,6 +1420,19 @@ static void blkfront_connect(struct blkfront_info *info) > > info->flush_op = BLKIF_OP_FLUSH_DISKCACHE; > > } > > > > + /* > > + * Are we dealing with an old blkback that will unmap > > + * all grefs? 
> > + */ > > + err = xenbus_gather(XBT_NIL, info->xbdev->otherend, > > + "feature-persistent-grants", "%d", &persistent, > > + NULL); > > + > > + if (err) > > + info->feature_persistent = 0; > > + else > > + info->feature_persistent = persistent; > > + > > err = xenbus_gather(XBT_NIL, info->xbdev->otherend, > > "feature-discard", "%d", &discard, > > NULL); > > diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h > > index ee338bf..cd267a9b 100644 > > --- a/include/xen/interface/io/blkif.h > > +++ b/include/xen/interface/io/blkif.h > > @@ -109,6 +109,15 @@ typedef uint64_t blkif_sector_t; > > */ > > #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11 > > > > +/* > > + * Maximum number of persistent grants that can be mapped by Dom0 for each > > by blkback > > + * interface. This is set to be the size of the ring, as this is a limit on > > Size of the ring? You sure? > > + * the number of requests that can be inflight at any one time. 256 imposes > > + * an overhead of 11MB of mapped kernel space per interface. > > + */ > > +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256 > > + > > + > > struct blkif_request_rw { > > uint8_t nr_segments; /* number of segments */ > > blkif_vdev_t handle; /* only for read/write requests */ > > -- > > 1.7.9.5
Konrad Rzeszutek Wilk
2012-Sep-20 13:49 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
> >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> > The memory overhead, and fallback mode points are related:
> > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> > per device. I made a mistake (pointed out by Jan) as the maximum number
> > of requests that can fit into a single-page ring is 64, not 256.
> > -Clearly, this still scales linearly. So the problem of memory footprint
> > will occur with more VMs, or block devices.
> > -Whilst 2.75MB per device is probably acceptable (?), if we start using
> > multipage rings, then we might not want to have
> > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
> > memory overhead to increase. This is why I have implemented the
> > 'fallback' mode. With a multipage ring, it seems reasonable to want the
> > first $x$ grefs seen by blkback to be treated as persistent, and any
> > later ones to be non-persistent. Does that seem sensible?
>
> From a resource usage pov, perhaps. But this will get the guest
> entirely unpredictable performance. Plus I don't think 11Mb of

Wouldn't it fall back to the older performance?

> _virtual_ space is unacceptable overhead in a 64-bit kernel. If
> you really want/need this in a 32-bit one, then perhaps some
> other alternatives would be needed (and persistent grants may
> not be the right approach there in the first place).
>
> Jan
>
Konrad Rzeszutek Wilk
2012-Sep-20 13:52 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Thu, Sep 20, 2012 at 02:12:25PM +0100, Oliver Chick wrote:
> Thank you for all your useful feedback, Konrad.
>
> I have made most of the changes you suggest. I will test them and
> repost. I have made a few points below.

Great. Looking forward to your next patch. Also pls include it as an
attachment as I could not apply it last time.. Not sure if that is
something with your mailer - but attachments usually survive any MUA
mangling.

> > > +		page_to_pfn(pers_gnt->page));
> >
> > Would it make sense to cache this in the 'pers_gnt' structure?
>
> As far as I can tell, we only need this value when mapping, and
> unmapping. So if we cache it, we will use it a maximum of one time. I
> think it's cheap to calculate. Am I right?

Then let's not cache it.

> > > @@ -688,7 +804,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> > >  	for (i = 0; i < nseg; i++) {
> > >  		while ((bio == NULL) ||
> > >  		       (bio_add_page(bio,
> > > -				     blkbk->pending_page(pending_req, i),
> >
> > Can we get rid of pending_page macro?
>
> Unfortunately not, it is still used in the non-persistent mode to
> populate the pages[].

Fair enough

> > > +			memcpy(shared_data + sg->offset,
> > > +			       bvec_data + sg->offset,
> > > +			       sg->length);
> >
> > Do we need to worry about it spilling over a page? Should we check that
> > sg->offset + sg->length < PAGE_SIZE ?
>
> I agree, this is probably a worthwhile thing to check.
>
> > Also this would imply that based on the offset (so say it is 3999) that the old data (0->3998)
> > might be still there - don't know how important that is?
>
> This is true. I spoke with IanC about this, and we *think* that this is
> ok. Any old data that is there will have already been given to blkback,
> so we're not really leaking data that we shouldn't be.

Put a comment in saying that. In case in the future we want to share
the persistent grants with other domains, we would need to address
this.
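[A minimal sketch of what that check and comment could look like in
blkif_queue_request(); the names follow the patch, but this is
illustrative only, not the code Oliver posted:]

	/*
	 * A segment must not cross a page boundary.  Note that any
	 * bytes of the shared page outside [offset, offset + length)
	 * may still hold data from an earlier request.  That data was
	 * already visible to blkback, so nothing new is leaked - but
	 * this needs revisiting if persistent grants are ever shared
	 * with domains other than the backend.
	 */
	BUG_ON(sg->offset + sg->length > PAGE_SIZE);

	shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->frame));
	bvec_data = kmap_atomic(sg_page(sg));

	memcpy(shared_data + sg->offset,
	       bvec_data + sg->offset,
	       sg->length);

	kunmap_atomic(bvec_data);
	kunmap_atomic(shared_data);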
On Thu, 2012-09-20 at 14:49 +0100, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
> > >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> > > The memory overhead, and fallback mode points are related:
> > > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> > > per device. I made a mistake (pointed out by Jan) as the maximum number
> > > of requests that can fit into a single-page ring is 64, not 256.
> > > -Clearly, this still scales linearly. So the problem of memory footprint
> > > will occur with more VMs, or block devices.
> > > -Whilst 2.75MB per device is probably acceptable (?), if we start using
> > > multipage rings, then we might not want to have
> > > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
> > > memory overhead to increase. This is why I have implemented the
> > > 'fallback' mode. With a multipage ring, it seems reasonable to want the
> > > first $x$ grefs seen by blkback to be treated as persistent, and any
> > > later ones to be non-persistent. Does that seem sensible?
> >
> > From a resource usage pov, perhaps. But this will get the guest
> > entirely unpredictable performance. Plus I don't think 11Mb of
>
> Wouldn't it fall back to the older performance?

I guess it would be a bit more complex than that. It would be worse than
the new performance because the grefs that get processed by the
'fallback' mode will cause TLB shootdowns. But any early grefs will
still be processed by the persistent mode, so won't have shootdowns.
Therefore, depending on the ratio of {persistent grants}:{non-persistent
grants} allocated by blkfront, the performance will be somewhere
in between the two extremes.

I guess that the choice is between
1) Compiling blk{front,back} with a pre-determined number of persistent
grants, and failing if this limit is exceeded. This seems rather
inflexible, as blk{front,back} must then both use the same version,
or you will get failures.
2 (current setup)) Have a recommended maximum number of
persistently-mapped pages, and going into a 'fallback' mode if blkfront
exceeds this limit.
3) Having blkback inform blkfront on startup as to how many grefs it is
willing to persistently map (a rough sketch follows this mail). We then
hit the same question again though: what should we do if blkfront
ignores this limit?

> > _virtual_ space is unacceptable overhead in a 64-bit kernel. If
> > you really want/need this in a 32-bit one, then perhaps some
> > other alternatives would be needed (and persistent grants may
> > not be the right approach there in the first place).
> >
> > Jan
> >
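[For illustration, option 3 could be negotiated through an extra
xenstore node. "max-persistent-grants" and info->max_pgrants are
made-up names, not part of the posted patch:]

	/* Backend side: advertise the limit it is willing to map. */
	err = xenbus_printf(xbt, dev->nodename, "max-persistent-grants",
			    "%u", BLKIF_MAX_PERS_REQUESTS_PER_DEV);
	if (err)
		dev_warn(&dev->dev, "writing max-persistent-grants");

	/* Frontend side: read the limit, treating absence as 0
	 * (i.e. no persistent grants). */
	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
			    "max-persistent-grants", "%u",
			    &info->max_pgrants, NULL);
	if (err)
		info->max_pgrants = 0;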
>>> On 20.09.12 at 15:49, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
>> >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
>> > The memory overhead, and fallback mode points are related:
>> > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
>> > per device. I made a mistake (pointed out by Jan) as the maximum number
>> > of requests that can fit into a single-page ring is 64, not 256.
>> > -Clearly, this still scales linearly. So the problem of memory footprint
>> > will occur with more VMs, or block devices.
>> > -Whilst 2.75MB per device is probably acceptable (?), if we start using
>> > multipage rings, then we might not want to have
>> > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
>> > memory overhead to increase. This is why I have implemented the
>> > 'fallback' mode. With a multipage ring, it seems reasonable to want the
>> > first $x$ grefs seen by blkback to be treated as persistent, and any
>> > later ones to be non-persistent. Does that seem sensible?
>>
>> From a resource usage pov, perhaps. But this will get the guest
>> entirely unpredictable performance. Plus I don't think 11Mb of
>
> Wouldn't it fall back to the older performance?

Right, but the guest can't really predict this. That may have
significant impact if the performance difference is big enough.

Jan
>>> On 20.09.12 at 16:13, Oliver Chick <oliver.chick@citrix.com> wrote:
> 3) Having blkback inform blkfront on startup as to how many grefs it is
> willing to persistently map. We then hit the same question again though:
> what should we do if blkfront ignores this limit?

Not really - if any information coming from the backend is being
ignored by the frontend, then it certainly can't expect consistent
performance.

Jan
Konrad Rzeszutek Wilk
2012-Sep-20 21:24 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Thu, Sep 20, 2012 at 03:13:42PM +0100, Oliver Chick wrote:
> On Thu, 2012-09-20 at 14:49 +0100, Konrad Rzeszutek Wilk wrote:
> > On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
> > > >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> > > > The memory overhead, and fallback mode points are related:
> > > > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> > > > per device. I made a mistake (pointed out by Jan) as the maximum number
> > > > of requests that can fit into a single-page ring is 64, not 256.
> > > > -Clearly, this still scales linearly. So the problem of memory footprint
> > > > will occur with more VMs, or block devices.
> > > > -Whilst 2.75MB per device is probably acceptable (?), if we start using
> > > > multipage rings, then we might not want to have
> > > > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
> > > > memory overhead to increase. This is why I have implemented the
> > > > 'fallback' mode. With a multipage ring, it seems reasonable to want the
> > > > first $x$ grefs seen by blkback to be treated as persistent, and any
> > > > later ones to be non-persistent. Does that seem sensible?
> > >
> > > From a resource usage pov, perhaps. But this will get the guest
> > > entirely unpredictable performance. Plus I don't think 11Mb of
> >
> > Wouldn't it fall back to the older performance?
>
> I guess it would be a bit more complex than that. It would be worse than
> the new performance because the grefs that get processed by the
> 'fallback' mode will cause TLB shootdowns. But any early grefs will
> still be processed by the persistent mode, so won't have shootdowns.
> Therefore, depending on the ratio of {persistent grants}:{non-persistent
> grants} allocated by blkfront, the performance will be somewhere
> in between the two extremes.
>
> I guess that the choice is between
> 1) Compiling blk{front,back} with a pre-determined number of persistent
> grants, and failing if this limit is exceeded. This seems rather
> inflexible, as blk{front,back} must then both use the same version,
> or you will get failures.
> 2 (current setup)) Have a recommended maximum number of
> persistently-mapped pages, and going into a 'fallback' mode if blkfront
> exceeds this limit.
> 3) Having blkback inform blkfront on startup as to how many grefs it is
> willing to persistently map. We then hit the same question again though:
> what should we do if blkfront ignores this limit?

How about 2 and 3 together? Meaning have a recommended maximum number.
If we fall back due to memory pressure we can tell the guest that we
are entering fall-back mode. The frontend can decide what it wants to do
(throttle the amount of I/Os?) or just do a printk telling the user it
dropped the speed from "Insane Hot!" down to "Turbo!"...

Or maybe not. Perhaps just reporting it in the backend that we are
hitting memory pressure and using the old-style-fallback mechanism
so the system admin can take actions (and tell his users why suddenly
their I/Os are so slow).
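[A hypothetical sketch of the backend half of that suggestion; the
"persistent-grants-fallback" node is invented here for illustration
and is not part of any posted patch:]

	/* Flip a xenstore node when we run out of persistently
	 * mappable pages, so the frontend (or the admin, via
	 * xenstore-ls) can see why I/O suddenly slowed down. */
	err = xenbus_printf(XBT_NIL, dev->nodename,
			    "persistent-grants-fallback", "%d", 1);
	if (err)
		pr_warn(DRV_PFX "cannot signal persistent-grant fallback\n");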
>>> On 20.09.12 at 23:24, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Thu, Sep 20, 2012 at 03:13:42PM +0100, Oliver Chick wrote:
>> On Thu, 2012-09-20 at 14:49 +0100, Konrad Rzeszutek Wilk wrote:
>> > On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
>> > > >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
>> > > > The memory overhead, and fallback mode points are related:
>> > > > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
>> > > > per device. I made a mistake (pointed out by Jan) as the maximum number
>> > > > of requests that can fit into a single-page ring is 64, not 256.
>> > > > -Clearly, this still scales linearly. So the problem of memory footprint
>> > > > will occur with more VMs, or block devices.
>> > > > -Whilst 2.75MB per device is probably acceptable (?), if we start using
>> > > > multipage rings, then we might not want to have
>> > > > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
>> > > > memory overhead to increase. This is why I have implemented the
>> > > > 'fallback' mode. With a multipage ring, it seems reasonable to want the
>> > > > first $x$ grefs seen by blkback to be treated as persistent, and any
>> > > > later ones to be non-persistent. Does that seem sensible?
>> > >
>> > > From a resource usage pov, perhaps. But this will get the guest
>> > > entirely unpredictable performance. Plus I don't think 11Mb of
>> >
>> > Wouldn't it fall back to the older performance?
>>
>> I guess it would be a bit more complex than that. It would be worse than
>> the new performance because the grefs that get processed by the
>> 'fallback' mode will cause TLB shootdowns. But any early grefs will
>> still be processed by the persistent mode, so won't have shootdowns.
>> Therefore, depending on the ratio of {persistent grants}:{non-persistent
>> grants} allocated by blkfront, the performance will be somewhere
>> in between the two extremes.
>>
>> I guess that the choice is between
>> 1) Compiling blk{front,back} with a pre-determined number of persistent
>> grants, and failing if this limit is exceeded. This seems rather
>> inflexible, as blk{front,back} must then both use the same version,
>> or you will get failures.
>> 2 (current setup)) Have a recommended maximum number of
>> persistently-mapped pages, and going into a 'fallback' mode if blkfront
>> exceeds this limit.
>> 3) Having blkback inform blkfront on startup as to how many grefs it is
>> willing to persistently map. We then hit the same question again though:
>> what should we do if blkfront ignores this limit?
>
> How about 2 and 3 together? Meaning have a recommended maximum number.
> If we fall back due to memory pressure we can tell the guest that we
> are entering fall-back mode. The frontend can decide what it wants to do
> (throttle the amount of I/Os?) or just do a printk telling the user it
> dropped the speed from "Insane Hot!" down to "Turbo!"...
>
> Or maybe not. Perhaps just reporting it in the backend that we are
> hitting memory pressure and using the old-style-fallback mechanism
> so the system admin can take actions (and tell his users why suddenly
> their I/Os are so slow).

So would either of you help me understand what memory pressure
we're in need of dealing with here. So far, talk was only about
virtual address space that's needed for mapping in the grants, and
even then I don't see how this space requirement varies between
persistent and non-persistent grants - it's being reserved during
backend initialization anyway.

Jan
On Thu, 2012-09-20 at 22:24 +0100, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 20, 2012 at 03:13:42PM +0100, Oliver Chick wrote:
> > On Thu, 2012-09-20 at 14:49 +0100, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
> > > > >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> > > > > The memory overhead, and fallback mode points are related:
> > > > > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> > > > > per device. I made a mistake (pointed out by Jan) as the maximum number
> > > > > of requests that can fit into a single-page ring is 64, not 256.
> > > > > -Clearly, this still scales linearly. So the problem of memory footprint
> > > > > will occur with more VMs, or block devices.
> > > > > -Whilst 2.75MB per device is probably acceptable (?), if we start using
> > > > > multipage rings, then we might not want to have
> > > > > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
> > > > > memory overhead to increase. This is why I have implemented the
> > > > > 'fallback' mode. With a multipage ring, it seems reasonable to want the
> > > > > first $x$ grefs seen by blkback to be treated as persistent, and any
> > > > > later ones to be non-persistent. Does that seem sensible?
> > > >
> > > > From a resource usage pov, perhaps. But this will get the guest
> > > > entirely unpredictable performance. Plus I don't think 11Mb of
> > >
> > > Wouldn't it fall back to the older performance?
> >
> > I guess it would be a bit more complex than that. It would be worse than
> > the new performance because the grefs that get processed by the
> > 'fallback' mode will cause TLB shootdowns. But any early grefs will
> > still be processed by the persistent mode, so won't have shootdowns.
> > Therefore, depending on the ratio of {persistent grants}:{non-persistent
> > grants} allocated by blkfront, the performance will be somewhere
> > in between the two extremes.
> >
> > I guess that the choice is between
> > 1) Compiling blk{front,back} with a pre-determined number of persistent
> > grants, and failing if this limit is exceeded. This seems rather
> > inflexible, as blk{front,back} must then both use the same version,
> > or you will get failures.
> > 2 (current setup)) Have a recommended maximum number of
> > persistently-mapped pages, and going into a 'fallback' mode if blkfront
> > exceeds this limit.
> > 3) Having blkback inform blkfront on startup as to how many grefs it is
> > willing to persistently map. We then hit the same question again though:
> > what should we do if blkfront ignores this limit?
>
> How about 2 and 3 together?

I think 1 is fine for a "phase 1" implementation, especially taking into
consideration that the end of Oliver's internship is next week.

Also it seems that the cases where there might be some disconnect
between the number of persistent grants supported by the backend and the
number of requests from the frontend are currently theoretical or
predicated on the existence of unmerged or as yet unwritten patches.

So let's say, for now, that the default number of persistent grants is
the same as the number of slots in the ring and that it is a bug for
netfront to try and use more than that if it has signed up to the use of
persistent grants. netback is at liberty to fail such "overflow"
requests. In practice this can't happen with the current implementations
given the default specified above.

If someone wants to implement something like 2 or 3 in the future then
they can do so by negotiating through a xenstore key for a non-default
number of pgrants.

I think that if/when the number of persistent grants can differ from the
number of ring slots that an LRU scheme would be best (a sketch follows
this mail). i.e. if there are N slots then when the N+1th unique grant
comes in we discard the least recently used N/M (M perhaps in {2,3,4})
of the persistently granted pages. This way there is an incentive for
the f.e. to try to reuse pages as much as possible and we get good
batching on the unmaps if not.

Ian.

> Meaning have a recommended maximum number.
> If we fall back due to memory pressure we can tell the guest that we
> are entering fall-back mode. The frontend can decide what it wants to do
> (throttle the amount of I/Os?) or just do a printk telling the user it
> dropped the speed from "Insane Hot!" down to "Turbo!"...
>
> Or maybe not. Perhaps just reporting it in the backend that we are
> hitting memory pressure and using the old-style-fallback mechanism
> so the system admin can take actions (and tell his users why suddenly
> their I/Os are so slow).
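[A rough frontend-side sketch of the LRU discard Ian describes, built
on the gnt_list free list from the patch. pgrant_trim_lru() is an
assumption made here for illustration; it relies on blkif_completion()
pushing freed grants onto the head of the list, so the tail of the
list is least recently used:]

static void pgrant_trim_lru(struct blkfront_info *info, unsigned int target)
{
	struct gnt_list **link = &info->persistent_grants_front;
	struct gnt_list *gnt, *next;

	/* Skip over the 'target' most recently used free grants... */
	while (target-- && *link)
		link = &(*link)->tail;

	/* ...then unhook and discard everything after them (the LRU
	 * portion), batching all the unmaps into one pass. */
	gnt = *link;
	*link = NULL;
	while (gnt) {
		next = gnt->tail;
		gnttab_end_foreign_access(gnt->gref, 0, 0UL);
		__free_page(pfn_to_page(gnt->frame));
		kfree(gnt);
		info->persistent_grants_c--;
		gnt = next;
	}
}

[With N ring slots and M = 2, the caller would invoke
pgrant_trim_lru(info, N - N/2) when the N+1th unique grant arrives.]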
On Fri, 2012-09-21 at 08:18 +0100, Jan Beulich wrote:
> >>> On 20.09.12 at 23:24, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Thu, Sep 20, 2012 at 03:13:42PM +0100, Oliver Chick wrote:
> >> On Thu, 2012-09-20 at 14:49 +0100, Konrad Rzeszutek Wilk wrote:
> >> > On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
> >> > > >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> >> > > > The memory overhead, and fallback mode points are related:
> >> > > > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> >> > > > per device. I made a mistake (pointed out by Jan) as the maximum number
> >> > > > of requests that can fit into a single-page ring is 64, not 256.
> >> > > > -Clearly, this still scales linearly. So the problem of memory footprint
> >> > > > will occur with more VMs, or block devices.
> >> > > > -Whilst 2.75MB per device is probably acceptable (?), if we start using
> >> > > > multipage rings, then we might not want to have
> >> > > > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
> >> > > > memory overhead to increase. This is why I have implemented the
> >> > > > 'fallback' mode. With a multipage ring, it seems reasonable to want the
> >> > > > first $x$ grefs seen by blkback to be treated as persistent, and any
> >> > > > later ones to be non-persistent. Does that seem sensible?
> >> > >
> >> > > From a resource usage pov, perhaps. But this will get the guest
> >> > > entirely unpredictable performance. Plus I don't think 11Mb of
> >> >
> >> > Wouldn't it fall back to the older performance?
> >>
> >> I guess it would be a bit more complex than that. It would be worse than
> >> the new performance because the grefs that get processed by the
> >> 'fallback' mode will cause TLB shootdowns. But any early grefs will
> >> still be processed by the persistent mode, so won't have shootdowns.
> >> Therefore, depending on the ratio of {persistent grants}:{non-persistent
> >> grants} allocated by blkfront, the performance will be somewhere
> >> in between the two extremes.
> >>
> >> I guess that the choice is between
> >> 1) Compiling blk{front,back} with a pre-determined number of persistent
> >> grants, and failing if this limit is exceeded. This seems rather
> >> inflexible, as blk{front,back} must then both use the same version,
> >> or you will get failures.
> >> 2 (current setup)) Have a recommended maximum number of
> >> persistently-mapped pages, and going into a 'fallback' mode if blkfront
> >> exceeds this limit.
> >> 3) Having blkback inform blkfront on startup as to how many grefs it is
> >> willing to persistently map. We then hit the same question again though:
> >> what should we do if blkfront ignores this limit?
> >
> > How about 2 and 3 together? Meaning have a recommended maximum number.
> > If we fall back due to memory pressure we can tell the guest that we
> > are entering fall-back mode. The frontend can decide what it wants to do
> > (throttle the amount of I/Os?) or just do a printk telling the user it
> > dropped the speed from "Insane Hot!" down to "Turbo!"...
> >
> > Or maybe not. Perhaps just reporting it in the backend that we are
> > hitting memory pressure and using the old-style-fallback mechanism
> > so the system admin can take actions (and tell his users why suddenly
> > their I/Os are so slow).
>
> So would either of you help me understand what memory pressure
> we're in need of dealing with here. So far, talk was only about
> virtual address space that's needed for mapping in the grants, and
> even then I don't see how this space requirement varies between
> persistent and non-persistent grants - it's being reserved during
> backend initialization anyway.
>
> Jan
>

IIRC, the pending_pages[] used by blkback is a static array of 256
pages, allocated during initialisation. Therefore, the memory mapped
does not increase with the number of block devices being backed, for
non-persistent operation. Whereas, when we become persistent, we don't
unmap pages so can't reuse pages for different block devices, therefore
we have to alloc more pages for each device.

Does that answer your question, and make sense?
>>> On 21.09.12 at 10:41, Oliver Chick <oliver.chick@citrix.com> wrote:
> On Fri, 2012-09-21 at 08:18 +0100, Jan Beulich wrote:
>> >>> On 20.09.12 at 23:24, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> > On Thu, Sep 20, 2012 at 03:13:42PM +0100, Oliver Chick wrote:
>> >> On Thu, 2012-09-20 at 14:49 +0100, Konrad Rzeszutek Wilk wrote:
>> >> > On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
>> >> > > >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
>> >> > > > The memory overhead, and fallback mode points are related:
>> >> > > > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
>> >> > > > per device. I made a mistake (pointed out by Jan) as the maximum number
>> >> > > > of requests that can fit into a single-page ring is 64, not 256.
>> >> > > > -Clearly, this still scales linearly. So the problem of memory footprint
>> >> > > > will occur with more VMs, or block devices.
>> >> > > > -Whilst 2.75MB per device is probably acceptable (?), if we start using
>> >> > > > multipage rings, then we might not want to have
>> >> > > > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
>> >> > > > memory overhead to increase. This is why I have implemented the
>> >> > > > 'fallback' mode. With a multipage ring, it seems reasonable to want the
>> >> > > > first $x$ grefs seen by blkback to be treated as persistent, and any
>> >> > > > later ones to be non-persistent. Does that seem sensible?
>> >> > >
>> >> > > From a resource usage pov, perhaps. But this will get the guest
>> >> > > entirely unpredictable performance. Plus I don't think 11Mb of
>> >> >
>> >> > Wouldn't it fall back to the older performance?
>> >>
>> >> I guess it would be a bit more complex than that. It would be worse than
>> >> the new performance because the grefs that get processed by the
>> >> 'fallback' mode will cause TLB shootdowns. But any early grefs will
>> >> still be processed by the persistent mode, so won't have shootdowns.
>> >> Therefore, depending on the ratio of {persistent grants}:{non-persistent
>> >> grants} allocated by blkfront, the performance will be somewhere
>> >> in between the two extremes.
>> >>
>> >> I guess that the choice is between
>> >> 1) Compiling blk{front,back} with a pre-determined number of persistent
>> >> grants, and failing if this limit is exceeded. This seems rather
>> >> inflexible, as blk{front,back} must then both use the same version,
>> >> or you will get failures.
>> >> 2 (current setup)) Have a recommended maximum number of
>> >> persistently-mapped pages, and going into a 'fallback' mode if blkfront
>> >> exceeds this limit.
>> >> 3) Having blkback inform blkfront on startup as to how many grefs it is
>> >> willing to persistently map. We then hit the same question again though:
>> >> what should we do if blkfront ignores this limit?
>> >
>> > How about 2 and 3 together? Meaning have a recommended maximum number.
>> > If we fall back due to memory pressure we can tell the guest that we
>> > are entering fall-back mode. The frontend can decide what it wants to do
>> > (throttle the amount of I/Os?) or just do a printk telling the user it
>> > dropped the speed from "Insane Hot!" down to "Turbo!"...
>> >
>> > Or maybe not. Perhaps just reporting it in the backend that we are
>> > hitting memory pressure and using the old-style-fallback mechanism
>> > so the system admin can take actions (and tell his users why suddenly
>> > their I/Os are so slow).
>>
>> So would either of you help me understand what memory pressure
>> we're in need of dealing with here. So far, talk was only about
>> virtual address space that's needed for mapping in the grants, and
>> even then I don't see how this space requirement varies between
>> persistent and non-persistent grants - it's being reserved during
>> backend initialization anyway.
>>
>> Jan
>>
>
> IIRC, the pending_pages[] used by blkback is a static array of 256
> pages, allocated during initialisation. Therefore, the memory mapped
> does not increase with the number of block devices being backed, for
> non-persistent operation. Whereas, when we become persistent, we don't
> unmap pages so can't reuse pages for different block devices, therefore
> we have to alloc more pages for each device.
>
> Does that answer your question, and make sense?

This is an array of pointers to struct page, not an array of pages.
11Mb worth of pages amounts to about 22k of memory needed for this
array. If we're _that_ short of memory, of course some throttling or
even failing of requests is desirable.

Jan
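[For reference, the arithmetic behind the figures in this exchange,
assuming 4KB pages and 8-byte pointers:]

	64 requests/ring x 11 segments/request = 704 grants per device
	704 pages x 4KB = 2816KB, i.e. the ~2.75MB mapped per device
	704 pointers x 8B = ~5.5KB for the struct page pointer array
	(at the original 256-request figure: 2816 grants, ~11MB mapped,
	and 2816 x 8B = ~22KB of pointers - Jan's number above)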
On 20/09/12 12:48, Jan Beulich wrote:
>>>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
>> The memory overhead, and fallback mode points are related:
>> -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
>> per device. I made a mistake (pointed out by Jan) as the maximum number
>> of requests that can fit into a single-page ring is 64, not 256.
>> -Clearly, this still scales linearly. So the problem of memory footprint
>> will occur with more VMs, or block devices.
>> -Whilst 2.75MB per device is probably acceptable (?), if we start using
>> multipage rings, then we might not want to have
>> BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
>> memory overhead to increase. This is why I have implemented the
>> 'fallback' mode. With a multipage ring, it seems reasonable to want the
>> first $x$ grefs seen by blkback to be treated as persistent, and any
>> later ones to be non-persistent. Does that seem sensible?
>
> From a resource usage pov, perhaps. But this will get the guest
> entirely unpredictable performance. Plus I don't think 11Mb of
> _virtual_ space is unacceptable overhead in a 64-bit kernel. If
> you really want/need this in a 32-bit one, then perhaps some
> other alternatives would be needed (and persistent grants may
> not be the right approach there in the first place).

It's not just virtual space. blkback in pvops kernels allocates its
pages from the balloon and if there aren't enough ballooned out pages it
has to allocate real pages (releasing the MFN back to Xen).

Classic kernels didn't need to do this as there was an API for allocating
new "empty" struct page's that get mapped into kernel space.

David
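[For context, this is roughly how a pvops backend obtains backing
pages for grant mappings. alloc_xenballooned_pages() and
free_xenballooned_pages() are the real APIs in drivers/xen/balloon.c;
the surrounding code is only an illustrative sketch:]

	#include <xen/balloon.h>

	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];

	/* Prefer pages already ballooned out (no MFN behind them);
	 * if the balloon is empty this allocates real pages and hands
	 * their MFNs back to Xen, which is the cost David refers to. */
	if (alloc_xenballooned_pages(BLKIF_MAX_SEGMENTS_PER_REQUEST,
				     pages, false /* no highmem */))
		return -ENOMEM;

	/* ... map foreign grants into 'pages' ... */

	free_xenballooned_pages(BLKIF_MAX_SEGMENTS_PER_REQUEST, pages);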
Konrad Rzeszutek Wilk
2012-Sep-21 14:26 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Fri, Sep 21, 2012 at 09:10:44AM +0100, Ian Campbell wrote:
> On Thu, 2012-09-20 at 22:24 +0100, Konrad Rzeszutek Wilk wrote:
> > On Thu, Sep 20, 2012 at 03:13:42PM +0100, Oliver Chick wrote:
> > > On Thu, 2012-09-20 at 14:49 +0100, Konrad Rzeszutek Wilk wrote:
> > > > On Thu, Sep 20, 2012 at 12:48:41PM +0100, Jan Beulich wrote:
> > > > > >>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> > > > > > The memory overhead, and fallback mode points are related:
> > > > > > -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> > > > > > per device. I made a mistake (pointed out by Jan) as the maximum number
> > > > > > of requests that can fit into a single-page ring is 64, not 256.
> > > > > > -Clearly, this still scales linearly. So the problem of memory footprint
> > > > > > will occur with more VMs, or block devices.
> > > > > > -Whilst 2.75MB per device is probably acceptable (?), if we start using
> > > > > > multipage rings, then we might not want to have
> > > > > > BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
> > > > > > memory overhead to increase. This is why I have implemented the
> > > > > > 'fallback' mode. With a multipage ring, it seems reasonable to want the
> > > > > > first $x$ grefs seen by blkback to be treated as persistent, and any
> > > > > > later ones to be non-persistent. Does that seem sensible?
> > > > >
> > > > > From a resource usage pov, perhaps. But this will get the guest
> > > > > entirely unpredictable performance. Plus I don't think 11Mb of
> > > >
> > > > Wouldn't it fall back to the older performance?
> > >
> > > I guess it would be a bit more complex than that. It would be worse than
> > > the new performance because the grefs that get processed by the
> > > 'fallback' mode will cause TLB shootdowns. But any early grefs will
> > > still be processed by the persistent mode, so won't have shootdowns.
> > > Therefore, depending on the ratio of {persistent grants}:{non-persistent
> > > grants} allocated by blkfront, the performance will be somewhere
> > > in between the two extremes.
> > >
> > > I guess that the choice is between
> > > 1) Compiling blk{front,back} with a pre-determined number of persistent
> > > grants, and failing if this limit is exceeded. This seems rather
> > > inflexible, as blk{front,back} must then both use the same version,
> > > or you will get failures.
> > > 2 (current setup)) Have a recommended maximum number of
> > > persistently-mapped pages, and going into a 'fallback' mode if blkfront
> > > exceeds this limit.
> > > 3) Having blkback inform blkfront on startup as to how many grefs it is
> > > willing to persistently map. We then hit the same question again though:
> > > what should we do if blkfront ignores this limit?
> >
> > How about 2 and 3 together?
>
> I think 1 is fine for a "phase 1" implementation, especially taking into
> consideration that the end of Oliver's internship is next week.

Ah yes. Let's do 1 and then we can deal with 2 later on. At the same
time when netback persistent grants come online. Seems like both
backends will have to deal with this.

> Also it seems that the cases where there might be some disconnect
> between the number of persistent grants supported by the backend and the
> number of requests from the frontend are currently theoretical or
> predicated on the existence of unmerged or as yet unwritten patches.
>
> So let's say, for now, that the default number of persistent grants is
> the same as the number of slots in the ring and that it is a bug for
> netfront to try and use more than that if it has signed up to the use of
> persistent grants. netback is at liberty to fail such "overflow"
> requests. In practice this can't happen with the current implementations
> given the default specified above.

OK.
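[A sketch of the blkback side of the policy agreed here; the
vbd.feature_grnt_pers flag follows Konrad's earlier naming suggestion,
and none of this is code from the posted series:]

	/* The frontend signed up for persistent grants but sent more
	 * unique grefs than the ring can justify: treat it as a bug in
	 * (or an attack from) the frontend and fail the request. */
	if (blkif->vbd.feature_grnt_pers &&
	    blkif->pers_gnt_c >= BLKIF_MAX_PERS_REQUESTS_PER_DEV *
				 BLKIF_MAX_SEGMENTS_PER_REQUEST) {
		pr_warn(DRV_PFX "%u: persistent grant limit exceeded\n",
			blkif->domid);
		make_response(blkif, req->u.rw.id, req->operation,
			      BLKIF_RSP_ERROR);
		return -EIO;
	}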
Konrad Rzeszutek Wilk
2012-Sep-21 14:27 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Fri, Sep 21, 2012 at 01:23:48PM +0100, David Vrabel wrote:
> On 20/09/12 12:48, Jan Beulich wrote:
> >>>> On 20.09.12 at 13:30, Oliver Chick <oliver.chick@citrix.com> wrote:
> >> The memory overhead, and fallback mode points are related:
> >> -Firstly, it turns out that the overhead is actually 2.75MB, not 11MB
> >> per device. I made a mistake (pointed out by Jan) as the maximum number
> >> of requests that can fit into a single-page ring is 64, not 256.
> >> -Clearly, this still scales linearly. So the problem of memory footprint
> >> will occur with more VMs, or block devices.
> >> -Whilst 2.75MB per device is probably acceptable (?), if we start using
> >> multipage rings, then we might not want to have
> >> BLKIF_MAX_PERS_REQUESTS_PER_DEVICE==__RING_SIZE, as this will cause the
> >> memory overhead to increase. This is why I have implemented the
> >> 'fallback' mode. With a multipage ring, it seems reasonable to want the
> >> first $x$ grefs seen by blkback to be treated as persistent, and any
> >> later ones to be non-persistent. Does that seem sensible?
> >
> > From a resource usage pov, perhaps. But this will get the guest
> > entirely unpredictable performance. Plus I don't think 11Mb of
> > _virtual_ space is unacceptable overhead in a 64-bit kernel. If
> > you really want/need this in a 32-bit one, then perhaps some
> > other alternatives would be needed (and persistent grants may
> > not be the right approach there in the first place).
>
> It's not just virtual space. blkback in pvops kernels allocates its
> pages from the balloon and if there aren't enough ballooned out pages it
> has to allocate real pages (releasing the MFN back to Xen).
>
> Classic kernels didn't need to do this as there was an API for allocating
> new "empty" struct page's that get mapped into kernel space.

Can we resurrect/implement that in the new pvops? We could use the memory
hotplug mechanism to allocate "non-existent" memory.

> David
On 21/09/12 15:27, Konrad Rzeszutek Wilk wrote:
> On Fri, Sep 21, 2012 at 01:23:48PM +0100, David Vrabel wrote:
>>
>> It's not just virtual space. blkback in pvops kernels allocates its
>> pages from the balloon and if there aren't enough ballooned out pages it
>> has to allocate real pages (releasing the MFN back to Xen).
>>
>> Classic kernels didn't need to do this as there was an API for allocating
>> new "empty" struct page's that get mapped into kernel space.
>
> Can we resurrect/implement that in the new pvops? We could use the memory
> hotplug mechanism to allocate "non-existent" memory.

I don't see why not. Provided the limitations are documented I wouldn't
make merging the persistent grant stuff dependent on it.

David
Konrad Rzeszutek Wilk
2012-Sep-21 17:13 UTC
Re: [PATCH] Persistent grant maps for xen blk drivers
On Fri, Sep 21, 2012 at 05:17:36PM +0100, David Vrabel wrote:
> On 21/09/12 15:27, Konrad Rzeszutek Wilk wrote:
> > On Fri, Sep 21, 2012 at 01:23:48PM +0100, David Vrabel wrote:
> >>
> >> It's not just virtual space. blkback in pvops kernels allocates its
> >> pages from the balloon and if there aren't enough ballooned out pages it
> >> has to allocate real pages (releasing the MFN back to Xen).
> >>
> >> Classic kernels didn't need to do this as there was an API for allocating
> >> new "empty" struct page's that get mapped into kernel space.
> >
> > Can we resurrect/implement that in the new pvops? We could use the memory
> > hotplug mechanism to allocate "non-existent" memory.
>
> I don't see why not. Provided the limitations are documented I wouldn't
> make merging the persistent grant stuff dependent on it.

I was thinking about doing this in the future, not about delaying this
[persistent grants].

> David