Roger Pau Monne
2012-Oct-18 11:22 UTC
[PATCH RFC] Persistent grant maps for xen blk drivers
This patch implements persistent grants for the xen-blk{front,back} mechanism. The effect of this change is to reduce the number of unmap operations performed, since they cause a (costly) TLB shootdown. This allows the I/O performance to scale better when a large number of VMs are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request queue. This was granted to dom0; dom0 performed the I/O and wrote directly into the grant-mapped memory and unmapped it; blkfront then removed foreign access for that grant. The cost of unmapping scales badly with the number of CPUs in Dom0. An experiment showed that when Dom0 has 24 VCPUs, and guests are performing parallel I/O to a ramdisk, the IPIs from performing unmaps become a bottleneck at 5 guests (at which point 650,000 IOPS are being performed in total). If more than 5 guests are used, the performance declines. By 10 guests, only 400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection between blkfront and blkback is broken.

On startup blkfront notifies blkback that it is using persistent grants, and blkback will do the same. If blkback is not capable of persistent mapping, blkfront will still use the same grants, since this is compatible with the previous protocol and reduces code complexity in blkfront.

To perform a read in persistent mode, blkfront uses a separate pool of pages that it maps to dom0. When a request comes in, blkfront rewrites the request so that blkback will write into one of these free pages. Blkback keeps note of which grefs it has already mapped. When a new ring request comes to blkback, it looks to see if it has already mapped that page. If so, it will not map it again. If the page hasn't been previously mapped, it is mapped now, and a record is kept of this mapping. Blkback proceeds as usual. When blkfront is notified that blkback has completed a request, it memcpys from the shared memory into the supplied bvec.
A record is kept that the {gref, page} tuple is mapped and not in flight. Writes are similar, except that the memcpy is performed from the supplied bvecs into the shared pages before the request is put onto the ring.

Blkback stores a mapping of grefs => {page mapped to by gref} in a red-black tree. As the grefs are not known a priori, and provide no guarantees on their ordering, we have to perform a search through this tree to find the page, for every gref we receive. This operation takes O(log n) time in the worst case.

The maximum number of grants that blkback will persistently map is currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to prevent a malicious guest from attempting a DoS by supplying fresh grefs and causing the Dom0 kernel to map excessively. If a guest is using persistent grants and exceeds the maximum number of grants to map persistently, the newly passed grefs will be mapped and unmapped as before. Using this approach, we can have requests that mix persistent and non-persistent grants, and we need to handle them correctly. This also allows the maximum number of persistent grants to be set to a value lower than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although doing so leads to unpredictable performance.

In writing this patch, the question arises as to whether the additional cost of performing memcpys in the guest (to/from the pool of granted pages) outweighs the gains of not performing TLB shootdowns. The answer to that question is 'no'. There appears to be very little, if any, additional cost to the guest of using persistent grants. There is perhaps a small saving, from the reduced number of hypercalls performed in granting and ending foreign access.
Signed-off-by: Oliver Chick <oliver.chick@citrix.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> Cc: <konrad.wilk@oracle.com> Cc: <linux-kernel@vger.kernel.org> --- Benchmarks showing the impact of this patch in blk performance can be found at: http://xenbits.xensource.com/people/royger/persistent_grants/ --- drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++--- drivers/block/xen-blkback/common.h | 17 ++ drivers/block/xen-blkback/xenbus.c | 16 ++- drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++---- 4 files changed, 442 insertions(+), 53 deletions(-) diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c index c6decb9..2b982b2 100644 --- a/drivers/block/xen-blkback/blkback.c +++ b/drivers/block/xen-blkback/blkback.c @@ -78,6 +78,7 @@ struct pending_req { unsigned short operation; int status; struct list_head free_list; + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; }; #define BLKBACK_INVALID_HANDLE (~0) @@ -98,6 +99,30 @@ struct xen_blkbk { static struct xen_blkbk *blkbk; /* + * Maximum number of grant pages that can be mapped in blkback. + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of + * pages that blkback will persistently map. + */ +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol) +{ + switch (protocol) { + case BLKIF_PROTOCOL_NATIVE: + return __CONST_RING_SIZE(blkif, PAGE_SIZE) * + BLKIF_MAX_SEGMENTS_PER_REQUEST; + case BLKIF_PROTOCOL_X86_32: + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) * + BLKIF_MAX_SEGMENTS_PER_REQUEST; + case BLKIF_PROTOCOL_X86_64: + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) * + BLKIF_MAX_SEGMENTS_PER_REQUEST; + default: + BUG(); + } + return 0; +} + + +/* * Little helpful macro to figure out the index and virtual address of the * pending_pages[..]. For each ''pending_req'' we have have up to * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. 
The seg would be from 0 through @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, static void make_response(struct xen_blkif *blkif, u64 id, unsigned short op, int st); +#define foreach_grant(pos, rbtree, node) \ + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \ + &(pos)->node != NULL; \ + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node)) + + +static void add_persistent_gnt(struct rb_root *root, + struct persistent_gnt *persistent_gnt) +{ + struct rb_node **new = &(root->rb_node), *parent = NULL; + struct persistent_gnt *this; + + /* Figure out where to put new node */ + while (*new) { + this = container_of(*new, struct persistent_gnt, node); + + parent = *new; + if (persistent_gnt->gnt < this->gnt) + new = &((*new)->rb_left); + else if (persistent_gnt->gnt > this->gnt) + new = &((*new)->rb_right); + else { + pr_alert(DRV_PFX " trying to add a gref that''s already in the tree\n"); + BUG(); + } + } + + /* Add new node and rebalance tree. */ + rb_link_node(&(persistent_gnt->node), parent, new); + rb_insert_color(&(persistent_gnt->node), root); +} + +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root, + grant_ref_t gref) +{ + struct persistent_gnt *data; + struct rb_node *node = root->rb_node; + + while (node) { + data = container_of(node, struct persistent_gnt, node); + + if (gref < data->gnt) + node = node->rb_left; + else if (gref > data->gnt) + node = node->rb_right; + else + return data; + } + return NULL; +} + /* * Retrieve from the ''pending_reqs'' a free pending_req structure to be used. 
*/ @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg) { struct xen_blkif *blkif = arg; struct xen_vbd *vbd = &blkif->vbd; + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + struct persistent_gnt *persistent_gnt; + int ret = 0; + int segs_to_unmap = 0; xen_blkif_get(blkif); @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg) print_stats(blkif); } + /* Free all persistent grant pages */ + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) { + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE); + gnttab_set_unmap_op(&unmap[segs_to_unmap], + (unsigned long) pfn_to_kaddr(page_to_pfn( + persistent_gnt->page)), + GNTMAP_host_map, + persistent_gnt->handle); + + pages[segs_to_unmap] = persistent_gnt->page; + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts); + kfree(persistent_gnt); + blkif->persistent_gnt_c--; + + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST || + !rb_next(&persistent_gnt->node)) { + ret = gnttab_unmap_refs(unmap, NULL, pages, + segs_to_unmap); + BUG_ON(ret); + segs_to_unmap = 0; + } + } + + BUG_ON(blkif->persistent_gnt_c != 0); + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts)); + if (log_stats) print_stats(blkif); @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req) int ret; for (i = 0; i < req->nr_pages; i++) { + if (!req->unmap_seg[i]) + continue; handle = pending_handle(req, i); if (handle == BLKBACK_INVALID_HANDLE) continue; @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req) static int xen_blkbk_map(struct blkif_request *req, struct pending_req *pending_req, - struct seg_buf seg[]) + struct seg_buf seg[], + struct page *pages[]) { struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST]; - int i; + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + struct persistent_gnt *persistent_gnt = NULL; + struct 
xen_blkif *blkif = pending_req->blkif; + phys_addr_t addr = 0; + int i, j; + int new_map; int nseg = req->u.rw.nr_segments; + int segs_to_map = 0; int ret = 0; + int use_persistent_gnts; + + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent); + + BUG_ON(blkif->persistent_gnt_c > + max_mapped_grant_pages(pending_req->blkif->blk_protocol)); /* * Fill out preq.nr_sects with proper amount of sectors, and setup @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req, for (i = 0; i < nseg; i++) { uint32_t flags; - flags = GNTMAP_host_map; - if (pending_req->operation != BLKIF_OP_READ) - flags |= GNTMAP_readonly; - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags, - req->u.rw.seg[i].gref, - pending_req->blkif->domid); + if (use_persistent_gnts) + persistent_gnt = get_persistent_gnt( + &blkif->persistent_gnts, + req->u.rw.seg[i].gref); + + if (persistent_gnt) { + /* + * We are using persistent grants and + * the grant is already mapped + */ + new_map = 0; + } else if (use_persistent_gnts && + blkif->persistent_gnt_c < + max_mapped_grant_pages(blkif->blk_protocol)) { + /* + * We are using persistent grants, the grant is + * not mapped but we have room for it + */ + new_map = 1; + persistent_gnt = kzalloc( + sizeof(struct persistent_gnt), + GFP_KERNEL); + if (!persistent_gnt) + return -ENOMEM; + persistent_gnt->page = alloc_page(GFP_KERNEL); + if (!persistent_gnt->page) { + kfree(persistent_gnt); + return -ENOMEM; + } + persistent_gnt->gnt = req->u.rw.seg[i].gref; + + pages_to_gnt[segs_to_map] + persistent_gnt->page; + addr = (unsigned long) pfn_to_kaddr( + page_to_pfn(persistent_gnt->page)); + + add_persistent_gnt(&blkif->persistent_gnts, + persistent_gnt); + blkif->persistent_gnt_c++; + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n", + persistent_gnt->gnt, blkif->persistent_gnt_c, + max_mapped_grant_pages(blkif->blk_protocol)); + } else { + /* + * We are either using persistent grants and + * hit the 
maximum limit of grants mapped, + * or we are not using persistent grants. + */ + if (use_persistent_gnts && + !blkif->vbd.overflow_max_grants) { + blkif->vbd.overflow_max_grants = 1; + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n", + blkif->domid, blkif->vbd.handle); + } + new_map = 1; + pages[i] = blkbk->pending_page(pending_req, i); + addr = vaddr(pending_req, i); + pages_to_gnt[segs_to_map] + blkbk->pending_page(pending_req, i); + } + + if (persistent_gnt) { + pages[i] = persistent_gnt->page; + persistent_gnts[i] = persistent_gnt; + } else { + persistent_gnts[i] = NULL; + } + + if (new_map) { + flags = GNTMAP_host_map; + if (!persistent_gnt && + (pending_req->operation != BLKIF_OP_READ)) + flags |= GNTMAP_readonly; + gnttab_set_map_op(&map[segs_to_map++], addr, + flags, req->u.rw.seg[i].gref, + blkif->domid); + } } - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg); - BUG_ON(ret); + if (segs_to_map) { + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map); + BUG_ON(ret); + } /* * Now swizzle the MFN in our domain with the MFN from the other domain * so that when we access vaddr(pending_req,i) it has the contents of * the page from the other domain. 
*/ - for (i = 0; i < nseg; i++) { - if (unlikely(map[i].status != 0)) { - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); - map[i].handle = BLKBACK_INVALID_HANDLE; - ret |= 1; + for (i = 0, j = 0; i < nseg; i++) { + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) { + /* This is a newly mapped grant */ + BUG_ON(j >= segs_to_map); + if (unlikely(map[j].status != 0)) { + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); + map[j].handle = BLKBACK_INVALID_HANDLE; + ret |= 1; + if (persistent_gnts[i]) { + rb_erase(&persistent_gnts[i]->node, + &blkif->persistent_gnts); + blkif->persistent_gnt_c--; + kfree(persistent_gnts[i]); + persistent_gnts[i] = NULL; + } + } + } + if (persistent_gnts[i]) { + if (!persistent_gnts[i]->handle) { + /* + * If this is a new persistent grant + * save the handler + */ + persistent_gnts[i]->handle = map[j].handle; + persistent_gnts[i]->dev_bus_addr + map[j++].dev_bus_addr; + } + pending_handle(pending_req, i) + persistent_gnts[i]->handle; + pending_req->unmap_seg[i] = 0; + + if (ret) + continue; + + seg[i].buf = persistent_gnts[i]->dev_bus_addr | + (req->u.rw.seg[i].first_sect << 9); + } else { + pending_handle(pending_req, i) = map[j].handle; + pending_req->unmap_seg[i] = 1; + + if (ret) + continue; + + seg[i].buf = map[j++].dev_bus_addr | + (req->u.rw.seg[i].first_sect << 9); } - - pending_handle(pending_req, i) = map[i].handle; - - if (ret) - continue; - - seg[i].buf = map[i].dev_bus_addr | - (req->u.rw.seg[i].first_sect << 9); } return ret; } @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, int operation; struct blk_plug plug; bool drain = false; + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; switch (req->operation) { case BLKIF_OP_READ: @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, * the hypercall to unmap the grants - that is all done in * xen_blkbk_unmap. 
*/ - if (xen_blkbk_map(req, pending_req, seg)) + if (xen_blkbk_map(req, pending_req, seg, pages)) goto fail_flush; /* @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, for (i = 0; i < nseg; i++) { while ((bio == NULL) || (bio_add_page(bio, - blkbk->pending_page(pending_req, i), + pages[i], seg[i].nsec << 9, seg[i].buf & ~PAGE_MASK) == 0)) { diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h index 9ad3b5e..6c08ee9 100644 --- a/drivers/block/xen-blkback/common.h +++ b/drivers/block/xen-blkback/common.h @@ -34,6 +34,7 @@ #include <linux/vmalloc.h> #include <linux/wait.h> #include <linux/io.h> +#include <linux/rbtree.h> #include <asm/setup.h> #include <asm/pgalloc.h> #include <asm/hypervisor.h> @@ -160,10 +161,22 @@ struct xen_vbd { sector_t size; bool flush_support; bool discard_secure; + + unsigned int feature_gnt_persistent:1; + unsigned int overflow_max_grants:1; }; struct backend_info; + +struct persistent_gnt { + struct page *page; + grant_ref_t gnt; + grant_handle_t handle; + uint64_t dev_bus_addr; + struct rb_node node; +}; + struct xen_blkif { /* Unique identifier for this interface. 
*/ domid_t domid; @@ -190,6 +203,10 @@ struct xen_blkif { struct task_struct *xenblkd; unsigned int waiting_reqs; + /* frontend feature information */ + struct rb_root persistent_gnts; + unsigned int persistent_gnt_c; + /* statistics */ unsigned long st_print; int st_rd_req; diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index 4f66171..9f88b4e 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid) atomic_set(&blkif->drain, 0); blkif->st_print = jiffies; init_waitqueue_head(&blkif->waiting_to_free); + blkif->persistent_gnts.rb_node = NULL; return blkif; } @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be) struct xenbus_device *dev = be->dev; unsigned long ring_ref; unsigned int evtchn; + unsigned int pers_grants; char protocol[64] = ""; int err; @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be) xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol); return -1; } - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n", - ring_ref, evtchn, be->blkif->blk_protocol, protocol); + err = xenbus_gather(XBT_NIL, dev->otherend, + "feature-persistent-grants", "%u", + &pers_grants, NULL); + if (err) + pers_grants = 0; + + be->blkif->vbd.feature_gnt_persistent = pers_grants; + be->blkif->vbd.overflow_max_grants = 0; + + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n", + ring_ref, evtchn, be->blkif->blk_protocol, protocol, + pers_grants); /* Map the shared frame, irq etc. 
*/ err = xen_blkif_map(be->blkif, ring_ref, evtchn); diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 2c2d2e5..206d422 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -44,6 +44,7 @@ #include <linux/mutex.h> #include <linux/scatterlist.h> #include <linux/bitmap.h> +#include <linux/llist.h> #include <xen/xen.h> #include <xen/xenbus.h> @@ -64,10 +65,17 @@ enum blkif_state { BLKIF_STATE_SUSPENDED, }; +struct grant { + grant_ref_t gref; + unsigned long pfn; + struct llist_node node; +}; + struct blk_shadow { struct blkif_request req; struct request *request; unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST]; }; static DEFINE_MUTEX(blkfront_mutex); @@ -97,6 +105,8 @@ struct blkfront_info struct work_struct work; struct gnttab_free_callback callback; struct blk_shadow shadow[BLK_RING_SIZE]; + struct llist_head persistent_gnts; + unsigned int persistent_gnts_c; unsigned long shadow_free; unsigned int feature_flush; unsigned int flush_op; @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req) unsigned long id; unsigned int fsect, lsect; int i, ref; + + /* + * Used to store if we are able to queue the request by just using + * existing persistent grants (0), or if we have to get new grants, + * as there are not sufficiently many free. 
+ */ + int new_persistent_gnts; grant_ref_t gref_head; + struct page *granted_page; + struct grant *gnt_list_entry = NULL; struct scatterlist *sg; if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) return 1; - if (gnttab_alloc_grant_references( - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) { - gnttab_request_free_callback( - &info->callback, - blkif_restart_queue_callback, - info, - BLKIF_MAX_SEGMENTS_PER_REQUEST); - return 1; - } + /* Check if we have enought grants to allocate a requests */ + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) { + new_persistent_gnts = 1; + if (gnttab_alloc_grant_references( + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c, + &gref_head) < 0) { + gnttab_request_free_callback( + &info->callback, + blkif_restart_queue_callback, + info, + BLKIF_MAX_SEGMENTS_PER_REQUEST); + return 1; + } + } else + new_persistent_gnts = 0; /* Fill out a communications ring structure. */ ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req) BLKIF_MAX_SEGMENTS_PER_REQUEST); for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) { - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg))); fsect = sg->offset >> 9; lsect = fsect + (sg->length >> 9) - 1; - /* install a grant reference. 
*/ - ref = gnttab_claim_grant_reference(&gref_head); - BUG_ON(ref == -ENOSPC); - gnttab_grant_foreign_access_ref( - ref, + if (info->persistent_gnts_c) { + BUG_ON(llist_empty(&info->persistent_gnts)); + gnt_list_entry = llist_entry( + llist_del_first(&info->persistent_gnts), + struct grant, node); + + ref = gnt_list_entry->gref; + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn); + info->persistent_gnts_c--; + } else { + ref = gnttab_claim_grant_reference(&gref_head); + BUG_ON(ref == -ENOSPC); + + gnt_list_entry + kmalloc(sizeof(struct grant), + GFP_ATOMIC); + if (!gnt_list_entry) + return -ENOMEM; + + granted_page = alloc_page(GFP_ATOMIC); + if (!granted_page) { + kfree(gnt_list_entry); + return -ENOMEM; + } + + gnt_list_entry->pfn + page_to_pfn(granted_page); + gnt_list_entry->gref = ref; + + buffer_mfn = pfn_to_mfn(page_to_pfn( + granted_page)); + gnttab_grant_foreign_access_ref(ref, info->xbdev->otherend_id, - buffer_mfn, - rq_data_dir(req)); + buffer_mfn, 0); + } + + info->shadow[id].grants_used[i] = gnt_list_entry; + + if (rq_data_dir(req)) { + char *bvec_data; + void *shared_data; + + BUG_ON(sg->offset + sg->length > PAGE_SIZE); + + shared_data = kmap_atomic( + pfn_to_page(gnt_list_entry->pfn)); + bvec_data = kmap_atomic(sg_page(sg)); + + /* + * this does not wipe data stored outside the + * range sg->offset..sg->offset+sg->length. + * Therefore, blkback *could* see data from + * previous requests. This is OK as long as + * persistent grants are shared with just one + * domain. It may need refactoring if This + * changes + */ + memcpy(shared_data + sg->offset, + bvec_data + sg->offset, + sg->length); + + kunmap_atomic(bvec_data); + kunmap_atomic(shared_data); + } info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn); ring_req->u.rw.seg[i] @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req) /* Keep a private copy so we can reissue requests when recovering. 
*/ info->shadow[id].req = *ring_req; - gnttab_free_grant_references(gref_head); + if (new_persistent_gnts) + gnttab_free_grant_references(gref_head); return 0; } @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size) static void xlvbd_flush(struct blkfront_info *info) { blk_queue_flush(info->rq, info->feature_flush); - printk(KERN_INFO "blkfront: %s: %s: %s\n", + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n", info->gd->disk_name, info->flush_op == BLKIF_OP_WRITE_BARRIER ? "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ? @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work) static void blkif_free(struct blkfront_info *info, int suspend) { + struct llist_node *all_gnts; + struct grant *persistent_gnt; + /* Prevent new requests being issued until we fix things up. */ spin_lock_irq(&info->io_lock); info->connected = suspend ? @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend) /* No more blkif_request(). */ if (info->rq) blk_stop_queue(info->rq); + + /* Remove all persistent grants */ + if (info->persistent_gnts_c) { + all_gnts = llist_del_all(&info->persistent_gnts); + llist_for_each_entry(persistent_gnt, all_gnts, node) { + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL); + kfree(persistent_gnt); + } + info->persistent_gnts_c = 0; + } + /* No more gnttab callback work. */ gnttab_cancel_free_callback(&info->callback); spin_unlock_irq(&info->io_lock); @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend) } -static void blkif_completion(struct blk_shadow *s) +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, + struct blkif_response *bret) { int i; - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place - * flag. 
*/ - for (i = 0; i < s->req.u.rw.nr_segments; i++) - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL); + struct bio_vec *bvec; + struct req_iterator iter; + unsigned long flags; + char *bvec_data; + void *shared_data; + unsigned int offset = 0; + + if (bret->operation == BLKIF_OP_READ) { + /* + * Copy the data received from the backend into the bvec. + * Since bv_len can be different from PAGE_SIZE, we need to + * be sure we are actually copying the data from the right + * shared page. + */ + rq_for_each_segment(bvec, s->request, iter) { + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE); + i = offset >> PAGE_SHIFT; + shared_data = kmap_atomic( + pfn_to_page(s->grants_used[i]->pfn)); + bvec_data = bvec_kmap_irq(bvec, &flags); + memcpy(bvec_data, shared_data + bvec->bv_offset, + bvec->bv_len); + bvec_kunmap_irq(bvec_data, &flags); + kunmap_atomic(shared_data); + offset += bvec->bv_len; + } + } + /* Add the persistent grant into the list of free grants */ + for (i = 0; i < s->req.u.rw.nr_segments; i++) { + llist_add(&s->grants_used[i]->node, &info->persistent_gnts); + info->persistent_gnts_c++; + } } static irqreturn_t blkif_interrupt(int irq, void *dev_id) @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) req = info->shadow[id].request; if (bret->operation != BLKIF_OP_DISCARD) - blkif_completion(&info->shadow[id]); + blkif_completion(&info->shadow[id], info, bret); if (add_id_to_freelist(info, id)) { WARN(1, "%s: response to %s (id %ld) couldn''t be recycled!\n", @@ -942,6 +1066,11 @@ again: message = "writing protocol"; goto abort_transaction; } + err = xenbus_printf(xbt, dev->nodename, + "feature-persistent-grants", "%d", 1); + if (err) + dev_warn(&dev->dev, + "writing persistent grants feature to xenbus"); err = xenbus_transaction_end(xbt, 0); if (err) { @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev, spin_lock_init(&info->io_lock); info->xbdev = dev; info->vdevice = vdevice; + 
init_llist_head(&info->persistent_gnts); + info->persistent_gnts_c = 0; info->connected = BLKIF_STATE_DISCONNECTED; INIT_WORK(&info->work, blkif_restart_queue); @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info) req->u.rw.seg[j].gref, info->xbdev->otherend_id, pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]), - rq_data_dir(info->shadow[req->u.rw.id].request)); + 0); } info->shadow[req->u.rw.id].req = *req; -- 1.7.7.5 (Apple Git-26)
James Harper
2012-Oct-19 01:34 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This allows
> the I/O performance to scale better when a large number of VMs are
> performing I/O.
>
> Previously, the blkfront driver was supplied a bvec[] from the request
> queue. This was granted to dom0; dom0 performed the I/O and wrote
> directly into the grant-mapped memory and unmapped it; blkfront then
> removed foreign access for that grant. The cost of unmapping scales badly
> with the number of CPUs in Dom0. An experiment showed that when
> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a ramdisk, the
> IPIs from performing unmaps is a bottleneck at 5 guests (at which point
> 650,000 IOPS are being performed in total). If more than 5 guests are used,
> the performance declines. By 10 guests, only
> 400,000 IOPS are being performed.
>
> This patch improves performance by only unmapping when the connection
> between blkfront and back is broken.

I assume network drivers would suffer from the same affliction... Would a more general persistent map solution be worth considering (or be possible)? So a common interface to this persistent mapping, allowing the persistent pool to be shared between all drivers in the DomU?

James
Roger Pau Monné
2012-Oct-19 08:26 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On 19/10/12 03:34, James Harper wrote:
>> This patch implements persistent grants for the xen-blk{front,back}
>> mechanism. The effect of this change is to reduce the number of unmap
>> operations performed, since they cause a (costly) TLB shootdown. This allows
>> the I/O performance to scale better when a large number of VMs are
>> performing I/O.
>>
>> Previously, the blkfront driver was supplied a bvec[] from the request
>> queue. This was granted to dom0; dom0 performed the I/O and wrote
>> directly into the grant-mapped memory and unmapped it; blkfront then
>> removed foreign access for that grant. The cost of unmapping scales badly
>> with the number of CPUs in Dom0. An experiment showed that when
>> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a ramdisk, the
>> IPIs from performing unmaps is a bottleneck at 5 guests (at which point
>> 650,000 IOPS are being performed in total). If more than 5 guests are used,
>> the performance declines. By 10 guests, only
>> 400,000 IOPS are being performed.
>>
>> This patch improves performance by only unmapping when the connection
>> between blkfront and back is broken.
>
> I assume network drivers would suffer from the same affliction... Would a
> more general persistent map solution be worth considering (or be possible)?
> So a common interface to this persistent mapping allowing the persistent
> pool to be shared between all drivers in the DomU?

Yes, there are plans to implement the same for network drivers. I would generally avoid having a shared pool of grants for all the devices of a DomU; as said in the description of the patch:

Blkback stores a mapping of grefs => {page mapped to by gref} in a red-black tree. As the grefs are not known a priori, and provide no guarantees on their ordering, we have to perform a search through this tree to find the page, for every gref we receive. This operation takes O(log n) time in the worst case.
Having a shared pool with all grants would mean that n becomes much higher, and so the search time for a grant would increase. Also, if the pool is shared, some kind of concurrency control would have to be added, which will make it even slower.

What should be shared are the structures used to store the grants (struct grant and struct persistent_gnt in the front and back drivers respectively) and the red-black tree helpers (foreach_grant, add_persistent_gnt and get_persistent_gnt). These functions should be moved to a common header for the net implementation.
James Harper
2012-Oct-19 10:46 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
> > On 19/10/12 03:34, James Harper wrote: > >> > >> This patch implements persistent grants for the xen-blk{front,back} > >> mechanism. The effect of this change is to reduce the number of unmap > >> operations performed, since they cause a (costly) TLB shootdown. This > >> allows the I/O performance to scale better when a large number of VMs > >> are performing I/O. > >> > >> Previously, the blkfront driver was supplied a bvec[] from the > >> request queue. This was granted to dom0; dom0 performed the I/O and > >> wrote directly into the grant-mapped memory and unmapped it; blkfront > >> then removed foreign access for that grant. The cost of unmapping > >> scales badly with the number of CPUs in Dom0. An experiment showed > >> that when > >> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a > >> ramdisk, the IPIs from performing unmap''s is a bottleneck at 5 guests > >> (at which point > >> 650,000 IOPS are being performed in total). If more than 5 guests are > >> used, the performance declines. By 10 guests, only > >> 400,000 IOPS are being performed. > >> > >> This patch improves performance by only unmapping when the > connection > >> between blkfront and back is broken. > > > > I assume network drivers would suffer from the same affliction... Would a > more general persistent map solution be worth considering (or be possible)? > So a common interface to this persistent mapping allowing the persistent > pool to be shared between all drivers in the DomU? > > Yes, there are plans to implement the same for network drivers. I would > generally avoid having a shared pool of grants for all the devices of a DomU, > as said in the description of the patch: > > Blkback stores a mapping of grefs=>{page mapped to by gref} in a red-black > tree. As the grefs are not known apriori, and provide no guarantees on their > ordering, we have to perform a search through this tree to find the page, for > every gref we receive. 
> This operation takes O(log n) time in the worst case.
>
> Having a shared pool with all grants would mean that n will become much
> higher, and so the search time for a grant would increase.

I'm asking because I vaguely started a similar project a while back, but
didn't get much further than investigating data structures. I had
something like the following:

. Redefined gref so that the high bit indicates a persistent mapping (on
the basis that no DomU is ever going to have >2^31 grants). High bit set
indicates a persistent grant which is handled differently.

. New hypercall mem-ops to allocate/deallocate a persistent grant,
returning a handle from Dom0 (with the high bit set). Dom0 maintains a
table of mapped grants with the handle being the index. Ref counting
tracks usage so that an unmap won't be allowed while ref>0. I was taking
the approach that a chunk of persistent grants would be allocated at
boot time, so the actual map/unmap is not done often and the requirement
of a hypercall wasn't a big deal. I hadn't figured out how to manage the
size of this table yet.

. Mapping a gref with the high bit set in Dom0 becomes a lookup into the
persistent table and a ref++ rather than an actual mapping operation.
Unmapping becomes a ref--.

> Also, if the pool is shared some kind of concurrency control should be
> added, which will make it even slower.

Yes, but I think I only needed to worry about that for the actual
alloc/dealloc of the persistent map entry, which would be an infrequent
event. As I said, I never got much further than the above concept so I
hadn't fully explored that - at the time I was chasing an imaginary
problem with grant tables which turned out to be freelist contention in
DomU.

James
James Harper
2012-Oct-19 11:19 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
> . New hypercall mem-ops to allocate/deallocate a persistent grant,
> returning a handle from Dom0 (with the high bit set). Dom0 maintains a
> table of mapped grants with the handle being the index. Ref counting
> tracks usage so that an unmap won't be allowed while ref>0. I was
> taking the approach that a chunk of persistent grants would be
> allocated at boot time, so the actual map/unmap is not done often and
> the requirement of a hypercall wasn't a big deal. I hadn't figured out
> how to manage the size of this table yet.

Actually it was this bit that I didn't progress any further... the
hypercall wouldn't do the job, as Dom0 needed to get the message, which
would really need yet another shared ring - which meant yet more
complexity.

James
Roger Pau Monné
2012-Oct-19 11:28 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On 19/10/12 12:46, James Harper wrote:
>> On 19/10/12 03:34, James Harper wrote:
>>>> This patch implements persistent grants for the xen-blk{front,back}
>>>> mechanism. The effect of this change is to reduce the number of
>>>> unmap operations performed, since they cause a (costly) TLB
>>>> shootdown. This allows the I/O performance to scale better when a
>>>> large number of VMs are performing I/O.
>>>>
>>>> Previously, the blkfront driver was supplied a bvec[] from the
>>>> request queue. This was granted to dom0; dom0 performed the I/O and
>>>> wrote directly into the grant-mapped memory and unmapped it;
>>>> blkfront then removed foreign access for that grant. The cost of
>>>> unmapping scales badly with the number of CPUs in Dom0. An
>>>> experiment showed that when Dom0 has 24 VCPUs, and guests are
>>>> performing parallel I/O to a ramdisk, the IPIs from performing
>>>> unmaps are a bottleneck at 5 guests (at which point 650,000 IOPS
>>>> are being performed in total). If more than 5 guests are used, the
>>>> performance declines. By 10 guests, only 400,000 IOPS are being
>>>> performed.
>>>>
>>>> This patch improves performance by only unmapping when the
>>>> connection between blkfront and back is broken.
>>>
>>> I assume network drivers would suffer from the same affliction...
>>> Would a more general persistent map solution be worth considering
>>> (or be possible)? So a common interface to this persistent mapping
>>> allowing the persistent pool to be shared between all drivers in the
>>> DomU?
>>
>> Yes, there are plans to implement the same for network drivers. I
>> would generally avoid having a shared pool of grants for all the
>> devices of a DomU, as said in the description of the patch:
>>
>> Blkback stores a mapping of grefs=>{page mapped to by gref} in a
>> red-black tree. As the grefs are not known a priori, and provide no
>> guarantees on their ordering, we have to perform a search through
>> this tree to find the page, for every gref we receive. This operation
>> takes O(log n) time in the worst case.
>>
>> Having a shared pool with all grants would mean that n will become
>> much higher, and so the search time for a grant would increase.
>
> I'm asking because I vaguely started a similar project a while back,
> but didn't get much further than investigating data structures. I had
> something like the following:
>
> . Redefined gref so that the high bit indicates a persistent mapping
> (on the basis that no DomU is ever going to have >2^31 grants). High
> bit set indicates a persistent grant which is handled differently.

I don't understand why you need to change the way a gref is passed
around; this will break compatibility with non-persistent backends,
unless you negotiate the use of persistent grants before actually
starting the data transfer. But if you do that, you already know you
are using persistent grants, so there's no need to set any bit in the
gref.

> . New hypercall mem-ops to allocate/deallocate a persistent grant,
> returning a handle from Dom0 (with the high bit set). Dom0 maintains a
> table of mapped grants with the handle being the index. Ref counting
> tracks usage so that an unmap won't be allowed while ref>0. I was
> taking the approach that a chunk of persistent grants would be
> allocated at boot time, so the actual map/unmap is not done often and
> the requirement of a hypercall wasn't a big deal. I hadn't figured out
> how to manage the size of this table yet.

The so-called persistent grants are no different from normal grants;
it's just that we agree in blk{front/back} that the same set of grants
will be used for all transactions. There's no need to introduce any new
hypercalls, since they are just "regular" grants.

I agree that we could allocate them when initializing blkfront, but I
prefer to allocate them on request, since we probably won't use the
maximum number (RING_SIZE * SEGMENTS_PER_REQUEST).

> . Mapping a gref with the high bit set in Dom0 becomes a lookup into
> the persistent table and a ref++ rather than an actual mapping
> operation. Unmapping becomes a ref--.
>
>> Also, if the pool is shared some kind of concurrency control should
>> be added, which will make it even slower.
>
> Yes, but I think I only needed to worry about that for the actual
> alloc/dealloc of the persistent map entry which would be an infrequent
> event. As I said, I never got much further than the above concept so I
> hadn't fully explored that - at the time I was chasing an imaginary
> problem with grant tables which turned out to be freelist contention
> in DomU.

As far as I can see (correct me if I'm wrong), you are proposing a
solution that involves changes to both the guest and the hypervisor
side. I think this introduces unnecessary complexity to a problem that
can be solved by merely changing the way blk{front/back} behaves,
without requiring the hypervisor to know whether we are using
persistent grants or not.

> James
Konrad Rzeszutek Wilk
2012-Oct-22 13:47 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This
> allows the I/O performance to scale better when a large number of VMs
> are performing I/O.
>
> Previously, the blkfront driver was supplied a bvec[] from the request
> queue. This was granted to dom0; dom0 performed the I/O and wrote
> directly into the grant-mapped memory and unmapped it; blkfront then
> removed foreign access for that grant. The cost of unmapping scales
> badly with the number of CPUs in Dom0. An experiment showed that when
> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
> (at which point 650,000 IOPS are being performed in total). If more
> than 5 guests are used, the performance declines. By 10 guests, only
> 400,000 IOPS are being performed.
>
> This patch improves performance by only unmapping when the connection
> between blkfront and back is broken.
>
> On startup blkfront notifies blkback that it is using persistent
> grants, and blkback will do the same. If blkback is not capable of
> persistent mapping, blkfront will still use the same grants, since it
> is compatible with the previous protocol, and simplifies the code
> complexity in blkfront.
>
> To perform a read, in persistent mode, blkfront uses a separate pool
> of pages that it maps to dom0. When a request comes in, blkfront
> transmutes the request so that blkback will write into one of these
> free pages. Blkback keeps note of which grefs it has already
> mapped. When a new ring request comes to blkback, it looks to see if
> it has already mapped that page. If so, it will not map it again. If
> the page hasn't been previously mapped, it is mapped now, and a record
> is kept of this mapping.
> Blkback proceeds as usual. When blkfront is notified that blkback has
> completed a request, it memcpys from the shared memory into the bvec
> supplied. A record that the {gref, page} tuple is mapped, and not
> in flight, is kept.
>
> Writes are similar, except that the memcpy is performed from the
> supplied bvecs into the shared pages, before the request is put onto
> the ring.
>
> Blkback stores a mapping of grefs=>{page mapped to by gref} in
> a red-black tree. As the grefs are not known a priori, and provide no
> guarantees on their ordering, we have to perform a search
> through this tree to find the page, for every gref we receive. This
> operation takes O(log n) time in the worst case.

Might want to mention how blkfront stores it as well. It looks to be
using 'llist' instead of 'list'? Any particular reason?

> The maximum number of grants that blkback will persistently map is
> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
> prevent a malicious guest from attempting a DoS by supplying fresh
> grefs, causing the Dom0 kernel to map excessively. If a guest
> is using persistent grants and exceeds the maximum number of grants to
> map persistently, the newly passed grefs will be mapped and unmapped.
> Using this approach, we can have requests that mix persistent and
> non-persistent grants, and we need to handle them correctly.
> This allows us to set the maximum number of persistent grants to a
> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
> setting it will lead to unpredictable performance.
>
> In writing this patch, the question arises as to whether the
> additional cost of performing memcpys in the guest (to/from the pool
> of granted pages) outweighs the gains of not performing TLB
> shootdowns. The answer to that question is `no'. There appears to be
> very little, if any, additional cost to the guest of using persistent
> grants.
> There is perhaps a small saving, from the reduced number of hypercalls
> performed in granting, and ending foreign access.
>
> Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
> Cc: <konrad.wilk@oracle.com>
> Cc: <linux-kernel@vger.kernel.org>
> ---
> Benchmarks showing the impact of this patch in blk performance can be
> found at:
>
> http://xenbits.xensource.com/people/royger/persistent_grants/
> ---
>  drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
>  drivers/block/xen-blkback/common.h  |  17 ++
>  drivers/block/xen-blkback/xenbus.c  |  16 ++-
>  drivers/block/xen-blkfront.c        | 183 ++++++++++++++++++++----
>  4 files changed, 442 insertions(+), 53 deletions(-)
>
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index c6decb9..2b982b2 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -78,6 +78,7 @@ struct pending_req {
>  	unsigned short		operation;
>  	int			status;
>  	struct list_head	free_list;
> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  };
>
>  #define BLKBACK_INVALID_HANDLE (~0)
> @@ -98,6 +99,30 @@ struct xen_blkbk {
>  static struct xen_blkbk *blkbk;
>
>  /*
> + * Maximum number of grant pages that can be mapped in blkback.
> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
> + * pages that blkback will persistently map.
> + */ > +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol) > +{ > + switch (protocol) { > + case BLKIF_PROTOCOL_NATIVE: > + return __CONST_RING_SIZE(blkif, PAGE_SIZE) * > + BLKIF_MAX_SEGMENTS_PER_REQUEST; > + case BLKIF_PROTOCOL_X86_32: > + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) * > + BLKIF_MAX_SEGMENTS_PER_REQUEST; > + case BLKIF_PROTOCOL_X86_64: > + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) * > + BLKIF_MAX_SEGMENTS_PER_REQUEST;Could you include in the comments what the size (bytes) you expect it to be? If that would require you re-doing some tests - then don''t worry - but I figured you have some notes scribbled away that have the exact values down.> + default: > + BUG(); > + } > + return 0; > +} > + > + > +/* > * Little helpful macro to figure out the index and virtual address of the > * pending_pages[..]. For each ''pending_req'' we have have up to > * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through > @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > static void make_response(struct xen_blkif *blkif, u64 id, > unsigned short op, int st); > > +#define foreach_grant(pos, rbtree, node) \ > + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \ > + &(pos)->node != NULL; \ > + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node)) > + > + > +static void add_persistent_gnt(struct rb_root *root, > + struct persistent_gnt *persistent_gnt) > +{ > + struct rb_node **new = &(root->rb_node), *parent = NULL; > + struct persistent_gnt *this; > + > + /* Figure out where to put new node */ > + while (*new) { > + this = container_of(*new, struct persistent_gnt, node); > + > + parent = *new; > + if (persistent_gnt->gnt < this->gnt) > + new = &((*new)->rb_left); > + else if (persistent_gnt->gnt > this->gnt) > + new = &((*new)->rb_right); > + else { > + pr_alert(DRV_PFX " trying to add a gref that''s already in the tree\n"); > + BUG(); > + } > + 
} > + > + /* Add new node and rebalance tree. */ > + rb_link_node(&(persistent_gnt->node), parent, new); > + rb_insert_color(&(persistent_gnt->node), root); > +} > + > +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root, > + grant_ref_t gref) > +{ > + struct persistent_gnt *data; > + struct rb_node *node = root->rb_node; > + > + while (node) { > + data = container_of(node, struct persistent_gnt, node); > + > + if (gref < data->gnt) > + node = node->rb_left; > + else if (gref > data->gnt) > + node = node->rb_right; > + else > + return data; > + } > + return NULL; > +} > + > /* > * Retrieve from the ''pending_reqs'' a free pending_req structure to be used. > */ > @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg) > { > struct xen_blkif *blkif = arg; > struct xen_vbd *vbd = &blkif->vbd; > + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > + struct persistent_gnt *persistent_gnt; > + int ret = 0; > + int segs_to_unmap = 0; > > xen_blkif_get(blkif); > > @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg) > print_stats(blkif); > } > > + /* Free all persistent grant pages */ > + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {Just for sanity - you did check this with blkfronts that did not have persistent grants enabled, right?> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE); > + gnttab_set_unmap_op(&unmap[segs_to_unmap], > + (unsigned long) pfn_to_kaddr(page_to_pfn( > + persistent_gnt->page)), > + GNTMAP_host_map, > + persistent_gnt->handle); > + > + pages[segs_to_unmap] = persistent_gnt->page; > + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts); > + kfree(persistent_gnt); > + blkif->persistent_gnt_c--; > + > + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST || > + !rb_next(&persistent_gnt->node)) { > + ret = gnttab_unmap_refs(unmap, NULL, pages, > + segs_to_unmap); > + BUG_ON(ret); > + segs_to_unmap = 0; > + } > + } > + > + 
BUG_ON(blkif->persistent_gnt_c != 0); > + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts)); > + > if (log_stats) > print_stats(blkif); > > @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req) > int ret; > > for (i = 0; i < req->nr_pages; i++) { > + if (!req->unmap_seg[i]) > + continue;Perhaps there should be a #define for that array..> handle = pending_handle(req, i); > if (handle == BLKBACK_INVALID_HANDLE) > continue; > @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req) > > static int xen_blkbk_map(struct blkif_request *req, > struct pending_req *pending_req, > - struct seg_buf seg[]) > + struct seg_buf seg[], > + struct page *pages[]) > { > struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > - int i; > + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > + struct persistent_gnt *persistent_gnt = NULL; > + struct xen_blkif *blkif = pending_req->blkif; > + phys_addr_t addr = 0; > + int i, j; > + int new_map;Just use a bool for this.> int nseg = req->u.rw.nr_segments; > + int segs_to_map = 0; > int ret = 0; > + int use_persistent_gnts; > + > + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent); > + > + BUG_ON(blkif->persistent_gnt_c > > + max_mapped_grant_pages(pending_req->blkif->blk_protocol)); > > /* > * Fill out preq.nr_sects with proper amount of sectors, and setup > @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req, > for (i = 0; i < nseg; i++) { > uint32_t flags; > > - flags = GNTMAP_host_map; > - if (pending_req->operation != BLKIF_OP_READ) > - flags |= GNTMAP_readonly; > - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags, > - req->u.rw.seg[i].gref, > - pending_req->blkif->domid); > + if (use_persistent_gnts) > + persistent_gnt = get_persistent_gnt( > + &blkif->persistent_gnts, > + req->u.rw.seg[i].gref); > + > + if (persistent_gnt) { > + /* > + * We are using persistent grants 
and > + * the grant is already mapped > + */ > + new_map = 0; > + } else if (use_persistent_gnts && > + blkif->persistent_gnt_c < > + max_mapped_grant_pages(blkif->blk_protocol)) { > + /* > + * We are using persistent grants, the grant is > + * not mapped but we have room for it > + */ > + new_map = 1; > + persistent_gnt = kzalloc( > + sizeof(struct persistent_gnt), > + GFP_KERNEL); > + if (!persistent_gnt) > + return -ENOMEM; > + persistent_gnt->page = alloc_page(GFP_KERNEL); > + if (!persistent_gnt->page) { > + kfree(persistent_gnt); > + return -ENOMEM; > + } > + persistent_gnt->gnt = req->u.rw.seg[i].gref; > + > + pages_to_gnt[segs_to_map] > + persistent_gnt->page; > + addr = (unsigned long) pfn_to_kaddr( > + page_to_pfn(persistent_gnt->page)); > + > + add_persistent_gnt(&blkif->persistent_gnts, > + persistent_gnt); > + blkif->persistent_gnt_c++; > + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n", > + persistent_gnt->gnt, blkif->persistent_gnt_c, > + max_mapped_grant_pages(blkif->blk_protocol)); > + } else { > + /* > + * We are either using persistent grants and > + * hit the maximum limit of grants mapped, > + * or we are not using persistent grants. 
> + */ > + if (use_persistent_gnts && > + !blkif->vbd.overflow_max_grants) { > + blkif->vbd.overflow_max_grants = 1; > + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n", > + blkif->domid, blkif->vbd.handle); > + } > + new_map = 1; > + pages[i] = blkbk->pending_page(pending_req, i); > + addr = vaddr(pending_req, i); > + pages_to_gnt[segs_to_map] > + blkbk->pending_page(pending_req, i); > + } > + > + if (persistent_gnt) { > + pages[i] = persistent_gnt->page; > + persistent_gnts[i] = persistent_gnt; > + } else { > + persistent_gnts[i] = NULL; > + } > + > + if (new_map) { > + flags = GNTMAP_host_map; > + if (!persistent_gnt && > + (pending_req->operation != BLKIF_OP_READ)) > + flags |= GNTMAP_readonly; > + gnttab_set_map_op(&map[segs_to_map++], addr, > + flags, req->u.rw.seg[i].gref, > + blkif->domid); > + } > } > > - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg); > - BUG_ON(ret); > + if (segs_to_map) { > + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map); > + BUG_ON(ret); > + } > > /* > * Now swizzle the MFN in our domain with the MFN from the other domain > * so that when we access vaddr(pending_req,i) it has the contents of > * the page from the other domain. > */ > - for (i = 0; i < nseg; i++) { > - if (unlikely(map[i].status != 0)) { > - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); > - map[i].handle = BLKBACK_INVALID_HANDLE; > - ret |= 1; > + for (i = 0, j = 0; i < nseg; i++) { > + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) { > + /* This is a newly mapped grant */ > + BUG_ON(j >= segs_to_map); > + if (unlikely(map[j].status != 0)) {I am not seeing j being incremented anywhere? 
Should it?> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); > + map[j].handle = BLKBACK_INVALID_HANDLE; > + ret |= 1; > + if (persistent_gnts[i]) { > + rb_erase(&persistent_gnts[i]->node, > + &blkif->persistent_gnts); > + blkif->persistent_gnt_c--; > + kfree(persistent_gnts[i]); > + persistent_gnts[i] = NULL; > + } > + } > + } > + if (persistent_gnts[i]) { > + if (!persistent_gnts[i]->handle) { > + /* > + * If this is a new persistent grant > + * save the handler > + */ > + persistent_gnts[i]->handle = map[j].handle; > + persistent_gnts[i]->dev_bus_addr > + map[j++].dev_bus_addr; > + } > + pending_handle(pending_req, i) > + persistent_gnts[i]->handle; > + pending_req->unmap_seg[i] = 0;Could we have a #define for that?> + > + if (ret) > + continue; > + > + seg[i].buf = persistent_gnts[i]->dev_bus_addr | > + (req->u.rw.seg[i].first_sect << 9); > + } else { > + pending_handle(pending_req, i) = map[j].handle; > + pending_req->unmap_seg[i] = 1;And here as well?> + > + if (ret) > + continue; > + > + seg[i].buf = map[j++].dev_bus_addr | > + (req->u.rw.seg[i].first_sect << 9); > } > - > - pending_handle(pending_req, i) = map[i].handle; > - > - if (ret) > - continue; > - > - seg[i].buf = map[i].dev_bus_addr | > - (req->u.rw.seg[i].first_sect << 9); > } > return ret; > } > @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > int operation; > struct blk_plug plug; > bool drain = false; > + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > > switch (req->operation) { > case BLKIF_OP_READ: > @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > * the hypercall to unmap the grants - that is all done in > * xen_blkbk_unmap. 
> */ > - if (xen_blkbk_map(req, pending_req, seg)) > + if (xen_blkbk_map(req, pending_req, seg, pages)) > goto fail_flush; > > /* > @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > for (i = 0; i < nseg; i++) { > while ((bio == NULL) || > (bio_add_page(bio, > - blkbk->pending_page(pending_req, i), > + pages[i], > seg[i].nsec << 9, > seg[i].buf & ~PAGE_MASK) == 0)) { > > diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h > index 9ad3b5e..6c08ee9 100644 > --- a/drivers/block/xen-blkback/common.h > +++ b/drivers/block/xen-blkback/common.h > @@ -34,6 +34,7 @@ > #include <linux/vmalloc.h> > #include <linux/wait.h> > #include <linux/io.h> > +#include <linux/rbtree.h> > #include <asm/setup.h> > #include <asm/pgalloc.h> > #include <asm/hypervisor.h> > @@ -160,10 +161,22 @@ struct xen_vbd { > sector_t size; > bool flush_support; > bool discard_secure; > + > + unsigned int feature_gnt_persistent:1; > + unsigned int overflow_max_grants:1;I think the v3.7-rc1 has this structure changed just a tiny bit, so you might want to rebase it on top of that.> }; > > struct backend_info; > > + > +struct persistent_gnt { > + struct page *page; > + grant_ref_t gnt; > + grant_handle_t handle; > + uint64_t dev_bus_addr; > + struct rb_node node; > +}; > + > struct xen_blkif { > /* Unique identifier for this interface. 
*/ > domid_t domid; > @@ -190,6 +203,10 @@ struct xen_blkif { > struct task_struct *xenblkd; > unsigned int waiting_reqs; > > + /* frontend feature information */Huh?> + struct rb_root persistent_gnts; > + unsigned int persistent_gnt_c; > + > /* statistics */ > unsigned long st_print; > int st_rd_req; > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c > index 4f66171..9f88b4e 100644 > --- a/drivers/block/xen-blkback/xenbus.c > +++ b/drivers/block/xen-blkback/xenbus.c > @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid) > atomic_set(&blkif->drain, 0); > blkif->st_print = jiffies; > init_waitqueue_head(&blkif->waiting_to_free); > + blkif->persistent_gnts.rb_node = NULL; > > return blkif; > } > @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be) > struct xenbus_device *dev = be->dev; > unsigned long ring_ref; > unsigned int evtchn; > + unsigned int pers_grants; > char protocol[64] = ""; > int err; > > @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be) > xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol); > return -1; > } > - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n", > - ring_ref, evtchn, be->blkif->blk_protocol, protocol); > + err = xenbus_gather(XBT_NIL, dev->otherend, > + "feature-persistent-grants", "%u", > + &pers_grants, NULL); > + if (err) > + pers_grants = 0; > + > + be->blkif->vbd.feature_gnt_persistent = pers_grants; > + be->blkif->vbd.overflow_max_grants = 0; > + > + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n", > + ring_ref, evtchn, be->blkif->blk_protocol, protocol, > + pers_grants);Can you make that a string? So it is pers_grants ? "persistent grants" : "" instead of a zero of one value pls?> > /* Map the shared frame, irq etc. 
*/ > err = xen_blkif_map(be->blkif, ring_ref, evtchn); > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c > index 2c2d2e5..206d422 100644 > --- a/drivers/block/xen-blkfront.c > +++ b/drivers/block/xen-blkfront.c > @@ -44,6 +44,7 @@ > #include <linux/mutex.h> > #include <linux/scatterlist.h> > #include <linux/bitmap.h> > +#include <linux/llist.h> > > #include <xen/xen.h> > #include <xen/xenbus.h> > @@ -64,10 +65,17 @@ enum blkif_state { > BLKIF_STATE_SUSPENDED, > }; > > +struct grant { > + grant_ref_t gref; > + unsigned long pfn; > + struct llist_node node; > +}; > + > struct blk_shadow { > struct blkif_request req; > struct request *request; > unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > }; > > static DEFINE_MUTEX(blkfront_mutex); > @@ -97,6 +105,8 @@ struct blkfront_info > struct work_struct work; > struct gnttab_free_callback callback; > struct blk_shadow shadow[BLK_RING_SIZE]; > + struct llist_head persistent_gnts; > + unsigned int persistent_gnts_c; > unsigned long shadow_free; > unsigned int feature_flush; > unsigned int flush_op; > @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req) > unsigned long id; > unsigned int fsect, lsect; > int i, ref; > + > + /* > + * Used to store if we are able to queue the request by just using > + * existing persistent grants (0), or if we have to get new grants,What does the zero mean?> + * as there are not sufficiently many free. 
> + */ > + int new_persistent_gnts;I think this can be a bool?> grant_ref_t gref_head; > + struct page *granted_page; > + struct grant *gnt_list_entry = NULL; > struct scatterlist *sg; > > if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) > return 1; > > - if (gnttab_alloc_grant_references( > - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) { > - gnttab_request_free_callback( > - &info->callback, > - blkif_restart_queue_callback, > - info, > - BLKIF_MAX_SEGMENTS_PER_REQUEST); > - return 1; > - } > + /* Check if we have enought grants to allocate a requests */ > + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) { > + new_persistent_gnts = 1; > + if (gnttab_alloc_grant_references( > + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c, > + &gref_head) < 0) { > + gnttab_request_free_callback( > + &info->callback, > + blkif_restart_queue_callback, > + info, > + BLKIF_MAX_SEGMENTS_PER_REQUEST); > + return 1; > + } > + } else > + new_persistent_gnts = 0; > > /* Fill out a communications ring structure. */ > ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); > @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req) > BLKIF_MAX_SEGMENTS_PER_REQUEST); > > for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) { > - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg))); > fsect = sg->offset >> 9; > lsect = fsect + (sg->length >> 9) - 1; > - /* install a grant reference. 
*/ > - ref = gnttab_claim_grant_reference(&gref_head); > - BUG_ON(ref == -ENOSPC); > > - gnttab_grant_foreign_access_ref( > - ref, > + if (info->persistent_gnts_c) { > + BUG_ON(llist_empty(&info->persistent_gnts)); > + gnt_list_entry = llist_entry( > + llist_del_first(&info->persistent_gnts), > + struct grant, node); > + > + ref = gnt_list_entry->gref; > + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn); > + info->persistent_gnts_c--; > + } else { > + ref = gnttab_claim_grant_reference(&gref_head); > + BUG_ON(ref == -ENOSPC); > + > + gnt_list_entry > + kmalloc(sizeof(struct grant), > + GFP_ATOMIC); > + if (!gnt_list_entry) > + return -ENOMEM; > + > + granted_page = alloc_page(GFP_ATOMIC); > + if (!granted_page) { > + kfree(gnt_list_entry); > + return -ENOMEM; > + } > + > + gnt_list_entry->pfn > + page_to_pfn(granted_page); > + gnt_list_entry->gref = ref; > + > + buffer_mfn = pfn_to_mfn(page_to_pfn( > + granted_page)); > + gnttab_grant_foreign_access_ref(ref, > info->xbdev->otherend_id, > - buffer_mfn, > - rq_data_dir(req)); > + buffer_mfn, 0); > + } > + > + info->shadow[id].grants_used[i] = gnt_list_entry; > + > + if (rq_data_dir(req)) { > + char *bvec_data; > + void *shared_data; > + > + BUG_ON(sg->offset + sg->length > PAGE_SIZE); > + > + shared_data = kmap_atomic( > + pfn_to_page(gnt_list_entry->pfn)); > + bvec_data = kmap_atomic(sg_page(sg)); > + > + /* > + * this does not wipe data stored outside the > + * range sg->offset..sg->offset+sg->length. > + * Therefore, blkback *could* see data from > + * previous requests. This is OK as long as > + * persistent grants are shared with just one > + * domain. It may need refactoring if This.. 
this (lowercase it pls)> + * changes > + */ > + memcpy(shared_data + sg->offset, > + bvec_data + sg->offset, > + sg->length); > + > + kunmap_atomic(bvec_data); > + kunmap_atomic(shared_data); > + } > > info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn); > ring_req->u.rw.seg[i] > @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req) > /* Keep a private copy so we can reissue requests when recovering. */ > info->shadow[id].req = *ring_req; > > - gnttab_free_grant_references(gref_head); > + if (new_persistent_gnts) > + gnttab_free_grant_references(gref_head); > > return 0; > } > @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size) > static void xlvbd_flush(struct blkfront_info *info) > { > blk_queue_flush(info->rq, info->feature_flush); > - printk(KERN_INFO "blkfront: %s: %s: %s\n", > + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",HA! By default, eh?> info->gd->disk_name, > info->flush_op == BLKIF_OP_WRITE_BARRIER ? > "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ? > @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work) > > static void blkif_free(struct blkfront_info *info, int suspend) > { > + struct llist_node *all_gnts; > + struct grant *persistent_gnt; > + > /* Prevent new requests being issued until we fix things up. */ > spin_lock_irq(&info->io_lock); > info->connected = suspend ? > @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend) > /* No more blkif_request(). */ > if (info->rq) > blk_stop_queue(info->rq); > + > + /* Remove all persistent grants */ > + if (info->persistent_gnts_c) { > + all_gnts = llist_del_all(&info->persistent_gnts); > + llist_for_each_entry(persistent_gnt, all_gnts, node) { > + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL); > + kfree(persistent_gnt); > + } > + info->persistent_gnts_c = 0; > + } > + > /* No more gnttab callback work. 
*/ > gnttab_cancel_free_callback(&info->callback); > spin_unlock_irq(&info->io_lock); > @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend) > > } > > -static void blkif_completion(struct blk_shadow *s) > +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, > + struct blkif_response *bret) > { > int i; > - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place > - * flag. */ > - for (i = 0; i < s->req.u.rw.nr_segments; i++) > - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL); > + struct bio_vec *bvec; > + struct req_iterator iter; > + unsigned long flags; > + char *bvec_data; > + void *shared_data; > + unsigned int offset = 0; > + > + if (bret->operation == BLKIF_OP_READ) { > + /* > + * Copy the data received from the backend into the bvec. > + * Since bv_len can be different from PAGE_SIZE, we need to > + * be sure we are actually copying the data from the right > + * shared page. > + */ > + rq_for_each_segment(bvec, s->request, iter) { > + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE); > + i = offset >> PAGE_SHIFT;Could you also include a comment about the bug you found here, pls?> + shared_data = kmap_atomic( > + pfn_to_page(s->grants_used[i]->pfn)); > + bvec_data = bvec_kmap_irq(bvec, &flags); > + memcpy(bvec_data, shared_data + bvec->bv_offset, > + bvec->bv_len); > + bvec_kunmap_irq(bvec_data, &flags); > + kunmap_atomic(shared_data); > + offset += bvec->bv_len; > + } > + } > + /* Add the persistent grant into the list of free grants */ > + for (i = 0; i < s->req.u.rw.nr_segments; i++) { > + llist_add(&s->grants_used[i]->node, &info->persistent_gnts); > + info->persistent_gnts_c++; > + } > } > > static irqreturn_t blkif_interrupt(int irq, void *dev_id) > @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) > req = info->shadow[id].request; > > if (bret->operation != BLKIF_OP_DISCARD) > - blkif_completion(&info->shadow[id]); > + 
blkif_completion(&info->shadow[id], info, bret); > > if (add_id_to_freelist(info, id)) { > WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n", > @@ -942,6 +1066,11 @@ again: > message = "writing protocol"; > goto abort_transaction; > } > + err = xenbus_printf(xbt, dev->nodename, > + "feature-persistent-grants", "%d", 1);
So it's %u in blkback, but %d in here? Which one should it be?> + if (err) > + dev_warn(&dev->dev, > + "writing persistent grants feature to xenbus"); > > err = xenbus_transaction_end(xbt, 0); > if (err) { > @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev, > spin_lock_init(&info->io_lock); > info->xbdev = dev; > info->vdevice = vdevice; > + init_llist_head(&info->persistent_gnts); > + info->persistent_gnts_c = 0; > info->connected = BLKIF_STATE_DISCONNECTED; > INIT_WORK(&info->work, blkif_restart_queue); > > @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info) > req->u.rw.seg[j].gref, > info->xbdev->otherend_id, > pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]), > - rq_data_dir(info->shadow[req->u.rw.id].request)); > + 0); > } > info->shadow[req->u.rw.id].req = *req; > > -- > 1.7.7.5 (Apple Git-26) > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ >
Roger Pau Monné
2012-Oct-23 16:07 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:> On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote: >> This patch implements persistent grants for the xen-blk{front,back} >> mechanism. The effect of this change is to reduce the number of unmap >> operations performed, since they cause a (costly) TLB shootdown. This >> allows the I/O performance to scale better when a large number of VMs >> are performing I/O. >> >> Previously, the blkfront driver was supplied a bvec[] from the request >> queue. This was granted to dom0; dom0 performed the I/O and wrote >> directly into the grant-mapped memory and unmapped it; blkfront then >> removed foreign access for that grant. The cost of unmapping scales >> badly with the number of CPUs in Dom0. An experiment showed that when >> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a >> ramdisk, the IPIs from performing unmap''s is a bottleneck at 5 guests >> (at which point 650,000 IOPS are being performed in total). If more >> than 5 guests are used, the performance declines. By 10 guests, only >> 400,000 IOPS are being performed. >> >> This patch improves performance by only unmapping when the connection >> between blkfront and back is broken. >> >> On startup blkfront notifies blkback that it is using persistent >> grants, and blkback will do the same. If blkback is not capable of >> persistent mapping, blkfront will still use the same grants, since it >> is compatible with the previous protocol, and simplifies the code >> complexity in blkfront. >> >> To perform a read, in persistent mode, blkfront uses a separate pool >> of pages that it maps to dom0. When a request comes in, blkfront >> transmutes the request so that blkback will write into one of these >> free pages. Blkback keeps note of which grefs it has already >> mapped. When a new ring request comes to blkback, it looks to see if >> it has already mapped that page. If so, it will not map it again. 
If >> the page hasn''t been previously mapped, it is mapped now, and a record >> is kept of this mapping. Blkback proceeds as usual. When blkfront is >> notified that blkback has completed a request, it memcpy''s from the >> shared memory, into the bvec supplied. A record that the {gref, page} >> tuple is mapped, and not inflight is kept. >> >> Writes are similar, except that the memcpy is peformed from the >> supplied bvecs, into the shared pages, before the request is put onto >> the ring. >> >> Blkback stores a mapping of grefs=>{page mapped to by gref} in >> a red-black tree. As the grefs are not known apriori, and provide no >> guarantees on their ordering, we have to perform a search >> through this tree to find the page, for every gref we receive. This >> operation takes O(log n) time in the worst case. > > Might want to mention how blkfront stores it as well. It looks > to be using ''llist'' instead of ''list''? Any particular reason?Since we are just pushing and poping grant references, I went for what I think is the simplest one, a single linked list (list is a doubly linked list). Oliver in the previous version was using something similar, but custom made. I think it''s best to use the data structures provided by the kernel itself.> >> >> The maximum number of grants that blkback will persistenly map is >> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to >> prevent a malicios guest from attempting a DoS, by supplying fresh >> grefs, causing the Dom0 kernel to map excessively. If a guest >> is using persistent grants and exceeds the maximum number of grants to >> map persistenly the newly passed grefs will be mapped and unmaped. >> Using this approach, we can have requests that mix persistent and >> non-persistent grants, and we need to handle them correctly. 
>> This allows us to set the maximum number of persistent grants to a >> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although >> setting it will lead to unpredictable performance. >> >> In writing this patch, the question arrises as to if the additional >> cost of performing memcpys in the guest (to/from the pool of granted >> pages) outweigh the gains of not performing TLB shootdowns. The answer >> to that question is `no''. There appears to be very little, if any >> additional cost to the guest of using persistent grants. There is >> perhaps a small saving, from the reduced number of hypercalls >> performed in granting, and ending foreign access. >> >> Signed-off-by: Oliver Chick <oliver.chick@citrix.com> >> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> >> Cc: <konrad.wilk@oracle.com> >> Cc: <linux-kernel@vger.kernel.org> >> --- >> Benchmarks showing the impact of this patch in blk performance can be >> found at: >> >> http://xenbits.xensource.com/people/royger/persistent_grants/ >> --- >> drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++--- >> drivers/block/xen-blkback/common.h | 17 ++ >> drivers/block/xen-blkback/xenbus.c | 16 ++- >> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++---- >> 4 files changed, 442 insertions(+), 53 deletions(-) >> >> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c >> index c6decb9..2b982b2 100644 >> --- a/drivers/block/xen-blkback/blkback.c >> +++ b/drivers/block/xen-blkback/blkback.c >> @@ -78,6 +78,7 @@ struct pending_req { >> unsigned short operation; >> int status; >> struct list_head free_list; >> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> }; >> >> #define BLKBACK_INVALID_HANDLE (~0) >> @@ -98,6 +99,30 @@ struct xen_blkbk { >> static struct xen_blkbk *blkbk; >> >> /* >> + * Maximum number of grant pages that can be mapped in blkback. 
>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of >> + * pages that blkback will persistently map. >> + */ >> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol) >> +{ >> + switch (protocol) { >> + case BLKIF_PROTOCOL_NATIVE: >> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) * >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; >> + case BLKIF_PROTOCOL_X86_32: >> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) * >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; >> + case BLKIF_PROTOCOL_X86_64: >> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) * >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; > > Could you include in the comments what the size (bytes) you expect it to be? > If that would require you re-doing some tests - then don''t worry - but > I figured you have some notes scribbled away that have the exact values > down.As far as I know and remember (I''ve checked the ring size in the past), all ring types have a size of 32, BLKIF_MAX_SEGMENTS_PER_REQUEST is always 11, and sizeof(struct persistent_gnt) is 48, so that''s 32 * 11 * 48 = 16896 bytes. I will add a comment with this calculation.> >> + default: >> + BUG(); >> + } >> + return 0; >> +} >> + >> + >> +/* >> * Little helpful macro to figure out the index and virtual address of the >> * pending_pages[..]. For each ''pending_req'' we have have up to >> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. 
The seg would be from 0 through >> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, >> static void make_response(struct xen_blkif *blkif, u64 id, >> unsigned short op, int st); >> >> +#define foreach_grant(pos, rbtree, node) \ >> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \ >> + &(pos)->node != NULL; \ >> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node)) >> + >> + >> +static void add_persistent_gnt(struct rb_root *root, >> + struct persistent_gnt *persistent_gnt) >> +{ >> + struct rb_node **new = &(root->rb_node), *parent = NULL; >> + struct persistent_gnt *this; >> + >> + /* Figure out where to put new node */ >> + while (*new) { >> + this = container_of(*new, struct persistent_gnt, node); >> + >> + parent = *new; >> + if (persistent_gnt->gnt < this->gnt) >> + new = &((*new)->rb_left); >> + else if (persistent_gnt->gnt > this->gnt) >> + new = &((*new)->rb_right); >> + else { >> + pr_alert(DRV_PFX " trying to add a gref that''s already in the tree\n"); >> + BUG(); >> + } >> + } >> + >> + /* Add new node and rebalance tree. */ >> + rb_link_node(&(persistent_gnt->node), parent, new); >> + rb_insert_color(&(persistent_gnt->node), root); >> +} >> + >> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root, >> + grant_ref_t gref) >> +{ >> + struct persistent_gnt *data; >> + struct rb_node *node = root->rb_node; >> + >> + while (node) { >> + data = container_of(node, struct persistent_gnt, node); >> + >> + if (gref < data->gnt) >> + node = node->rb_left; >> + else if (gref > data->gnt) >> + node = node->rb_right; >> + else >> + return data; >> + } >> + return NULL; >> +} >> + >> /* >> * Retrieve from the ''pending_reqs'' a free pending_req structure to be used. 
>> */ >> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg) >> { >> struct xen_blkif *blkif = arg; >> struct xen_vbd *vbd = &blkif->vbd; >> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> + struct persistent_gnt *persistent_gnt; >> + int ret = 0; >> + int segs_to_unmap = 0; >> >> xen_blkif_get(blkif); >> >> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg) >> print_stats(blkif); >> } >> >> + /* Free all persistent grant pages */ >> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) { > > Just for sanity - you did check this with blkfronts that did not have > persistent grants enabled, right?Yes, it doesn''t crash, but looking at foreach_grant it seems like it should. I''ve added a check before trying to iterate over the tree.> >> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE); >> + gnttab_set_unmap_op(&unmap[segs_to_unmap], >> + (unsigned long) pfn_to_kaddr(page_to_pfn( >> + persistent_gnt->page)), >> + GNTMAP_host_map, >> + persistent_gnt->handle); >> + >> + pages[segs_to_unmap] = persistent_gnt->page; >> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts); >> + kfree(persistent_gnt); >> + blkif->persistent_gnt_c--; >> + >> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST || >> + !rb_next(&persistent_gnt->node)) { >> + ret = gnttab_unmap_refs(unmap, NULL, pages, >> + segs_to_unmap); >> + BUG_ON(ret); >> + segs_to_unmap = 0; >> + } >> + } >> + >> + BUG_ON(blkif->persistent_gnt_c != 0); >> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts)); >> + >> if (log_stats) >> print_stats(blkif); >> >> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req) >> int ret; >> >> for (i = 0; i < req->nr_pages; i++) { >> + if (!req->unmap_seg[i]) >> + continue; > > Perhaps there should be a #define for that array..Do you mean something like: #define unmap(req, i) req->unmap_seg[i]>> handle = pending_handle(req, i); >> if (handle == 
BLKBACK_INVALID_HANDLE) >> continue; >> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req) >> >> static int xen_blkbk_map(struct blkif_request *req, >> struct pending_req *pending_req, >> - struct seg_buf seg[]) >> + struct seg_buf seg[], >> + struct page *pages[]) >> { >> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> - int i; >> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> + struct persistent_gnt *persistent_gnt = NULL; >> + struct xen_blkif *blkif = pending_req->blkif; >> + phys_addr_t addr = 0; >> + int i, j; >> + int new_map; > > Just use a bool for this.Done>> int nseg = req->u.rw.nr_segments; >> + int segs_to_map = 0; >> int ret = 0; >> + int use_persistent_gnts; >> + >> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent); >> + >> + BUG_ON(blkif->persistent_gnt_c > >> + max_mapped_grant_pages(pending_req->blkif->blk_protocol)); >> >> /* >> * Fill out preq.nr_sects with proper amount of sectors, and setup >> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req, >> for (i = 0; i < nseg; i++) { >> uint32_t flags; >> >> - flags = GNTMAP_host_map; >> - if (pending_req->operation != BLKIF_OP_READ) >> - flags |= GNTMAP_readonly; >> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags, >> - req->u.rw.seg[i].gref, >> - pending_req->blkif->domid); >> + if (use_persistent_gnts) >> + persistent_gnt = get_persistent_gnt( >> + &blkif->persistent_gnts, >> + req->u.rw.seg[i].gref); >> + >> + if (persistent_gnt) { >> + /* >> + * We are using persistent grants and >> + * the grant is already mapped >> + */ >> + new_map = 0; >> + } else if (use_persistent_gnts && >> + blkif->persistent_gnt_c < >> + max_mapped_grant_pages(blkif->blk_protocol)) { >> + /* >> + * We are using persistent grants, the grant is >> + * not mapped but we have room for it >> + */ >> + new_map = 1; >> + persistent_gnt = kzalloc( >> + 
sizeof(struct persistent_gnt), >> + GFP_KERNEL); >> + if (!persistent_gnt) >> + return -ENOMEM; >> + persistent_gnt->page = alloc_page(GFP_KERNEL); >> + if (!persistent_gnt->page) { >> + kfree(persistent_gnt); >> + return -ENOMEM; >> + } >> + persistent_gnt->gnt = req->u.rw.seg[i].gref; >> + >> + pages_to_gnt[segs_to_map] >> + persistent_gnt->page; >> + addr = (unsigned long) pfn_to_kaddr( >> + page_to_pfn(persistent_gnt->page)); >> + >> + add_persistent_gnt(&blkif->persistent_gnts, >> + persistent_gnt); >> + blkif->persistent_gnt_c++; >> + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n", >> + persistent_gnt->gnt, blkif->persistent_gnt_c, >> + max_mapped_grant_pages(blkif->blk_protocol)); >> + } else { >> + /* >> + * We are either using persistent grants and >> + * hit the maximum limit of grants mapped, >> + * or we are not using persistent grants. >> + */ >> + if (use_persistent_gnts && >> + !blkif->vbd.overflow_max_grants) { >> + blkif->vbd.overflow_max_grants = 1; >> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n", >> + blkif->domid, blkif->vbd.handle); >> + } >> + new_map = 1; >> + pages[i] = blkbk->pending_page(pending_req, i); >> + addr = vaddr(pending_req, i); >> + pages_to_gnt[segs_to_map] >> + blkbk->pending_page(pending_req, i); >> + } >> + >> + if (persistent_gnt) { >> + pages[i] = persistent_gnt->page; >> + persistent_gnts[i] = persistent_gnt; >> + } else { >> + persistent_gnts[i] = NULL; >> + } >> + >> + if (new_map) { >> + flags = GNTMAP_host_map; >> + if (!persistent_gnt && >> + (pending_req->operation != BLKIF_OP_READ)) >> + flags |= GNTMAP_readonly; >> + gnttab_set_map_op(&map[segs_to_map++], addr, >> + flags, req->u.rw.seg[i].gref, >> + blkif->domid); >> + } >> } >> >> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg); >> - BUG_ON(ret); >> + if (segs_to_map) { >> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map); >> + 
BUG_ON(ret); >> + } >> >> /* >> * Now swizzle the MFN in our domain with the MFN from the other domain >> * so that when we access vaddr(pending_req,i) it has the contents of >> * the page from the other domain. >> */ >> - for (i = 0; i < nseg; i++) { >> - if (unlikely(map[i].status != 0)) { >> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); >> - map[i].handle = BLKBACK_INVALID_HANDLE; >> - ret |= 1; >> + for (i = 0, j = 0; i < nseg; i++) { >> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) { >> + /* This is a newly mapped grant */ >> + BUG_ON(j >= segs_to_map); >> + if (unlikely(map[j].status != 0)) { > > I am not seeing j being incremented anywhere? Should it?Yes, it should be incremented, but not here. See the comment below to see what I''ve changed.> >> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); >> + map[j].handle = BLKBACK_INVALID_HANDLE; >> + ret |= 1; >> + if (persistent_gnts[i]) { >> + rb_erase(&persistent_gnts[i]->node, >> + &blkif->persistent_gnts); >> + blkif->persistent_gnt_c--; >> + kfree(persistent_gnts[i]); >> + persistent_gnts[i] = NULL; >> + } >> + } >> + } >> + if (persistent_gnts[i]) { >> + if (!persistent_gnts[i]->handle) { >> + /* >> + * If this is a new persistent grant >> + * save the handler >> + */ >> + persistent_gnts[i]->handle = map[j].handle; >> + persistent_gnts[i]->dev_bus_addr >> + map[j++].dev_bus_addr; >> + } >> + pending_handle(pending_req, i) >> + persistent_gnts[i]->handle; >> + pending_req->unmap_seg[i] = 0; > > Could we have a #define for that?Sure.>> + >> + if (ret) >> + continue;This should be if (ret) { j++; continue; }>> + >> + seg[i].buf = persistent_gnts[i]->dev_bus_addr | >> + (req->u.rw.seg[i].first_sect << 9); >> + } else { >> + pending_handle(pending_req, i) = map[j].handle; >> + pending_req->unmap_seg[i] = 1; > > And here as well?Done.>> + >> + if (ret) >> + continue; >> + >> + seg[i].buf = map[j++].dev_bus_addr | >> + (req->u.rw.seg[i].first_sect << 9); >> } >> - >> - 
pending_handle(pending_req, i) = map[i].handle; >> - >> - if (ret) >> - continue; >> - >> - seg[i].buf = map[i].dev_bus_addr | >> - (req->u.rw.seg[i].first_sect << 9); >> } >> return ret; >> } >> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, >> int operation; >> struct blk_plug plug; >> bool drain = false; >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> >> switch (req->operation) { >> case BLKIF_OP_READ: >> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, >> * the hypercall to unmap the grants - that is all done in >> * xen_blkbk_unmap. >> */ >> - if (xen_blkbk_map(req, pending_req, seg)) >> + if (xen_blkbk_map(req, pending_req, seg, pages)) >> goto fail_flush; >> >> /* >> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, >> for (i = 0; i < nseg; i++) { >> while ((bio == NULL) || >> (bio_add_page(bio, >> - blkbk->pending_page(pending_req, i), >> + pages[i], >> seg[i].nsec << 9, >> seg[i].buf & ~PAGE_MASK) == 0)) { >> >> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h >> index 9ad3b5e..6c08ee9 100644 >> --- a/drivers/block/xen-blkback/common.h >> +++ b/drivers/block/xen-blkback/common.h >> @@ -34,6 +34,7 @@ >> #include <linux/vmalloc.h> >> #include <linux/wait.h> >> #include <linux/io.h> >> +#include <linux/rbtree.h> >> #include <asm/setup.h> >> #include <asm/pgalloc.h> >> #include <asm/hypervisor.h> >> @@ -160,10 +161,22 @@ struct xen_vbd { >> sector_t size; >> bool flush_support; >> bool discard_secure; >> + >> + unsigned int feature_gnt_persistent:1; >> + unsigned int overflow_max_grants:1; > > I think the v3.7-rc1 has this structure changed just a tiny bit, so you > might want to rebase it on top of that.I''ve done the rebase on top of your linux-next branch, commit ad502612c173fff239250c9fe9bdfaaef70b9901.> >> }; >> >> struct backend_info; >> >> + >> +struct persistent_gnt { >> + struct page *page; >> + grant_ref_t gnt; >> + 
grant_handle_t handle; >> + uint64_t dev_bus_addr; >> + struct rb_node node; >> +}; >> + >> struct xen_blkif { >> /* Unique identifier for this interface. */ >> domid_t domid; >> @@ -190,6 +203,10 @@ struct xen_blkif { >> struct task_struct *xenblkd; >> unsigned int waiting_reqs; >> >> + /* frontend feature information */ > > Huh?Changed it to: /* tree to store persistent grants */>> + struct rb_root persistent_gnts; >> + unsigned int persistent_gnt_c; >> + >> /* statistics */ >> unsigned long st_print; >> int st_rd_req; >> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c >> index 4f66171..9f88b4e 100644 >> --- a/drivers/block/xen-blkback/xenbus.c >> +++ b/drivers/block/xen-blkback/xenbus.c >> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid) >> atomic_set(&blkif->drain, 0); >> blkif->st_print = jiffies; >> init_waitqueue_head(&blkif->waiting_to_free); >> + blkif->persistent_gnts.rb_node = NULL; >> >> return blkif; >> } >> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be) >> struct xenbus_device *dev = be->dev; >> unsigned long ring_ref; >> unsigned int evtchn; >> + unsigned int pers_grants; >> char protocol[64] = ""; >> int err; >> >> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be) >> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol); >> return -1; >> } >> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n", >> - ring_ref, evtchn, be->blkif->blk_protocol, protocol); >> + err = xenbus_gather(XBT_NIL, dev->otherend, >> + "feature-persistent-grants", "%u", >> + &pers_grants, NULL); >> + if (err) >> + pers_grants = 0; >> + >> + be->blkif->vbd.feature_gnt_persistent = pers_grants; >> + be->blkif->vbd.overflow_max_grants = 0; >> + >> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n", >> + ring_ref, evtchn, be->blkif->blk_protocol, protocol, >> + pers_grants); > > Can you make that a string? 
So it is > pers_grants ? "persistent grants" : "" > > instead of a zero of one value pls?Yes, done.>> >> /* Map the shared frame, irq etc. */ >> err = xen_blkif_map(be->blkif, ring_ref, evtchn); >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c >> index 2c2d2e5..206d422 100644 >> --- a/drivers/block/xen-blkfront.c >> +++ b/drivers/block/xen-blkfront.c >> @@ -44,6 +44,7 @@ >> #include <linux/mutex.h> >> #include <linux/scatterlist.h> >> #include <linux/bitmap.h> >> +#include <linux/llist.h> >> >> #include <xen/xen.h> >> #include <xen/xenbus.h> >> @@ -64,10 +65,17 @@ enum blkif_state { >> BLKIF_STATE_SUSPENDED, >> }; >> >> +struct grant { >> + grant_ref_t gref; >> + unsigned long pfn; >> + struct llist_node node; >> +}; >> + >> struct blk_shadow { >> struct blkif_request req; >> struct request *request; >> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST]; >> }; >> >> static DEFINE_MUTEX(blkfront_mutex); >> @@ -97,6 +105,8 @@ struct blkfront_info >> struct work_struct work; >> struct gnttab_free_callback callback; >> struct blk_shadow shadow[BLK_RING_SIZE]; >> + struct llist_head persistent_gnts; >> + unsigned int persistent_gnts_c; >> unsigned long shadow_free; >> unsigned int feature_flush; >> unsigned int flush_op; >> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req) >> unsigned long id; >> unsigned int fsect, lsect; >> int i, ref; >> + >> + /* >> + * Used to store if we are able to queue the request by just using >> + * existing persistent grants (0), or if we have to get new grants, > > What does the zero mean?Frankly, no idea, I guess it was in Oliver''s patch and I missed to spot it.>> + * as there are not sufficiently many free. 
>> + */ >> + int new_persistent_gnts; > > I think this can be a bool?I agree.>> grant_ref_t gref_head; >> + struct page *granted_page; >> + struct grant *gnt_list_entry = NULL; >> struct scatterlist *sg; >> >> if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) >> return 1; >> >> - if (gnttab_alloc_grant_references( >> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) { >> - gnttab_request_free_callback( >> - &info->callback, >> - blkif_restart_queue_callback, >> - info, >> - BLKIF_MAX_SEGMENTS_PER_REQUEST); >> - return 1; >> - } >> + /* Check if we have enought grants to allocate a requests */ >> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) { >> + new_persistent_gnts = 1; >> + if (gnttab_alloc_grant_references( >> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c, >> + &gref_head) < 0) { >> + gnttab_request_free_callback( >> + &info->callback, >> + blkif_restart_queue_callback, >> + info, >> + BLKIF_MAX_SEGMENTS_PER_REQUEST); >> + return 1; >> + } >> + } else >> + new_persistent_gnts = 0; >> >> /* Fill out a communications ring structure. */ >> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); >> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req) >> BLKIF_MAX_SEGMENTS_PER_REQUEST); >> >> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) { >> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg))); >> fsect = sg->offset >> 9; >> lsect = fsect + (sg->length >> 9) - 1; >> - /* install a grant reference. 
*/ >> - ref = gnttab_claim_grant_reference(&gref_head); >> - BUG_ON(ref == -ENOSPC); >> >> - gnttab_grant_foreign_access_ref( >> - ref, >> + if (info->persistent_gnts_c) { >> + BUG_ON(llist_empty(&info->persistent_gnts)); >> + gnt_list_entry = llist_entry( >> + llist_del_first(&info->persistent_gnts), >> + struct grant, node); >> + >> + ref = gnt_list_entry->gref; >> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn); >> + info->persistent_gnts_c--; >> + } else { >> + ref = gnttab_claim_grant_reference(&gref_head); >> + BUG_ON(ref == -ENOSPC); >> + >> + gnt_list_entry >> + kmalloc(sizeof(struct grant), >> + GFP_ATOMIC); >> + if (!gnt_list_entry) >> + return -ENOMEM; >> + >> + granted_page = alloc_page(GFP_ATOMIC); >> + if (!granted_page) { >> + kfree(gnt_list_entry); >> + return -ENOMEM; >> + } >> + >> + gnt_list_entry->pfn >> + page_to_pfn(granted_page); >> + gnt_list_entry->gref = ref; >> + >> + buffer_mfn = pfn_to_mfn(page_to_pfn( >> + granted_page)); >> + gnttab_grant_foreign_access_ref(ref, >> info->xbdev->otherend_id, >> - buffer_mfn, >> - rq_data_dir(req)); >> + buffer_mfn, 0); >> + } >> + >> + info->shadow[id].grants_used[i] = gnt_list_entry; >> + >> + if (rq_data_dir(req)) { >> + char *bvec_data; >> + void *shared_data; >> + >> + BUG_ON(sg->offset + sg->length > PAGE_SIZE); >> + >> + shared_data = kmap_atomic( >> + pfn_to_page(gnt_list_entry->pfn)); >> + bvec_data = kmap_atomic(sg_page(sg)); >> + >> + /* >> + * this does not wipe data stored outside the >> + * range sg->offset..sg->offset+sg->length. >> + * Therefore, blkback *could* see data from >> + * previous requests. This is OK as long as >> + * persistent grants are shared with just one >> + * domain. It may need refactoring if This > .. 
this (lowercase it pls)Done.> >> + * changes >> + */ >> + memcpy(shared_data + sg->offset, >> + bvec_data + sg->offset, >> + sg->length); >> + >> + kunmap_atomic(bvec_data); >> + kunmap_atomic(shared_data); >> + } >> >> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn); >> ring_req->u.rw.seg[i] >> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req) >> /* Keep a private copy so we can reissue requests when recovering. */ >> info->shadow[id].req = *ring_req; >> >> - gnttab_free_grant_references(gref_head); >> + if (new_persistent_gnts) >> + gnttab_free_grant_references(gref_head); >> >> return 0; >> } >> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size) >> static void xlvbd_flush(struct blkfront_info *info) >> { >> blk_queue_flush(info->rq, info->feature_flush); >> - printk(KERN_INFO "blkfront: %s: %s: %s\n", >> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n", > > HA! By default, eh?Yes, you caught me, there''s a paragraph in the commit message that explains that we are using persistent grants in the frontend unconditionally, since the protocol is compatible (you can have a persistent blkfront and a non-persistent blkback). It simplifies the logic in blkfront. Are you OK with it?>> info->gd->disk_name, >> info->flush_op == BLKIF_OP_WRITE_BARRIER ? >> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ? >> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work) >> >> static void blkif_free(struct blkfront_info *info, int suspend) >> { >> + struct llist_node *all_gnts; >> + struct grant *persistent_gnt; >> + >> /* Prevent new requests being issued until we fix things up. */ >> spin_lock_irq(&info->io_lock); >> info->connected = suspend ? >> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend) >> /* No more blkif_request(). 
*/ >> if (info->rq) >> blk_stop_queue(info->rq); >> + >> + /* Remove all persistent grants */ >> + if (info->persistent_gnts_c) { >> + all_gnts = llist_del_all(&info->persistent_gnts); >> + llist_for_each_entry(persistent_gnt, all_gnts, node) { >> + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL); >> + kfree(persistent_gnt); >> + } >> + info->persistent_gnts_c = 0; >> + } >> + >> /* No more gnttab callback work. */ >> gnttab_cancel_free_callback(&info->callback); >> spin_unlock_irq(&info->io_lock); >> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend) >> >> } >> >> -static void blkif_completion(struct blk_shadow *s) >> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, >> + struct blkif_response *bret) >> { >> int i; >> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place >> - * flag. */ >> - for (i = 0; i < s->req.u.rw.nr_segments; i++) >> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL); >> + struct bio_vec *bvec; >> + struct req_iterator iter; >> + unsigned long flags; >> + char *bvec_data; >> + void *shared_data; >> + unsigned int offset = 0; >> + >> + if (bret->operation == BLKIF_OP_READ) { >> + /* >> + * Copy the data received from the backend into the bvec. >> + * Since bv_len can be different from PAGE_SIZE, we need to >> + * be sure we are actually copying the data from the right >> + * shared page. >> + */ >> + rq_for_each_segment(bvec, s->request, iter) { >> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE); >> + i = offset >> PAGE_SHIFT; > > Could you also include a comment about the bug you found here, pls?There''s a comment before the rq_for_each_segment loop, that tries to explain that, do you think the following is better? /* * Copy the data received from the backend into the bvec. 
* Since bv_offset can be different than 0, and bv_len different * than PAGE_SIZE, we have to keep track of the current offset, * to be sure we are copying the data from the right shared page. */>> + shared_data = kmap_atomic( >> + pfn_to_page(s->grants_used[i]->pfn)); >> + bvec_data = bvec_kmap_irq(bvec, &flags); >> + memcpy(bvec_data, shared_data + bvec->bv_offset, >> + bvec->bv_len); >> + bvec_kunmap_irq(bvec_data, &flags); >> + kunmap_atomic(shared_data); >> + offset += bvec->bv_len; >> + } >> + } >> + /* Add the persistent grant into the list of free grants */ >> + for (i = 0; i < s->req.u.rw.nr_segments; i++) { >> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts); >> + info->persistent_gnts_c++; >> + } >> } >> >> static irqreturn_t blkif_interrupt(int irq, void *dev_id) >> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) >> req = info->shadow[id].request; >> >> if (bret->operation != BLKIF_OP_DISCARD) >> - blkif_completion(&info->shadow[id]); >> + blkif_completion(&info->shadow[id], info, bret); >> >> if (add_id_to_freelist(info, id)) { >> WARN(1, "%s: response to %s (id %ld) couldn''t be recycled!\n", >> @@ -942,6 +1066,11 @@ again: >> message = "writing protocol"; >> goto abort_transaction; >> } >> + err = xenbus_printf(xbt, dev->nodename, >> + "feature-persistent-grants", "%d", 1); > > So its %u in blkback, but %d in here? 
Which one should it be?%u in both places.>> + if (err) >> + dev_warn(&dev->dev, >> + "writing persistent grants feature to xenbus"); >> >> err = xenbus_transaction_end(xbt, 0); >> if (err) { >> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev, >> spin_lock_init(&info->io_lock); >> info->xbdev = dev; >> info->vdevice = vdevice; >> + init_llist_head(&info->persistent_gnts); >> + info->persistent_gnts_c = 0; >> info->connected = BLKIF_STATE_DISCONNECTED; >> INIT_WORK(&info->work, blkif_restart_queue); >> >> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info) >> req->u.rw.seg[j].gref, >> info->xbdev->otherend_id, >> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]), >> - rq_data_dir(info->shadow[req->u.rw.id].request)); >> + 0); >> } >> info->shadow[req->u.rw.id].req = *req; >> >> -- >> 1.7.7.5 (Apple Git-26) >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ >>
Konrad Rzeszutek Wilk
2012-Oct-23 17:20 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On Tue, Oct 23, 2012 at 06:07:36PM +0200, Roger Pau Monné wrote:
> On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:
> > On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
> >> This patch implements persistent grants for the xen-blk{front,back}
> >> mechanism. The effect of this change is to reduce the number of unmap
> >> operations performed, since they cause a (costly) TLB shootdown. This
> >> allows the I/O performance to scale better when a large number of VMs
> >> are performing I/O.
> >>
> >> Previously, the blkfront driver was supplied a bvec[] from the request
> >> queue. This was granted to dom0; dom0 performed the I/O and wrote
> >> directly into the grant-mapped memory and unmapped it; blkfront then
> >> removed foreign access for that grant. The cost of unmapping scales
> >> badly with the number of CPUs in Dom0. An experiment showed that when
> >> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> >> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
> >> (at which point 650,000 IOPS are being performed in total). If more
> >> than 5 guests are used, the performance declines. By 10 guests, only
> >> 400,000 IOPS are being performed.
> >>
> >> This patch improves performance by only unmapping when the connection
> >> between blkfront and back is broken.
> >>
> >> On startup blkfront notifies blkback that it is using persistent
> >> grants, and blkback will do the same. If blkback is not capable of
> >> persistent mapping, blkfront will still use the same grants, since it
> >> is compatible with the previous protocol, and simplifies the code
> >> complexity in blkfront.
> >>
> >> To perform a read, in persistent mode, blkfront uses a separate pool
> >> of pages that it maps to dom0. When a request comes in, blkfront
> >> transmutes the request so that blkback will write into one of these
> >> free pages. Blkback keeps note of which grefs it has already
> >> mapped. When a new ring request comes to blkback, it looks to see if
> >> it has already mapped that page. If so, it will not map it again. If
> >> the page hasn't been previously mapped, it is mapped now, and a record
> >> is kept of this mapping. Blkback proceeds as usual. When blkfront is
> >> notified that blkback has completed a request, it memcpy's from the
> >> shared memory, into the bvec supplied. A record that the {gref, page}
> >> tuple is mapped, and not in flight, is kept.
> >>
> >> Writes are similar, except that the memcpy is performed from the
> >> supplied bvecs, into the shared pages, before the request is put onto
> >> the ring.
> >>
> >> Blkback stores a mapping of grefs=>{page mapped to by gref} in
> >> a red-black tree. As the grefs are not known a priori, and provide no
> >> guarantees on their ordering, we have to perform a search
> >> through this tree to find the page, for every gref we receive. This
> >> operation takes O(log n) time in the worst case.
> >
> > Might want to mention how blkfront stores it as well. It looks
> > to be using 'llist' instead of 'list'? Any particular reason?
>
> Since we are just pushing and popping grant references, I went for what I
> think is the simplest one, a singly linked list ('list' is a doubly linked
> list). Oliver in the previous version was using something similar, but
> custom made. I think it's best to use the data structures provided by
> the kernel itself.
>
> >
> >>
> >> The maximum number of grants that blkback will persistently map is
> >> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
> >> prevent a malicious guest from attempting a DoS, by supplying fresh
> >> grefs, causing the Dom0 kernel to map excessively. If a guest
> >> is using persistent grants and exceeds the maximum number of grants to
> >> map persistently, the newly passed grefs will be mapped and unmapped.
> >> Using this approach, we can have requests that mix persistent and
> >> non-persistent grants, and we need to handle them correctly.
> >> This allows us to set the maximum number of persistent grants to a
> >> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
> >> setting it will lead to unpredictable performance.
> >>
> >> In writing this patch, the question arises as to whether the additional
> >> cost of performing memcpys in the guest (to/from the pool of granted
> >> pages) outweighs the gains of not performing TLB shootdowns. The answer
> >> to that question is `no'. There appears to be very little, if any
> >> additional cost to the guest of using persistent grants. There is
> >> perhaps a small saving, from the reduced number of hypercalls
> >> performed in granting, and ending foreign access.
> >>
> >> Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
> >> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
> >> Cc: <konrad.wilk@oracle.com>
> >> Cc: <linux-kernel@vger.kernel.org>
> >> ---
> >> Benchmarks showing the impact of this patch in blk performance can be
> >> found at:
> >>
> >> http://xenbits.xensource.com/people/royger/persistent_grants/
> >> ---
> >> drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
> >> drivers/block/xen-blkback/common.h | 17 ++
> >> drivers/block/xen-blkback/xenbus.c | 16 ++-
> >> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++----
> >> 4 files changed, 442 insertions(+), 53 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> >> index c6decb9..2b982b2 100644
> >> --- a/drivers/block/xen-blkback/blkback.c
> >> +++ b/drivers/block/xen-blkback/blkback.c
> >> @@ -78,6 +78,7 @@ struct pending_req {
> >> unsigned short operation;
> >> int status;
> >> struct list_head free_list;
> >> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> };
> >>
> >> #define BLKBACK_INVALID_HANDLE (~0)
> >> @@ -98,6 +99,30
@@ struct xen_blkbk { > >> static struct xen_blkbk *blkbk; > >> > >> /* > >> + * Maximum number of grant pages that can be mapped in blkback. > >> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of > >> + * pages that blkback will persistently map. > >> + */ > >> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol) > >> +{ > >> + switch (protocol) { > >> + case BLKIF_PROTOCOL_NATIVE: > >> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) * > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; > >> + case BLKIF_PROTOCOL_X86_32: > >> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) * > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; > >> + case BLKIF_PROTOCOL_X86_64: > >> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) * > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; > > > > Could you include in the comments what the size (bytes) you expect it to be? > > If that would require you re-doing some tests - then don''t worry - but > > I figured you have some notes scribbled away that have the exact values > > down. > > As far as I know and remember (I''ve checked the ring size in the past), > all ring types have a size of 32, BLKIF_MAX_SEGMENTS_PER_REQUEST is > always 11, and sizeof(struct persistent_gnt) is 48, so that''s 32 * 11 * > 48 = 16896 bytes. I will add a comment with this calculation. > > > > >> + default: > >> + BUG(); > >> + } > >> + return 0; > >> +} > >> + > >> + > >> +/* > >> * Little helpful macro to figure out the index and virtual address of the > >> * pending_pages[..]. For each ''pending_req'' we have have up to > >> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. 
The seg would be from 0 through > >> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > >> static void make_response(struct xen_blkif *blkif, u64 id, > >> unsigned short op, int st); > >> > >> +#define foreach_grant(pos, rbtree, node) \ > >> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \ > >> + &(pos)->node != NULL; \ > >> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node)) > >> + > >> + > >> +static void add_persistent_gnt(struct rb_root *root, > >> + struct persistent_gnt *persistent_gnt) > >> +{ > >> + struct rb_node **new = &(root->rb_node), *parent = NULL; > >> + struct persistent_gnt *this; > >> + > >> + /* Figure out where to put new node */ > >> + while (*new) { > >> + this = container_of(*new, struct persistent_gnt, node); > >> + > >> + parent = *new; > >> + if (persistent_gnt->gnt < this->gnt) > >> + new = &((*new)->rb_left); > >> + else if (persistent_gnt->gnt > this->gnt) > >> + new = &((*new)->rb_right); > >> + else { > >> + pr_alert(DRV_PFX " trying to add a gref that''s already in the tree\n"); > >> + BUG(); > >> + } > >> + } > >> + > >> + /* Add new node and rebalance tree. */ > >> + rb_link_node(&(persistent_gnt->node), parent, new); > >> + rb_insert_color(&(persistent_gnt->node), root); > >> +} > >> + > >> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root, > >> + grant_ref_t gref) > >> +{ > >> + struct persistent_gnt *data; > >> + struct rb_node *node = root->rb_node; > >> + > >> + while (node) { > >> + data = container_of(node, struct persistent_gnt, node); > >> + > >> + if (gref < data->gnt) > >> + node = node->rb_left; > >> + else if (gref > data->gnt) > >> + node = node->rb_right; > >> + else > >> + return data; > >> + } > >> + return NULL; > >> +} > >> + > >> /* > >> * Retrieve from the ''pending_reqs'' a free pending_req structure to be used. 
> >> */ > >> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg) > >> { > >> struct xen_blkif *blkif = arg; > >> struct xen_vbd *vbd = &blkif->vbd; > >> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct persistent_gnt *persistent_gnt; > >> + int ret = 0; > >> + int segs_to_unmap = 0; > >> > >> xen_blkif_get(blkif); > >> > >> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg) > >> print_stats(blkif); > >> } > >> > >> + /* Free all persistent grant pages */ > >> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) { > > > > Just for sanity - you did check this with blkfronts that did not have > > persistent grants enabled, right? > > Yes, it doesn''t crash, but looking at foreach_grant it seems like it > should. I''ve added a check before trying to iterate over the tree. > > > > >> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE); > >> + gnttab_set_unmap_op(&unmap[segs_to_unmap], > >> + (unsigned long) pfn_to_kaddr(page_to_pfn( > >> + persistent_gnt->page)), > >> + GNTMAP_host_map, > >> + persistent_gnt->handle); > >> + > >> + pages[segs_to_unmap] = persistent_gnt->page; > >> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts); > >> + kfree(persistent_gnt); > >> + blkif->persistent_gnt_c--; > >> + > >> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST || > >> + !rb_next(&persistent_gnt->node)) { > >> + ret = gnttab_unmap_refs(unmap, NULL, pages, > >> + segs_to_unmap); > >> + BUG_ON(ret); > >> + segs_to_unmap = 0; > >> + } > >> + } > >> + > >> + BUG_ON(blkif->persistent_gnt_c != 0); > >> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts)); > >> + > >> if (log_stats) > >> print_stats(blkif); > >> > >> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req) > >> int ret; > >> > >> for (i = 0; i < req->nr_pages; i++) { > >> + if (!req->unmap_seg[i]) > >> + continue; > > > > Perhaps there should be a #define for that array.. 
>
> Do you mean something like:
>
> #define unmap(req, i) req->unmap_seg[i]

I was thinking that you just check for req->unmap_seg[i] to have a
non-zero value. But since that array is just used as a check to see
whether the functionality is enabled (or not), you might want to declare
the right values, so:

#define UNMAP_SG_ON 1
#define UNMAP_SG_OFF 0

or so.

>
> >> handle = pending_handle(req, i);
> >> if (handle == BLKBACK_INVALID_HANDLE)
> >> continue;
> >> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
> >>
> >> static int xen_blkbk_map(struct blkif_request *req,
> >> struct pending_req *pending_req,
> >> - struct seg_buf seg[])
> >> + struct seg_buf seg[],
> >> + struct page *pages[])
> >> {
> >> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> - int i;
> >> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct persistent_gnt *persistent_gnt = NULL;
> >> + struct xen_blkif *blkif = pending_req->blkif;
> >> + phys_addr_t addr = 0;
> >> + int i, j;
> >> + int new_map;
> >
> > Just use a bool for this.
> > Done > > >> int nseg = req->u.rw.nr_segments; > >> + int segs_to_map = 0; > >> int ret = 0; > >> + int use_persistent_gnts; > >> + > >> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent); > >> + > >> + BUG_ON(blkif->persistent_gnt_c > > >> + max_mapped_grant_pages(pending_req->blkif->blk_protocol)); > >> > >> /* > >> * Fill out preq.nr_sects with proper amount of sectors, and setup > >> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req, > >> for (i = 0; i < nseg; i++) { > >> uint32_t flags; > >> > >> - flags = GNTMAP_host_map; > >> - if (pending_req->operation != BLKIF_OP_READ) > >> - flags |= GNTMAP_readonly; > >> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags, > >> - req->u.rw.seg[i].gref, > >> - pending_req->blkif->domid); > >> + if (use_persistent_gnts) > >> + persistent_gnt = get_persistent_gnt( > >> + &blkif->persistent_gnts, > >> + req->u.rw.seg[i].gref); > >> + > >> + if (persistent_gnt) { > >> + /* > >> + * We are using persistent grants and > >> + * the grant is already mapped > >> + */ > >> + new_map = 0; > >> + } else if (use_persistent_gnts && > >> + blkif->persistent_gnt_c < > >> + max_mapped_grant_pages(blkif->blk_protocol)) { > >> + /* > >> + * We are using persistent grants, the grant is > >> + * not mapped but we have room for it > >> + */ > >> + new_map = 1; > >> + persistent_gnt = kzalloc( > >> + sizeof(struct persistent_gnt), > >> + GFP_KERNEL); > >> + if (!persistent_gnt) > >> + return -ENOMEM; > >> + persistent_gnt->page = alloc_page(GFP_KERNEL); > >> + if (!persistent_gnt->page) { > >> + kfree(persistent_gnt); > >> + return -ENOMEM; > >> + } > >> + persistent_gnt->gnt = req->u.rw.seg[i].gref; > >> + > >> + pages_to_gnt[segs_to_map] > >> + persistent_gnt->page; > >> + addr = (unsigned long) pfn_to_kaddr( > >> + page_to_pfn(persistent_gnt->page)); > >> + > >> + add_persistent_gnt(&blkif->persistent_gnts, > >> + persistent_gnt); > >> + blkif->persistent_gnt_c++; > >> + pr_debug(DRV_PFX " grant %u 
added to the tree of persistent grants, using %u/%u\n", > >> + persistent_gnt->gnt, blkif->persistent_gnt_c, > >> + max_mapped_grant_pages(blkif->blk_protocol)); > >> + } else { > >> + /* > >> + * We are either using persistent grants and > >> + * hit the maximum limit of grants mapped, > >> + * or we are not using persistent grants. > >> + */ > >> + if (use_persistent_gnts && > >> + !blkif->vbd.overflow_max_grants) { > >> + blkif->vbd.overflow_max_grants = 1; > >> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n", > >> + blkif->domid, blkif->vbd.handle); > >> + } > >> + new_map = 1; > >> + pages[i] = blkbk->pending_page(pending_req, i); > >> + addr = vaddr(pending_req, i); > >> + pages_to_gnt[segs_to_map] > >> + blkbk->pending_page(pending_req, i); > >> + } > >> + > >> + if (persistent_gnt) { > >> + pages[i] = persistent_gnt->page; > >> + persistent_gnts[i] = persistent_gnt; > >> + } else { > >> + persistent_gnts[i] = NULL; > >> + } > >> + > >> + if (new_map) { > >> + flags = GNTMAP_host_map; > >> + if (!persistent_gnt && > >> + (pending_req->operation != BLKIF_OP_READ)) > >> + flags |= GNTMAP_readonly; > >> + gnttab_set_map_op(&map[segs_to_map++], addr, > >> + flags, req->u.rw.seg[i].gref, > >> + blkif->domid); > >> + } > >> } > >> > >> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg); > >> - BUG_ON(ret); > >> + if (segs_to_map) { > >> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map); > >> + BUG_ON(ret); > >> + } > >> > >> /* > >> * Now swizzle the MFN in our domain with the MFN from the other domain > >> * so that when we access vaddr(pending_req,i) it has the contents of > >> * the page from the other domain. 
> >> */ > >> - for (i = 0; i < nseg; i++) { > >> - if (unlikely(map[i].status != 0)) { > >> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); > >> - map[i].handle = BLKBACK_INVALID_HANDLE; > >> - ret |= 1; > >> + for (i = 0, j = 0; i < nseg; i++) { > >> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) { > >> + /* This is a newly mapped grant */ > >> + BUG_ON(j >= segs_to_map); > >> + if (unlikely(map[j].status != 0)) { > > > > I am not seeing j being incremented anywhere? Should it? > > Yes, it should be incremented, but not here. See the comment below to > see what I''ve changed. > > > > >> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); > >> + map[j].handle = BLKBACK_INVALID_HANDLE; > >> + ret |= 1; > >> + if (persistent_gnts[i]) { > >> + rb_erase(&persistent_gnts[i]->node, > >> + &blkif->persistent_gnts); > >> + blkif->persistent_gnt_c--; > >> + kfree(persistent_gnts[i]); > >> + persistent_gnts[i] = NULL; > >> + } > >> + } > >> + } > >> + if (persistent_gnts[i]) { > >> + if (!persistent_gnts[i]->handle) { > >> + /* > >> + * If this is a new persistent grant > >> + * save the handler > >> + */ > >> + persistent_gnts[i]->handle = map[j].handle; > >> + persistent_gnts[i]->dev_bus_addr > >> + map[j++].dev_bus_addr; > >> + } > >> + pending_handle(pending_req, i) > >> + persistent_gnts[i]->handle; > >> + pending_req->unmap_seg[i] = 0; > > > > Could we have a #define for that? > > Sure. > > >> + > >> + if (ret) > >> + continue; > > This should be > > if (ret) { > j++; > continue; > }<nods>> > >> + > >> + seg[i].buf = persistent_gnts[i]->dev_bus_addr | > >> + (req->u.rw.seg[i].first_sect << 9); > >> + } else { > >> + pending_handle(pending_req, i) = map[j].handle; > >> + pending_req->unmap_seg[i] = 1; > > > > And here as well? > > Done. 
> > >> + > >> + if (ret) > >> + continue; > >> + > >> + seg[i].buf = map[j++].dev_bus_addr | > >> + (req->u.rw.seg[i].first_sect << 9); > >> } > >> - > >> - pending_handle(pending_req, i) = map[i].handle; > >> - > >> - if (ret) > >> - continue; > >> - > >> - seg[i].buf = map[i].dev_bus_addr | > >> - (req->u.rw.seg[i].first_sect << 9); > >> } > >> return ret; > >> } > >> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > >> int operation; > >> struct blk_plug plug; > >> bool drain = false; > >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> > >> switch (req->operation) { > >> case BLKIF_OP_READ: > >> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > >> * the hypercall to unmap the grants - that is all done in > >> * xen_blkbk_unmap. > >> */ > >> - if (xen_blkbk_map(req, pending_req, seg)) > >> + if (xen_blkbk_map(req, pending_req, seg, pages)) > >> goto fail_flush; > >> > >> /* > >> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, > >> for (i = 0; i < nseg; i++) { > >> while ((bio == NULL) || > >> (bio_add_page(bio, > >> - blkbk->pending_page(pending_req, i), > >> + pages[i], > >> seg[i].nsec << 9, > >> seg[i].buf & ~PAGE_MASK) == 0)) { > >> > >> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h > >> index 9ad3b5e..6c08ee9 100644 > >> --- a/drivers/block/xen-blkback/common.h > >> +++ b/drivers/block/xen-blkback/common.h > >> @@ -34,6 +34,7 @@ > >> #include <linux/vmalloc.h> > >> #include <linux/wait.h> > >> #include <linux/io.h> > >> +#include <linux/rbtree.h> > >> #include <asm/setup.h> > >> #include <asm/pgalloc.h> > >> #include <asm/hypervisor.h> > >> @@ -160,10 +161,22 @@ struct xen_vbd { > >> sector_t size; > >> bool flush_support; > >> bool discard_secure; > >> + > >> + unsigned int feature_gnt_persistent:1; > >> + unsigned int overflow_max_grants:1; > > > > I think the v3.7-rc1 has this structure changed just a tiny bit, so you > 
> might want to rebase it on top of that. > > I''ve done the rebase on top of your linux-next branch, commit > ad502612c173fff239250c9fe9bdfaaef70b9901.Thx> > > > >> }; > >> > >> struct backend_info; > >> > >> + > >> +struct persistent_gnt { > >> + struct page *page; > >> + grant_ref_t gnt; > >> + grant_handle_t handle; > >> + uint64_t dev_bus_addr; > >> + struct rb_node node; > >> +}; > >> + > >> struct xen_blkif { > >> /* Unique identifier for this interface. */ > >> domid_t domid; > >> @@ -190,6 +203,10 @@ struct xen_blkif { > >> struct task_struct *xenblkd; > >> unsigned int waiting_reqs; > >> > >> + /* frontend feature information */ > > > > Huh? > > Changed it to: > > /* tree to store persistent grants */ > > >> + struct rb_root persistent_gnts; > >> + unsigned int persistent_gnt_c; > >> + > >> /* statistics */ > >> unsigned long st_print; > >> int st_rd_req; > >> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c > >> index 4f66171..9f88b4e 100644 > >> --- a/drivers/block/xen-blkback/xenbus.c > >> +++ b/drivers/block/xen-blkback/xenbus.c > >> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid) > >> atomic_set(&blkif->drain, 0); > >> blkif->st_print = jiffies; > >> init_waitqueue_head(&blkif->waiting_to_free); > >> + blkif->persistent_gnts.rb_node = NULL; > >> > >> return blkif; > >> } > >> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be) > >> struct xenbus_device *dev = be->dev; > >> unsigned long ring_ref; > >> unsigned int evtchn; > >> + unsigned int pers_grants; > >> char protocol[64] = ""; > >> int err; > >> > >> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be) > >> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol); > >> return -1; > >> } > >> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n", > >> - ring_ref, evtchn, be->blkif->blk_protocol, protocol); > >> + err = xenbus_gather(XBT_NIL, dev->otherend, > >> + 
"feature-persistent-grants", "%u", > >> + &pers_grants, NULL); > >> + if (err) > >> + pers_grants = 0; > >> + > >> + be->blkif->vbd.feature_gnt_persistent = pers_grants; > >> + be->blkif->vbd.overflow_max_grants = 0; > >> + > >> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n", > >> + ring_ref, evtchn, be->blkif->blk_protocol, protocol, > >> + pers_grants); > > > > Can you make that a string? So it is > > pers_grants ? "persistent grants" : "" > > > > instead of a zero of one value pls? > > Yes, done. > > >> > >> /* Map the shared frame, irq etc. */ > >> err = xen_blkif_map(be->blkif, ring_ref, evtchn); > >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c > >> index 2c2d2e5..206d422 100644 > >> --- a/drivers/block/xen-blkfront.c > >> +++ b/drivers/block/xen-blkfront.c > >> @@ -44,6 +44,7 @@ > >> #include <linux/mutex.h> > >> #include <linux/scatterlist.h> > >> #include <linux/bitmap.h> > >> +#include <linux/llist.h> > >> > >> #include <xen/xen.h> > >> #include <xen/xenbus.h> > >> @@ -64,10 +65,17 @@ enum blkif_state { > >> BLKIF_STATE_SUSPENDED, > >> }; > >> > >> +struct grant { > >> + grant_ref_t gref; > >> + unsigned long pfn; > >> + struct llist_node node; > >> +}; > >> + > >> struct blk_shadow { > >> struct blkif_request req; > >> struct request *request; > >> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> }; > >> > >> static DEFINE_MUTEX(blkfront_mutex); > >> @@ -97,6 +105,8 @@ struct blkfront_info > >> struct work_struct work; > >> struct gnttab_free_callback callback; > >> struct blk_shadow shadow[BLK_RING_SIZE]; > >> + struct llist_head persistent_gnts; > >> + unsigned int persistent_gnts_c; > >> unsigned long shadow_free; > >> unsigned int feature_flush; > >> unsigned int flush_op; > >> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req) > >> unsigned long id; > >> unsigned int fsect, lsect; > >> 
int i, ref; > >> + > >> + /* > >> + * Used to store if we are able to queue the request by just using > >> + * existing persistent grants (0), or if we have to get new grants, > > > > What does the zero mean? > > Frankly, no idea, I guess it was in Oliver''s patch and I missed to spot it. > > >> + * as there are not sufficiently many free. > >> + */ > >> + int new_persistent_gnts; > > > > I think this can be a bool? > > I agree. > > >> grant_ref_t gref_head; > >> + struct page *granted_page; > >> + struct grant *gnt_list_entry = NULL; > >> struct scatterlist *sg; > >> > >> if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) > >> return 1; > >> > >> - if (gnttab_alloc_grant_references( > >> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) { > >> - gnttab_request_free_callback( > >> - &info->callback, > >> - blkif_restart_queue_callback, > >> - info, > >> - BLKIF_MAX_SEGMENTS_PER_REQUEST); > >> - return 1; > >> - } > >> + /* Check if we have enought grants to allocate a requests */ > >> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) { > >> + new_persistent_gnts = 1; > >> + if (gnttab_alloc_grant_references( > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c, > >> + &gref_head) < 0) { > >> + gnttab_request_free_callback( > >> + &info->callback, > >> + blkif_restart_queue_callback, > >> + info, > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST); > >> + return 1; > >> + } > >> + } else > >> + new_persistent_gnts = 0; > >> > >> /* Fill out a communications ring structure. */ > >> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); > >> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req) > >> BLKIF_MAX_SEGMENTS_PER_REQUEST); > >> > >> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) { > >> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg))); > >> fsect = sg->offset >> 9; > >> lsect = fsect + (sg->length >> 9) - 1; > >> - /* install a grant reference. 
*/ > >> - ref = gnttab_claim_grant_reference(&gref_head); > >> - BUG_ON(ref == -ENOSPC); > >> > >> - gnttab_grant_foreign_access_ref( > >> - ref, > >> + if (info->persistent_gnts_c) { > >> + BUG_ON(llist_empty(&info->persistent_gnts)); > >> + gnt_list_entry = llist_entry( > >> + llist_del_first(&info->persistent_gnts), > >> + struct grant, node); > >> + > >> + ref = gnt_list_entry->gref; > >> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn); > >> + info->persistent_gnts_c--; > >> + } else { > >> + ref = gnttab_claim_grant_reference(&gref_head); > >> + BUG_ON(ref == -ENOSPC); > >> + > >> + gnt_list_entry > >> + kmalloc(sizeof(struct grant), > >> + GFP_ATOMIC); > >> + if (!gnt_list_entry) > >> + return -ENOMEM; > >> + > >> + granted_page = alloc_page(GFP_ATOMIC); > >> + if (!granted_page) { > >> + kfree(gnt_list_entry); > >> + return -ENOMEM; > >> + } > >> + > >> + gnt_list_entry->pfn > >> + page_to_pfn(granted_page); > >> + gnt_list_entry->gref = ref; > >> + > >> + buffer_mfn = pfn_to_mfn(page_to_pfn( > >> + granted_page)); > >> + gnttab_grant_foreign_access_ref(ref, > >> info->xbdev->otherend_id, > >> - buffer_mfn, > >> - rq_data_dir(req)); > >> + buffer_mfn, 0); > >> + } > >> + > >> + info->shadow[id].grants_used[i] = gnt_list_entry; > >> + > >> + if (rq_data_dir(req)) { > >> + char *bvec_data; > >> + void *shared_data; > >> + > >> + BUG_ON(sg->offset + sg->length > PAGE_SIZE); > >> + > >> + shared_data = kmap_atomic( > >> + pfn_to_page(gnt_list_entry->pfn)); > >> + bvec_data = kmap_atomic(sg_page(sg)); > >> + > >> + /* > >> + * this does not wipe data stored outside the > >> + * range sg->offset..sg->offset+sg->length. > >> + * Therefore, blkback *could* see data from > >> + * previous requests. This is OK as long as > >> + * persistent grants are shared with just one > >> + * domain. It may need refactoring if This > > .. this (lowercase it pls) > > Done. 
> > > > >> + * changes > >> + */ > >> + memcpy(shared_data + sg->offset, > >> + bvec_data + sg->offset, > >> + sg->length); > >> + > >> + kunmap_atomic(bvec_data); > >> + kunmap_atomic(shared_data); > >> + } > >> > >> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn); > >> ring_req->u.rw.seg[i] > >> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req) > >> /* Keep a private copy so we can reissue requests when recovering. */ > >> info->shadow[id].req = *ring_req; > >> > >> - gnttab_free_grant_references(gref_head); > >> + if (new_persistent_gnts) > >> + gnttab_free_grant_references(gref_head); > >> > >> return 0; > >> } > >> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size) > >> static void xlvbd_flush(struct blkfront_info *info) > >> { > >> blk_queue_flush(info->rq, info->feature_flush); > >> - printk(KERN_INFO "blkfront: %s: %s: %s\n", > >> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n", > > > > HA! By default, eh? > > Yes, you caught me, there''s a paragraph in the commit message that > explains that we are using persistent grants in the frontend > unconditionally, since the protocol is compatible (you can have a > persistent blkfront and a non-persistent blkback). It simplifies the > logic in blkfront. Are you OK with it?It is OK, but you should be checking whether the backend supports it. I don''t see it checking the info->feature_persistent_grant to print that.> > >> info->gd->disk_name, > >> info->flush_op == BLKIF_OP_WRITE_BARRIER ? > >> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ? > >> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work) > >> > >> static void blkif_free(struct blkfront_info *info, int suspend) > >> { > >> + struct llist_node *all_gnts; > >> + struct grant *persistent_gnt; > >> + > >> /* Prevent new requests being issued until we fix things up. */ > >> spin_lock_irq(&info->io_lock); > >> info->connected = suspend ? 
> >> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >> 	/* No more blkif_request(). */
> >> 	if (info->rq)
> >> 		blk_stop_queue(info->rq);
> >> +
> >> +	/* Remove all persistent grants */
> >> +	if (info->persistent_gnts_c) {
> >> +		all_gnts = llist_del_all(&info->persistent_gnts);
> >> +		llist_for_each_entry(persistent_gnt, all_gnts, node) {
> >> +			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> >> +			kfree(persistent_gnt);
> >> +		}
> >> +		info->persistent_gnts_c = 0;
> >> +	}
> >> +
> >> 	/* No more gnttab callback work. */
> >> 	gnttab_cancel_free_callback(&info->callback);
> >> 	spin_unlock_irq(&info->io_lock);
> >> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >>
> >> }
> >>
> >> -static void blkif_completion(struct blk_shadow *s)
> >> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> >> +			     struct blkif_response *bret)
> >> {
> >> 	int i;
> >> -	/* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
> >> -	 * flag. */
> >> -	for (i = 0; i < s->req.u.rw.nr_segments; i++)
> >> -		gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
> >> +	struct bio_vec *bvec;
> >> +	struct req_iterator iter;
> >> +	unsigned long flags;
> >> +	char *bvec_data;
> >> +	void *shared_data;
> >> +	unsigned int offset = 0;
> >> +
> >> +	if (bret->operation == BLKIF_OP_READ) {
> >> +		/*
> >> +		 * Copy the data received from the backend into the bvec.
> >> +		 * Since bv_len can be different from PAGE_SIZE, we need to
> >> +		 * be sure we are actually copying the data from the right
> >> +		 * shared page.
> >> +		 */
> >> +		rq_for_each_segment(bvec, s->request, iter) {
> >> +			BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
> >> +			i = offset >> PAGE_SHIFT;
>
> Could you also include a comment about the bug you found here, pls?

There's a comment before the rq_for_each_segment loop, that tries to
explain that, do you think the following is better?
>
> /*
>  * Copy the data received from the backend into the bvec.
>  * Since bv_offset can be different than 0, and bv_len different
>  * than PAGE_SIZE, we have to keep track of the current offset,
>  * to be sure we are copying the data from the right shared page.

Yes. That is good.

>  */
>
> >> +			shared_data = kmap_atomic(
> >> +				pfn_to_page(s->grants_used[i]->pfn));
> >> +			bvec_data = bvec_kmap_irq(bvec, &flags);
> >> +			memcpy(bvec_data, shared_data + bvec->bv_offset,
> >> +			       bvec->bv_len);
> >> +			bvec_kunmap_irq(bvec_data, &flags);
> >> +			kunmap_atomic(shared_data);
> >> +			offset += bvec->bv_len;
> >> +		}
> >> +	}
> >> +	/* Add the persistent grant into the list of free grants */
> >> +	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> >> +		llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
> >> +		info->persistent_gnts_c++;
> >> +	}
> >> }
> >>
> >> static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >> 	req = info->shadow[id].request;
> >>
> >> 	if (bret->operation != BLKIF_OP_DISCARD)
> >> -		blkif_completion(&info->shadow[id]);
> >> +		blkif_completion(&info->shadow[id], info, bret);
> >>
> >> 	if (add_id_to_freelist(info, id)) {
> >> 		WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> >> @@ -942,6 +1066,11 @@ again:
> >> 		message = "writing protocol";
> >> 		goto abort_transaction;
> >> 	}
> >> +	err = xenbus_printf(xbt, dev->nodename,
> >> +			    "feature-persistent-grants", "%d", 1);
>
> So it's %u in blkback, but %d in here? Which one should it be?

%u in both places.
> >> +	if (err)
> >> +		dev_warn(&dev->dev,
> >> +			 "writing persistent grants feature to xenbus");
> >>
> >> 	err = xenbus_transaction_end(xbt, 0);
> >> 	if (err) {
> >> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
> >> 	spin_lock_init(&info->io_lock);
> >> 	info->xbdev = dev;
> >> 	info->vdevice = vdevice;
> >> +	init_llist_head(&info->persistent_gnts);
> >> +	info->persistent_gnts_c = 0;
> >> 	info->connected = BLKIF_STATE_DISCONNECTED;
> >> 	INIT_WORK(&info->work, blkif_restart_queue);
> >>
> >> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
> >> 			req->u.rw.seg[j].gref,
> >> 			info->xbdev->otherend_id,
> >> 			pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> >> -			rq_data_dir(info->shadow[req->u.rw.id].request));
> >> +			0);
> >> 	}
> >> 	info->shadow[req->u.rw.id].req = *req;
> >>
> >> --
> >> 1.7.7.5 (Apple Git-26)
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at http://www.tux.org/lkml/
> >>
Roger Pau Monné
2012-Oct-23 18:09 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
>>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>>>> index c6decb9..2b982b2 100644
>>>> --- a/drivers/block/xen-blkback/blkback.c
>>>> +++ b/drivers/block/xen-blkback/blkback.c
>>>> @@ -78,6 +78,7 @@ struct pending_req {
>>>> 	unsigned short		operation;
>>>> 	int			status;
>>>> 	struct list_head	free_list;
>>>> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];

Should I change this to a bool? Since we are only setting it to 0 or 1.

>>> Perhaps there should be a #define for that array..
>>
>> Do you mean something like:
>>
>> #define unmap(req, i) req->unmap_seg[i]
>
> I was thinking that you just check for req->unmap_seg[i] to
> have a non-zero value. But since that array is just used as a check
> to see whether the functionality is enabled (or not), you might want
> to declare the right values so:
> #define UNMAP_SG_ON 1
> #define UNMAP_SG_OFF 0
>
> or so.

Agreed, will add the defines.

>>>> +		if (persistent_gnts[i]) {
>>>> +			if (!persistent_gnts[i]->handle) {
>>>> +				/*
>>>> +				 * If this is a new persistent grant
>>>> +				 * save the handler
>>>> +				 */
>>>> +				persistent_gnts[i]->handle = map[j].handle;
>>>> +				persistent_gnts[i]->dev_bus_addr =
>>>> +					map[j++].dev_bus_addr;
>>>> +			}
>>>> +			pending_handle(pending_req, i) =
>>>> +				persistent_gnts[i]->handle;
>>>> +			pending_req->unmap_seg[i] = 0;
>>>
>>> Could we have a #define for that?
>>
>> Sure.

I've used the previous macro, so it looks like:

unmap(req, i) = UNMAP_SG_OFF;

I'm not sure if this is what you meant, or if you were interested in
defining a set of macros like:

#define check_unmap(req, i) req->unmap_seg[i]
#define unset_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_OFF
#define set_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_ON

I would go for the first option (the unmap macro that can be used here
and in xen_blkbk_unmap).

>>> HA! By default, eh?
>>
>> Yes, you caught me, there's a paragraph in the commit message that
>> explains that we are using persistent grants in the frontend
>> unconditionally, since the protocol is compatible (you can have a
>> persistent blkfront and a non-persistent blkback). It simplifies the
>> logic in blkfront. Are you OK with it?
>
> It is OK, but you should be checking whether the backend supports it.
> I don't see it checking the info->feature_persistent_grant to print
> that.

I don't understand why blkfront needs to check if the backend supports
persistent grants, blkfront is going to use persistent grants anyway.
Konrad Rzeszutek Wilk
2012-Oct-23 18:50 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
> >>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> >>>> index c6decb9..2b982b2 100644
> >>>> --- a/drivers/block/xen-blkback/blkback.c
> >>>> +++ b/drivers/block/xen-blkback/blkback.c
> >>>> @@ -78,6 +78,7 @@ struct pending_req {
> >>>> 	unsigned short		operation;
> >>>> 	int			status;
> >>>> 	struct list_head	free_list;
> >>>> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>
> Should I change this to a bool? Since we are only setting it to 0 or 1.

I would just keep it as 'int'. Eventually we can replace this with a
bit-map, but that can be done later.

> >>> Perhaps there should be a #define for that array..
> >>
> >> Do you mean something like:
> >>
> >> #define unmap(req, i) req->unmap_seg[i]
> >
> > I was thinking that you just check for req->unmap_seg[i] to
> > have a non-zero value. But since that array is just used as a check
> > to see whether the functionality is enabled (or not), you might want
> > to declare the right values so:
> > #define UNMAP_SG_ON 1
> > #define UNMAP_SG_OFF 0
> >
> > or so.
>
> Agreed, will add the defines.
>
> >>>> +		if (persistent_gnts[i]) {
> >>>> +			if (!persistent_gnts[i]->handle) {
> >>>> +				/*
> >>>> +				 * If this is a new persistent grant
> >>>> +				 * save the handler
> >>>> +				 */
> >>>> +				persistent_gnts[i]->handle = map[j].handle;
> >>>> +				persistent_gnts[i]->dev_bus_addr =
> >>>> +					map[j++].dev_bus_addr;
> >>>> +			}
> >>>> +			pending_handle(pending_req, i) =
> >>>> +				persistent_gnts[i]->handle;
> >>>> +			pending_req->unmap_seg[i] = 0;
> >>>
> >>> Could we have a #define for that?
> >>
> >> Sure.
>
> I've used the previous macro, so it looks like:
>
> unmap(req, i) = UNMAP_SG_OFF;
>
> I'm not sure if this is what you meant, or if you were interested in
> defining a set of macros like:
>
> #define check_unmap(req, i) req->unmap_seg[i]
> #define unset_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_OFF
> #define set_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_ON
>
> I would go for the first option (the unmap macro that can be used here
> and in xen_blkbk_unmap).

I was just thinking something as simple as

if (req->unmap_seg[i] == UNMAP_SG_OFF)
	continue;

And the #defines are just for the hard-coded values of 0 or 1.

>
> >>> HA! By default, eh?
> >>
> >> Yes, you caught me, there's a paragraph in the commit message that
> >> explains that we are using persistent grants in the frontend
> >> unconditionally, since the protocol is compatible (you can have a
> >> persistent blkfront and a non-persistent blkback). It simplifies the
> >> logic in blkfront. Are you OK with it?
> >
> > It is OK, but you should be checking whether the backend supports it.
> > I don't see it checking the info->feature_persistent_grant to print
> > that.
>
> I don't understand why blkfront needs to check if the backend supports
> persistent grants, blkfront is going to use persistent grants anyway.

What if it does not (say this guest runs on an older xen-blkback)?
Then you would be still printing 'persistent grants' in the blkfront.

>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
Jan Beulich
2012-Oct-24 07:40 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
>>> On 23.10.12 at 20:50, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
> On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
>> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
>> >>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>> >>>> index c6decb9..2b982b2 100644
>> >>>> --- a/drivers/block/xen-blkback/blkback.c
>> >>>> +++ b/drivers/block/xen-blkback/blkback.c
>> >>>> @@ -78,6 +78,7 @@ struct pending_req {
>> >>>> 	unsigned short		operation;
>> >>>> 	int			status;
>> >>>> 	struct list_head	free_list;
>> >>>> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>> Should I change this to a bool? Since we are only setting it to 0 or 1.
>
> I would just keep it as 'int'. Eventually we can replace this with a
> bit-map, but that can be done later.

I think this should be a bitmap from the beginning - why would
you want to waste 44 bytes per request for something that fits
in a single unsigned long (and the picture would get worse with
the number-of-segments extension)?

Also - am I taking this work being done here as a silent agreement
to not invest into blkif2 to streamline the various extensions
floating around?

Jan
Roger Pau Monné
2012-Oct-24 10:45 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On 23/10/12 20:50, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
>> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
>>>>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>>>>>> index c6decb9..2b982b2 100644
>>>>>> --- a/drivers/block/xen-blkback/blkback.c
>>>>>> +++ b/drivers/block/xen-blkback/blkback.c
>>>>>> @@ -78,6 +78,7 @@ struct pending_req {
>>>>>> 	unsigned short		operation;
>>>>>> 	int			status;
>>>>>> 	struct list_head	free_list;
>>>>>> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>> Should I change this to a bool? Since we are only setting it to 0 or 1.
>
> I would just keep it as 'int'. Eventually we can replace this with a
> bit-map, but that can be done later.

I've already changed it to a bitmap.

>>>>> Perhaps there should be a #define for that array..
>>>>
>>>> Do you mean something like:
>>>>
>>>> #define unmap(req, i) req->unmap_seg[i]
>>>
>>> I was thinking that you just check for req->unmap_seg[i] to
>>> have a non-zero value. But since that array is just used as a check
>>> to see whether the functionality is enabled (or not), you might want
>>> to declare the right values so:
>>> #define UNMAP_SG_ON 1
>>> #define UNMAP_SG_OFF 0
>>>
>>> or so.
>>
>> Agreed, will add the defines.
>>
>>>>>> +		if (persistent_gnts[i]) {
>>>>>> +			if (!persistent_gnts[i]->handle) {
>>>>>> +				/*
>>>>>> +				 * If this is a new persistent grant
>>>>>> +				 * save the handler
>>>>>> +				 */
>>>>>> +				persistent_gnts[i]->handle = map[j].handle;
>>>>>> +				persistent_gnts[i]->dev_bus_addr =
>>>>>> +					map[j++].dev_bus_addr;
>>>>>> +			}
>>>>>> +			pending_handle(pending_req, i) =
>>>>>> +				persistent_gnts[i]->handle;
>>>>>> +			pending_req->unmap_seg[i] = 0;
>>>>>
>>>>> Could we have a #define for that?
>>>>
>>>> Sure.
>>
>> I've used the previous macro, so it looks like:
>>
>> unmap(req, i) = UNMAP_SG_OFF;
>>
>> I'm not sure if this is what you meant, or if you were interested in
>> defining a set of macros like:
>>
>> #define check_unmap(req, i) req->unmap_seg[i]
>> #define unset_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_OFF
>> #define set_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_ON
>>
>> I would go for the first option (the unmap macro that can be used here
>> and in xen_blkbk_unmap).
>
> I was just thinking something as simple as
>
> if (req->unmap_seg[i] == UNMAP_SG_OFF)
> 	continue;
>
> And the #defines are just for the hard-coded values of 0 or 1.
>
>>
>>>>> HA! By default, eh?
>>>>
>>>> Yes, you caught me, there's a paragraph in the commit message that
>>>> explains that we are using persistent grants in the frontend
>>>> unconditionally, since the protocol is compatible (you can have a
>>>> persistent blkfront and a non-persistent blkback). It simplifies the
>>>> logic in blkfront. Are you OK with it?
>>>
>>> It is OK, but you should be checking whether the backend supports it.
>>> I don't see it checking the info->feature_persistent_grant to print
>>> that.
>>
>> I don't understand why blkfront needs to check if the backend supports
>> persistent grants, blkfront is going to use persistent grants anyway.
>
> What if it does not (say this guest runs on an older xen-blkback)?
> Then you would be still printing 'persistent grants' in the blkfront.

Ok, I get your point. Now blkfront will only report the use of
persistent grants if the backend supports it.

If there are no further comments I will send v2 after doing some tests,
thanks for the reviews.
Konrad Rzeszutek Wilk
2012-Oct-25 12:40 UTC
Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On Wed, Oct 24, 2012 at 08:40:10AM +0100, Jan Beulich wrote:
> >>> On 23.10.12 at 20:50, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
> > On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
> >> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
> >> >>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> >> >>>> index c6decb9..2b982b2 100644
> >> >>>> --- a/drivers/block/xen-blkback/blkback.c
> >> >>>> +++ b/drivers/block/xen-blkback/blkback.c
> >> >>>> @@ -78,6 +78,7 @@ struct pending_req {
> >> >>>> 	unsigned short		operation;
> >> >>>> 	int			status;
> >> >>>> 	struct list_head	free_list;
> >> >>>> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >>
> >> Should I change this to a bool? Since we are only setting it to 0 or 1.
> >
> > I would just keep it as 'int'. Eventually we can replace this with a
> > bit-map, but that can be done later.
>
> I think this should be a bitmap from the beginning - why would
> you want to waste 44 bytes per request for something that fits
> in a single unsigned long (and the picture would get worse with
> the number-of-segments extension)?
>
> Also - am I taking this work being done here as a silent agreement
> to not invest into blkif2 to streamline the various extensions
> floating around?

I haven't been able (time-wise) to look at making blkif2 a possibility
and I don't want to hinder this work - which provides great performance
benefits.