Daniel Stodden
2010-Nov-12 23:31 UTC
[Xen-devel] blktap: Sync with XCP, dropping zero-copy.
Hi all. This is the better half of what XCP developments and testing brought for blktap. It''s fairly a big change in how I/O buffers are managed. Prior to this series, we had zero-copy I/O down to userspace. Unfortunately, blktap2 always had to jump through a couple of extra loops to do so. Present state of that is that we dropped that, so all tapdev I/O is bounced to/from a bunch of normal pages. Essentially replacing the old VMA management with a couple insert/zap VM calls. One issue was that the kernel can''t cope with recursive I/O. Submitting an iovec on a tapdev, passing it to userspace and then reissuing the same vector via AIO apparently doesn''t fit well with the lock protocol applied to those pages. This is the main reason why blktap had to deal a lot with grant refs. About as much as blkback already does before passing requests on. What happens there is that it''s aliasing those granted pages under a different PFN, thereby in a separate page struct. Not pretty, but it worked, so it''s not the reason why we chose to drop that at some point. The more prevalent problem was network storage, especially anything involving TCP. That includes VHD on both NFS and iSCSI. The problem with those is that retransmits (by the transport) and I/O op completion (on the application layer) are never synchronized. With sufficiently bad timing and bit of jitter on the network, it''s perfectly common for the kernel to complete an AIO request with a late ack on the input queue just when retransmission timer is about to fire underneath. The completion will unmap the granted frame, crashing any uncanceled retransmission on an empty page frame. There are different ways to deal with that. Page destructors might be one way, but as far as I heard they are not particularly popular upstream. Issuing the block I/O on dom0 memory is straightforward and avoids the hassle. One could go argue that retransmits after DIO completion are still a potential privacy problem (I did), but it''s not Xen''s problem after all. If zero-copy becomes more attractive again, the plan would be to rather use grantdev in userspace, such as a filter driver for tapdisk instead. Until then, there''s presumably a notable difference in L2 cache footprint. Then again, there''s also a whole number of cycles not spent in redundant hypercalls now, to deal with the pseudophysical map. There are also benefits or non-issues. - This blktap is rather xen-independent. Certainly depends on the common ring macros, but lacking grant stuff it compiles on bare metal Linux with no CONFIG_XEN. Not consummated here, because that''s going to move the source tree out of drivers/xen. But I''d like to post a new branch proposing to do so. - Blktaps size in dom0 didn''t really change. Frames (now pages) were always pooled. We used to balloon memory to claim space for redundant grant mappings. Now we reserve, by default, the same volume in normal memory. - The previous code would runs all I/O on a single pool. Typically two rings worth of requests. Sufficient for a whole lot of systems, especially with single storage backends, but not so nice when I/O on a number of otherwise independent filers or volumes collides. Pools are refcounted kobjects in sysfs. Toolstacks using the new code can thereby choose to elimitate bottlenecks by grouping taps on different buffer pools. Pools can also be resized, to accomodate greater queue depths. [Note that blkback still has the same issue, so guests won''t take advantage of that before that''s resolved as well.] - XCP started to make some use of stacking tapdevs. Think pointing the image chain of a bunch of "leaf" taps to a shared parent node. That works fairly well, but definitely takes independent resource pools to avoid deadlock by parent starvation then. Please pull upstream/xen/dom0/backend/blktap2 from git://xenbits.xensource.com/people/dstodden/linux.git .. and/or upstream/xen/next for a merge. I also pulled in the pending warning fix from Teck Choon Giam. Cheers, Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-12 23:31 UTC
[Xen-devel] [PATCH 1/5] blktap: Manage segment buffers in mempools.
- Replaces the request free list with a (mempooled) slab. - Replaces request buckets with a mempool. No buckets, because we''re doing full s/g on page granularity anyway, so can gfp() independent pages everywhere. Allocations are 1-11 page-sized segments. - Adds support for multiple page pools. - Adds pools to sysfs. Linked as a ''pools'' kset to blktap-control. - Makes the per-tap pool selectable. Attribute ''pool'' on the tap device. - Make pools online-resizeable. Attributes free/size on the pool kobj. Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> --- drivers/xen/blktap/blktap.h | 35 ++- drivers/xen/blktap/control.c | 80 ++++++-- drivers/xen/blktap/device.c | 2 +- drivers/xen/blktap/request.c | 509 +++++++++++++++++++++++++----------------- drivers/xen/blktap/ring.c | 10 +- drivers/xen/blktap/sysfs.c | 36 +++ 6 files changed, 433 insertions(+), 239 deletions(-) diff --git a/drivers/xen/blktap/blktap.h b/drivers/xen/blktap/blktap.h index a29b509..ad79c15 100644 --- a/drivers/xen/blktap/blktap.h +++ b/drivers/xen/blktap/blktap.h @@ -121,17 +121,19 @@ struct blktap_statistics { }; struct blktap_request { + struct blktap *tap; struct request *rq; uint16_t usr_idx; uint8_t status; atomic_t pendcnt; - uint8_t nr_pages; unsigned short operation; struct timeval time; struct grant_handle_pair handles[BLKIF_MAX_SEGMENTS_PER_REQUEST]; - struct list_head free_list; + + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + int nr_pages; }; struct blktap { @@ -140,6 +142,7 @@ struct blktap { struct blktap_ring ring; struct blktap_device device; + struct blktap_page_pool *pool; int pending_cnt; struct blktap_request *pending_requests[MAX_PENDING_REQS]; @@ -152,6 +155,13 @@ struct blktap { struct blktap_statistics stats; }; +struct blktap_page_pool { + struct mempool_s *bufs; + spinlock_t lock; + struct kobject kobj; + wait_queue_head_t wait; +}; + extern struct mutex blktap_lock; extern struct blktap **blktaps; extern int blktap_max_minor; @@ -165,7 +175,6 @@ size_t blktap_ring_debug(struct blktap *, char *, size_t); int blktap_ring_create(struct blktap *); int blktap_ring_destroy(struct blktap *); void blktap_ring_kick_user(struct blktap *); -void blktap_ring_kick_all(void); int blktap_sysfs_init(void); void blktap_sysfs_exit(void); @@ -181,19 +190,23 @@ void blktap_device_destroy_sync(struct blktap *); int blktap_device_run_queue(struct blktap *); void blktap_device_end_request(struct blktap *, struct blktap_request *, int); -int blktap_request_pool_init(void); -void blktap_request_pool_free(void); -int blktap_request_pool_grow(void); -int blktap_request_pool_shrink(void); -struct blktap_request *blktap_request_allocate(struct blktap *); +int blktap_page_pool_init(struct kobject *); +void blktap_page_pool_exit(void); +struct blktap_page_pool *blktap_page_pool_get(const char *); + +size_t blktap_request_debug(struct blktap *, char *, size_t); +struct blktap_request *blktap_request_alloc(struct blktap *); +int blktap_request_get_pages(struct blktap *, struct blktap_request *, int); void blktap_request_free(struct blktap *, struct blktap_request *); -struct page *request_to_page(struct blktap_request *, int); +void blktap_request_bounce(struct blktap *, struct blktap_request *, int, int); static inline unsigned long request_to_kaddr(struct blktap_request *req, int seg) { - unsigned long pfn = page_to_pfn(request_to_page(req, seg)); - return (unsigned long)pfn_to_kaddr(pfn); + return (unsigned long)page_address(req->pages[seg]); } +#define request_to_page(_request, _seg) ((_request)->pages[_seg]) + + #endif diff --git a/drivers/xen/blktap/control.c b/drivers/xen/blktap/control.c index ef54fa1..8652e07 100644 --- a/drivers/xen/blktap/control.c +++ b/drivers/xen/blktap/control.c @@ -1,7 +1,7 @@ #include <linux/module.h> #include <linux/sched.h> #include <linux/miscdevice.h> - +#include <linux/device.h> #include <asm/uaccess.h> #include "blktap.h" @@ -10,6 +10,7 @@ DEFINE_MUTEX(blktap_lock); struct blktap **blktaps; int blktap_max_minor; +static struct blktap_page_pool *default_pool; static struct blktap * blktap_control_get_minor(void) @@ -83,6 +84,9 @@ blktap_control_create_tap(void) if (!tap) return NULL; + kobject_get(&default_pool->kobj); + tap->pool = default_pool; + err = blktap_ring_create(tap); if (err) goto fail_tap; @@ -110,6 +114,8 @@ blktap_control_destroy_tap(struct blktap *tap) if (err) return err; + kobject_put(&tap->pool->kobj); + blktap_sysfs_destroy(tap); blktap_control_put_minor(tap); @@ -166,12 +172,43 @@ static struct file_operations blktap_control_file_operations = { .ioctl = blktap_control_ioctl, }; -static struct miscdevice blktap_misc = { +static struct miscdevice blktap_control = { .minor = MISC_DYNAMIC_MINOR, .name = "blktap-control", .fops = &blktap_control_file_operations, }; +static struct device *control_device; + +static ssize_t +blktap_control_show_default_pool(struct device *device, + struct device_attribute *attr, + char *buf) +{ + return sprintf(buf, "%s", kobject_name(&default_pool->kobj)); +} + +static ssize_t +blktap_control_store_default_pool(struct device *device, + struct device_attribute *attr, + const char *buf, size_t size) +{ + struct blktap_page_pool *pool, *tmp = default_pool; + + pool = blktap_page_pool_get(buf); + if (IS_ERR(pool)) + return PTR_ERR(pool); + + default_pool = pool; + kobject_put(&tmp->kobj); + + return size; +} + +static DEVICE_ATTR(default_pool, S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH, + blktap_control_show_default_pool, + blktap_control_store_default_pool); + size_t blktap_control_debug(struct blktap *tap, char *buf, size_t size) { @@ -190,12 +227,11 @@ blktap_control_init(void) { int err; - err = misc_register(&blktap_misc); - if (err) { - blktap_misc.minor = MISC_DYNAMIC_MINOR; - BTERR("misc_register failed for control device"); + err = misc_register(&blktap_control); + if (err) return err; - } + + control_device = blktap_control.this_device; blktap_max_minor = min(64, MAX_BLKTAP_DEVICE); blktaps = kzalloc(blktap_max_minor * sizeof(blktaps[0]), GFP_KERNEL); @@ -204,20 +240,39 @@ blktap_control_init(void) return -ENOMEM; } + err = blktap_page_pool_init(&control_device->kobj); + if (err) + return err; + + default_pool = blktap_page_pool_get("default"); + if (!default_pool) + return -ENOMEM; + + err = device_create_file(control_device, &dev_attr_default_pool); + if (err) + return err; + return 0; } static void blktap_control_exit(void) { + if (default_pool) { + kobject_put(&default_pool->kobj); + default_pool = NULL; + } + + blktap_page_pool_exit(); + if (blktaps) { kfree(blktaps); blktaps = NULL; } - if (blktap_misc.minor != MISC_DYNAMIC_MINOR) { - misc_deregister(&blktap_misc); - blktap_misc.minor = MISC_DYNAMIC_MINOR; + if (control_device) { + misc_deregister(&blktap_control); + control_device = NULL; } } @@ -228,7 +283,6 @@ blktap_exit(void) blktap_ring_exit(); blktap_sysfs_exit(); blktap_device_exit(); - blktap_request_pool_free(); } static int __init @@ -239,10 +293,6 @@ blktap_init(void) if (!xen_pv_domain()) return -ENODEV; - err = blktap_request_pool_init(); - if (err) - return err; - err = blktap_device_init(); if (err) goto fail; diff --git a/drivers/xen/blktap/device.c b/drivers/xen/blktap/device.c index 3acb8fa..ed95548 100644 --- a/drivers/xen/blktap/device.c +++ b/drivers/xen/blktap/device.c @@ -605,7 +605,7 @@ blktap_device_run_queue(struct blktap *tap) break; } - request = blktap_request_allocate(tap); + request = blktap_request_alloc(tap); if (!request) { tap->stats.st_oo_req++; goto wait; diff --git a/drivers/xen/blktap/request.c b/drivers/xen/blktap/request.c index eee7100..ca12442 100644 --- a/drivers/xen/blktap/request.c +++ b/drivers/xen/blktap/request.c @@ -1,297 +1,400 @@ +#include <linux/mempool.h> #include <linux/spinlock.h> -#include <xen/balloon.h> +#include <linux/mutex.h> #include <linux/sched.h> +#include <linux/device.h> +#include <xen/balloon.h> #include "blktap.h" -#define MAX_BUCKETS 8 -#define BUCKET_SIZE MAX_PENDING_REQS +/* max pages per shared pool. just to prevent accidental dos. */ +#define POOL_MAX_PAGES (256*BLKIF_MAX_SEGMENTS_PER_REQUEST) -#define BLKTAP_POOL_CLOSING 1 +/* default page pool size. when considering to shrink a shared pool, + * note that paused tapdisks may grab a whole lot of pages for a long + * time. */ +#define POOL_DEFAULT_PAGES (2 * MMAP_PAGES) -struct blktap_request_bucket; +/* max number of pages allocatable per request. */ +#define POOL_MAX_REQUEST_PAGES BLKIF_MAX_SEGMENTS_PER_REQUEST -struct blktap_request_handle { - int slot; - uint8_t inuse; - struct blktap_request request; - struct blktap_request_bucket *bucket; -}; +/* min request structs per pool. These grow dynamically. */ +#define POOL_MIN_REQS BLK_RING_SIZE -struct blktap_request_bucket { - atomic_t reqs_in_use; - struct blktap_request_handle handles[BUCKET_SIZE]; - struct page **foreign_pages; -}; +static struct kset *pool_set; -struct blktap_request_pool { - spinlock_t lock; - uint8_t status; - struct list_head free_list; - atomic_t reqs_in_use; - wait_queue_head_t wait_queue; - struct blktap_request_bucket *buckets[MAX_BUCKETS]; -}; +#define kobj_to_pool(_kobj) \ + container_of(_kobj, struct blktap_page_pool, kobj) -static struct blktap_request_pool pool; - -static inline struct blktap_request_handle * -blktap_request_to_handle(struct blktap_request *req) -{ - return container_of(req, struct blktap_request_handle, request); -} +static struct kmem_cache *request_cache; +static mempool_t *request_pool; static void -blktap_request_pool_init_request(struct blktap_request *request) +__page_pool_wake(struct blktap_page_pool *pool) { - int i; - - request->usr_idx = -1; - request->nr_pages = 0; - request->status = BLKTAP_REQUEST_FREE; - INIT_LIST_HEAD(&request->free_list); - for (i = 0; i < ARRAY_SIZE(request->handles); i++) { - request->handles[i].user = INVALID_GRANT_HANDLE; - request->handles[i].kernel = INVALID_GRANT_HANDLE; - } + mempool_t *mem = pool->bufs; + + /* + NB. slightly wasteful to always wait for a full segment + set. but this ensures the next disk makes + progress. presently, the repeated request struct + alloc/release cycles would otherwise keep everyone spinning. + */ + + if (mem->curr_nr >= POOL_MAX_REQUEST_PAGES) + wake_up(&pool->wait); } -static int -blktap_request_pool_allocate_bucket(void) +int +blktap_request_get_pages(struct blktap *tap, + struct blktap_request *request, int nr_pages) { - int i, idx; - unsigned long flags; - struct blktap_request *request; - struct blktap_request_handle *handle; - struct blktap_request_bucket *bucket; + struct blktap_page_pool *pool = tap->pool; + mempool_t *mem = pool->bufs; + struct page *page; - bucket = kzalloc(sizeof(struct blktap_request_bucket), GFP_KERNEL); - if (!bucket) - goto fail; + BUG_ON(request->nr_pages != 0); + BUG_ON(nr_pages > POOL_MAX_REQUEST_PAGES); - bucket->foreign_pages = alloc_empty_pages_and_pagevec(MMAP_PAGES); - if (!bucket->foreign_pages) - goto fail; + if (mem->curr_nr < nr_pages) + return -ENOMEM; - spin_lock_irqsave(&pool.lock, flags); + /* NB. avoid thundering herds of tapdisks colliding. */ + spin_lock(&pool->lock); - idx = -1; - for (i = 0; i < MAX_BUCKETS; i++) { - if (!pool.buckets[i]) { - idx = i; - pool.buckets[idx] = bucket; - break; - } + if (mem->curr_nr < nr_pages) { + spin_unlock(&pool->lock); + return -ENOMEM; } - if (idx == -1) { - spin_unlock_irqrestore(&pool.lock, flags); - goto fail; + while (request->nr_pages < nr_pages) { + page = mempool_alloc(mem, GFP_NOWAIT); + BUG_ON(!page); + request->pages[request->nr_pages++] = page; } - for (i = 0; i < BUCKET_SIZE; i++) { - handle = bucket->handles + i; - request = &handle->request; + spin_unlock(&pool->lock); - handle->slot = i; - handle->inuse = 0; - handle->bucket = bucket; + return 0; +} + +static void +blktap_request_put_pages(struct blktap *tap, + struct blktap_request *request) +{ + struct blktap_page_pool *pool = tap->pool; + struct page *page; - blktap_request_pool_init_request(request); - list_add_tail(&request->free_list, &pool.free_list); + while (request->nr_pages) { + page = request->pages[--request->nr_pages]; + mempool_free(page, pool->bufs); } +} - spin_unlock_irqrestore(&pool.lock, flags); +size_t +blktap_request_debug(struct blktap *tap, char *buf, size_t size) +{ + struct blktap_page_pool *pool = tap->pool; + mempool_t *mem = pool->bufs; + char *s = buf, *end = buf + size; - return 0; + s += snprintf(buf, end - s, + "pool:%s pages:%d free:%d\n", + kobject_name(&pool->kobj), + mem->min_nr, mem->curr_nr); -fail: - if (bucket && bucket->foreign_pages) - free_empty_pages_and_pagevec(bucket->foreign_pages, MMAP_PAGES); - kfree(bucket); - return -ENOMEM; + return s - buf; } -static void -blktap_request_pool_free_bucket(struct blktap_request_bucket *bucket) +struct blktap_request* +blktap_request_alloc(struct blktap *tap) { - if (!bucket) - return; + struct blktap_request *request; - BTDBG("freeing bucket %p\n", bucket); + request = mempool_alloc(request_pool, GFP_NOWAIT); + if (request) + request->tap = tap; - free_empty_pages_and_pagevec(bucket->foreign_pages, MMAP_PAGES); - kfree(bucket); + return request; } -struct page * -request_to_page(struct blktap_request *req, int seg) +void +blktap_request_free(struct blktap *tap, + struct blktap_request *request) { - struct blktap_request_handle *handle = blktap_request_to_handle(req); - int idx = handle->slot * BLKIF_MAX_SEGMENTS_PER_REQUEST + seg; - return handle->bucket->foreign_pages[idx]; + blktap_request_put_pages(tap, request); + + mempool_free(request, request_pool); + + __page_pool_wake(tap->pool); } -int -blktap_request_pool_shrink(void) +static void +blktap_request_ctor(void *obj) +{ + struct blktap_request *request = obj; + + memset(request, 0, sizeof(*request)); + sg_init_table(request->sg_table, ARRAY_SIZE(request->sg_table)); +} + +static int +blktap_page_pool_resize(struct blktap_page_pool *pool, int target) { - int i, err; - unsigned long flags; - struct blktap_request_bucket *bucket; + mempool_t *bufs = pool->bufs; + int err; + + /* NB. mempool asserts min_nr >= 1 */ + target = max(1, target); + + err = mempool_resize(bufs, target, GFP_KERNEL); + if (err) + return err; - err = -EAGAIN; + __page_pool_wake(pool); - spin_lock_irqsave(&pool.lock, flags); + return 0; +} - /* always keep at least one bucket */ - for (i = 1; i < MAX_BUCKETS; i++) { - bucket = pool.buckets[i]; - if (!bucket) - continue; +struct pool_attribute { + struct attribute attr; - if (atomic_read(&bucket->reqs_in_use)) - continue; + ssize_t (*show)(struct blktap_page_pool *pool, + char *buf); - blktap_request_pool_free_bucket(bucket); - pool.buckets[i] = NULL; - err = 0; - break; - } + ssize_t (*store)(struct blktap_page_pool *pool, + const char *buf, size_t count); +}; - spin_unlock_irqrestore(&pool.lock, flags); +#define kattr_to_pool_attr(_kattr) \ + container_of(_kattr, struct pool_attribute, attr) - return err; +static ssize_t +blktap_page_pool_show_size(struct blktap_page_pool *pool, + char *buf) +{ + mempool_t *mem = pool->bufs; + return sprintf(buf, "%d", mem->min_nr); } -int -blktap_request_pool_grow(void) +static ssize_t +blktap_page_pool_store_size(struct blktap_page_pool *pool, + const char *buf, size_t size) { - return blktap_request_pool_allocate_bucket(); + int target; + + /* + * NB. target fixup to avoid undesired results. less than a + * full segment set can wedge the disk. much more than a + * couple times the physical queue depth is rarely useful. + */ + + target = simple_strtoul(buf, NULL, 0); + target = max(POOL_MAX_REQUEST_PAGES, target); + target = min(target, POOL_MAX_PAGES); + + return blktap_page_pool_resize(pool, target) ? : size; } -struct blktap_request * -blktap_request_allocate(struct blktap *tap) +static struct pool_attribute blktap_page_pool_attr_size + __ATTR(size, S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH, + blktap_page_pool_show_size, + blktap_page_pool_store_size); + +static ssize_t +blktap_page_pool_show_free(struct blktap_page_pool *pool, + char *buf) { - int i; - uint16_t usr_idx; - unsigned long flags; - struct blktap_request *request; + mempool_t *mem = pool->bufs; + return sprintf(buf, "%d", mem->curr_nr); +} - usr_idx = -1; - request = NULL; +static struct pool_attribute blktap_page_pool_attr_free + __ATTR(free, S_IRUSR|S_IRGRP|S_IROTH, + blktap_page_pool_show_free, + NULL); - spin_lock_irqsave(&pool.lock, flags); +static struct attribute *blktap_page_pool_attrs[] = { + &blktap_page_pool_attr_size.attr, + &blktap_page_pool_attr_free.attr, + NULL, +}; - if (pool.status == BLKTAP_POOL_CLOSING) - goto out; +static inline struct kobject* +__blktap_kset_find_obj(struct kset *kset, const char *name) +{ + struct kobject *k; + struct kobject *ret = NULL; - for (i = 0; i < ARRAY_SIZE(tap->pending_requests); i++) - if (!tap->pending_requests[i]) { - usr_idx = i; + spin_lock(&kset->list_lock); + list_for_each_entry(k, &kset->list, entry) { + if (kobject_name(k) && !strcmp(kobject_name(k), name)) { + ret = kobject_get(k); break; } - - if (usr_idx == (uint16_t)-1) - goto out; - - if (!list_empty(&pool.free_list)) { - request = list_entry(pool.free_list.next, - struct blktap_request, free_list); - list_del(&request->free_list); } + spin_unlock(&kset->list_lock); + return ret; +} - if (request) { - struct blktap_request_handle *handle; +static ssize_t +blktap_page_pool_show_attr(struct kobject *kobj, struct attribute *kattr, + char *buf) +{ + struct blktap_page_pool *pool = kobj_to_pool(kobj); + struct pool_attribute *attr = kattr_to_pool_attr(kattr); - atomic_inc(&pool.reqs_in_use); + if (attr->show) + return attr->show(pool, buf); - handle = blktap_request_to_handle(request); - atomic_inc(&handle->bucket->reqs_in_use); - handle->inuse = 1; + return -EIO; +} - request->usr_idx = usr_idx; +static ssize_t +blktap_page_pool_store_attr(struct kobject *kobj, struct attribute *kattr, + const char *buf, size_t size) +{ + struct blktap_page_pool *pool = kobj_to_pool(kobj); + struct pool_attribute *attr = kattr_to_pool_attr(kattr); - tap->pending_requests[usr_idx] = request; - tap->pending_cnt++; - } + if (attr->show) + return attr->store(pool, buf, size); -out: - spin_unlock_irqrestore(&pool.lock, flags); - return request; + return -EIO; } -void -blktap_request_free(struct blktap *tap, struct blktap_request *request) +static struct sysfs_ops blktap_page_pool_sysfs_ops = { + .show = blktap_page_pool_show_attr, + .store = blktap_page_pool_store_attr, +}; + +static void +blktap_page_pool_release(struct kobject *kobj) { - int free; - unsigned long flags; - struct blktap_request_handle *handle; + struct blktap_page_pool *pool = kobj_to_pool(kobj); + mempool_destroy(pool->bufs); + kfree(pool); +} - BUG_ON(request->usr_idx >= ARRAY_SIZE(tap->pending_requests)); - handle = blktap_request_to_handle(request); +struct kobj_type blktap_page_pool_ktype = { + .release = blktap_page_pool_release, + .sysfs_ops = &blktap_page_pool_sysfs_ops, + .default_attrs = blktap_page_pool_attrs, +}; + +static void* +__mempool_page_alloc(gfp_t gfp_mask, void *pool_data) +{ + struct page *page; - spin_lock_irqsave(&pool.lock, flags); + if (!(gfp_mask & __GFP_WAIT)) + return NULL; - handle->inuse = 0; - tap->pending_requests[request->usr_idx] = NULL; - blktap_request_pool_init_request(request); - list_add(&request->free_list, &pool.free_list); - atomic_dec(&handle->bucket->reqs_in_use); - free = atomic_dec_and_test(&pool.reqs_in_use); - tap->pending_cnt--; + page = alloc_page(gfp_mask); + if (page) + SetPageReserved(page); - spin_unlock_irqrestore(&pool.lock, flags); + return page; +} - if (free) - wake_up(&pool.wait_queue); +static void +__mempool_page_free(void *element, void *pool_data) +{ + struct page *page = element; - blktap_ring_kick_all(); + ClearPageReserved(page); + put_page(page); } -void -blktap_request_pool_free(void) +static struct kobject* +blktap_page_pool_create(const char *name, int nr_pages) { - int i; - unsigned long flags; + struct blktap_page_pool *pool; + int err; - spin_lock_irqsave(&pool.lock, flags); + pool = kzalloc(sizeof(*pool), GFP_KERNEL); + if (!pool) + goto fail; - pool.status = BLKTAP_POOL_CLOSING; - while (atomic_read(&pool.reqs_in_use)) { - spin_unlock_irqrestore(&pool.lock, flags); - wait_event(pool.wait_queue, !atomic_read(&pool.reqs_in_use)); - spin_lock_irqsave(&pool.lock, flags); - } + spin_lock_init(&pool->lock); + init_waitqueue_head(&pool->wait); - for (i = 0; i < MAX_BUCKETS; i++) { - blktap_request_pool_free_bucket(pool.buckets[i]); - pool.buckets[i] = NULL; - } + pool->bufs = mempool_create(nr_pages, + __mempool_page_alloc, __mempool_page_free, + pool); + if (!pool->bufs) + goto fail_pool; + + kobject_init(&pool->kobj, &blktap_page_pool_ktype); + pool->kobj.kset = pool_set; + err = kobject_add(&pool->kobj, &pool_set->kobj, "%s", name); + if (err) + goto fail_bufs; + + return &pool->kobj; - spin_unlock_irqrestore(&pool.lock, flags); + kobject_del(&pool->kobj); +fail_bufs: + mempool_destroy(pool->bufs); +fail_pool: + kfree(pool); +fail: + return NULL; } -int __init -blktap_request_pool_init(void) +struct blktap_page_pool* +blktap_page_pool_get(const char *name) { - int i, err; + struct kobject *kobj; + + kobj = __blktap_kset_find_obj(pool_set, name); + if (!kobj) + kobj = blktap_page_pool_create(name, + POOL_DEFAULT_PAGES); + if (!kobj) + return ERR_PTR(-ENOMEM); - memset(&pool, 0, sizeof(pool)); + return kobj_to_pool(kobj); +} + +int __init +blktap_page_pool_init(struct kobject *parent) +{ + request_cache + kmem_cache_create("blktap-request", + sizeof(struct blktap_request), 0, + 0, blktap_request_ctor); + if (!request_cache) + return -ENOMEM; + + request_pool + mempool_create_slab_pool(POOL_MIN_REQS, request_cache); + if (!request_pool) + return -ENOMEM; + + pool_set = kset_create_and_add("pools", NULL, parent); + if (!pool_set) + return -ENOMEM; - spin_lock_init(&pool.lock); - INIT_LIST_HEAD(&pool.free_list); - atomic_set(&pool.reqs_in_use, 0); - init_waitqueue_head(&pool.wait_queue); + return 0; +} - for (i = 0; i < 2; i++) { - err = blktap_request_pool_allocate_bucket(); - if (err) - goto fail; +void +blktap_page_pool_exit(void) +{ + if (pool_set) { + BUG_ON(!list_empty(&pool_set->list)); + kset_unregister(pool_set); + pool_set = NULL; } - return 0; + if (request_pool) { + mempool_destroy(request_pool); + request_pool = NULL; + } -fail: - blktap_request_pool_free(); - return err; + if (request_cache) { + kmem_cache_destroy(request_cache); + request_cache = NULL; + } } diff --git a/drivers/xen/blktap/ring.c b/drivers/xen/blktap/ring.c index 057e97f..a72a1b3 100644 --- a/drivers/xen/blktap/ring.c +++ b/drivers/xen/blktap/ring.c @@ -17,8 +17,6 @@ int blktap_ring_major; static struct cdev blktap_ring_cdev; -static DECLARE_WAIT_QUEUE_HEAD(blktap_poll_wait); - static inline struct blktap * vma_to_blktap(struct vm_area_struct *vma) { @@ -409,7 +407,7 @@ static unsigned int blktap_ring_poll(struct file *filp, poll_table *wait) struct blktap_ring *ring = &tap->ring; int work = 0; - poll_wait(filp, &blktap_poll_wait, wait); + poll_wait(filp, &tap->pool->wait, wait); poll_wait(filp, &ring->poll_wait, wait); down_read(¤t->mm->mmap_sem); @@ -440,12 +438,6 @@ blktap_ring_kick_user(struct blktap *tap) wake_up(&tap->ring.poll_wait); } -void -blktap_ring_kick_all(void) -{ - wake_up(&blktap_poll_wait); -} - int blktap_ring_destroy(struct blktap *tap) { diff --git a/drivers/xen/blktap/sysfs.c b/drivers/xen/blktap/sysfs.c index e573549..7bbfea8 100644 --- a/drivers/xen/blktap/sysfs.c +++ b/drivers/xen/blktap/sysfs.c @@ -104,6 +104,8 @@ blktap_sysfs_debug_device(struct device *dev, struct device_attribute *attr, cha s += blktap_control_debug(tap, s, end - s); + s += blktap_request_debug(tap, s, end - s); + s += blktap_device_debug(tap, s, end - s); s += blktap_ring_debug(tap, s, end - s); @@ -129,6 +131,38 @@ blktap_sysfs_show_task(struct device *dev, struct device_attribute *attr, char * } static DEVICE_ATTR(task, S_IRUGO, blktap_sysfs_show_task, NULL); +static ssize_t +blktap_sysfs_show_pool(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct blktap *tap = dev_get_drvdata(dev); + return sprintf(buf, "%s", kobject_name(&tap->pool->kobj)); +} + +static ssize_t +blktap_sysfs_store_pool(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t size) +{ + struct blktap *tap = dev_get_drvdata(dev); + struct blktap_page_pool *pool, *tmp = tap->pool; + + if (tap->device.gd) + return -EBUSY; + + pool = blktap_page_pool_get(buf); + if (IS_ERR(pool)) + return PTR_ERR(pool); + + tap->pool = pool; + kobject_put(&tmp->kobj); + + return size; +} +DEVICE_ATTR(pool, S_IRUSR|S_IWUSR, + blktap_sysfs_show_pool, blktap_sysfs_store_pool); + int blktap_sysfs_create(struct blktap *tap) { @@ -151,6 +185,8 @@ blktap_sysfs_create(struct blktap *tap) if (!err) err = device_create_file(dev, &dev_attr_task); if (!err) + err = device_create_file(dev, &dev_attr_pool); + if (!err) ring->dev = dev; else device_unregister(dev); -- 1.7.0.4 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-12 23:31 UTC
[Xen-devel] [PATCH 2/5] blktap: Make VMAs non-foreign and bounce buffered.
Drop zero-copy I/O. Removes all grantmap mechanism from blktap, bouncing I/O from/to a pool of dom0 memory and request SGs. Essentially renders blktap without any residual dependencies on Xen whatsoever. Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> --- drivers/xen/blktap/blktap.h | 43 ++-- drivers/xen/blktap/control.c | 8 +- drivers/xen/blktap/device.c | 564 ++++++------------------------------------ drivers/xen/blktap/request.c | 20 ++- drivers/xen/blktap/ring.c | 319 +++++++++++++------------ 5 files changed, 285 insertions(+), 669 deletions(-) diff --git a/drivers/xen/blktap/blktap.h b/drivers/xen/blktap/blktap.h index ad79c15..fe63fc9 100644 --- a/drivers/xen/blktap/blktap.h +++ b/drivers/xen/blktap/blktap.h @@ -7,7 +7,6 @@ #include <linux/init.h> #include <linux/scatterlist.h> #include <xen/blkif.h> -#include <xen/grant_table.h> extern int blktap_debug_level; extern int blktap_ring_major; @@ -27,7 +26,6 @@ extern int blktap_device_major; #define MAX_BLKTAP_DEVICE 1024 -#define BLKTAP_CONTROL 1 #define BLKTAP_DEVICE 4 #define BLKTAP_DEVICE_CLOSED 5 #define BLKTAP_SHUTDOWN_REQUESTED 8 @@ -94,11 +92,13 @@ struct blktap_ring { struct task_struct *task; struct vm_area_struct *vma; - struct blkif_front_ring ring; - struct vm_foreign_map foreign_map; + struct blkif_front_ring ring; unsigned long ring_vstart; unsigned long user_vstart; + int n_pending; + struct blktap_request *pending[MAX_PENDING_REQS]; + wait_queue_head_t poll_wait; dev_t devno; @@ -123,19 +123,21 @@ struct blktap_statistics { struct blktap_request { struct blktap *tap; struct request *rq; - uint16_t usr_idx; - - uint8_t status; - atomic_t pendcnt; - unsigned short operation; + int usr_idx; + int operation; struct timeval time; - struct grant_handle_pair handles[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + struct scatterlist sg_table[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; int nr_pages; }; +#define blktap_for_each_sg(_sg, _req, _i) \ + for (_sg = (_req)->sg_table, _i = 0; \ + _i < (_req)->nr_pages; \ + (_sg)++, (_i)++) + struct blktap { int minor; unsigned long dev_inuse; @@ -144,10 +146,6 @@ struct blktap { struct blktap_device device; struct blktap_page_pool *pool; - int pending_cnt; - struct blktap_request *pending_requests[MAX_PENDING_REQS]; - struct scatterlist sg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; - wait_queue_head_t remove_wait; struct work_struct remove_work; char name[BLKTAP2_MAX_MESSAGE_LEN]; @@ -174,6 +172,13 @@ void blktap_ring_exit(void); size_t blktap_ring_debug(struct blktap *, char *, size_t); int blktap_ring_create(struct blktap *); int blktap_ring_destroy(struct blktap *); +struct blktap_request *blktap_ring_make_request(struct blktap *); +void blktap_ring_free_request(struct blktap *,struct blktap_request *); +void blktap_ring_submit_request(struct blktap *, struct blktap_request *); +int blktap_ring_map_request_segment(struct blktap *, struct blktap_request *, int); +int blktap_ring_map_request(struct blktap *, struct blktap_request *); +void blktap_ring_unmap_request(struct blktap *, struct blktap_request *); +void blktap_ring_set_message(struct blktap *, int); void blktap_ring_kick_user(struct blktap *); int blktap_sysfs_init(void); @@ -187,7 +192,7 @@ size_t blktap_device_debug(struct blktap *, char *, size_t); int blktap_device_create(struct blktap *, struct blktap_params *); int blktap_device_destroy(struct blktap *); void blktap_device_destroy_sync(struct blktap *); -int blktap_device_run_queue(struct blktap *); +void blktap_device_run_queue(struct blktap *); void blktap_device_end_request(struct blktap *, struct blktap_request *, int); int blktap_page_pool_init(struct kobject *); @@ -200,13 +205,5 @@ int blktap_request_get_pages(struct blktap *, struct blktap_request *, int); void blktap_request_free(struct blktap *, struct blktap_request *); void blktap_request_bounce(struct blktap *, struct blktap_request *, int, int); -static inline unsigned long -request_to_kaddr(struct blktap_request *req, int seg) -{ - return (unsigned long)page_address(req->pages[seg]); -} - -#define request_to_page(_request, _seg) ((_request)->pages[_seg]) - #endif diff --git a/drivers/xen/blktap/control.c b/drivers/xen/blktap/control.c index 8652e07..f339bba 100644 --- a/drivers/xen/blktap/control.c +++ b/drivers/xen/blktap/control.c @@ -18,13 +18,10 @@ blktap_control_get_minor(void) int minor; struct blktap *tap; - tap = kmalloc(sizeof(*tap), GFP_KERNEL); + tap = kzalloc(sizeof(*tap), GFP_KERNEL); if (unlikely(!tap)) return NULL; - memset(tap, 0, sizeof(*tap)); - sg_init_table(tap->sg, BLKIF_MAX_SEGMENTS_PER_REQUEST); - mutex_lock(&blktap_lock); for (minor = 0; minor < blktap_max_minor; minor++) @@ -290,9 +287,6 @@ blktap_init(void) { int err; - if (!xen_pv_domain()) - return -ENODEV; - err = blktap_device_init(); if (err) goto fail; diff --git a/drivers/xen/blktap/device.c b/drivers/xen/blktap/device.c index ed95548..02e1fc8 100644 --- a/drivers/xen/blktap/device.c +++ b/drivers/xen/blktap/device.c @@ -2,27 +2,11 @@ #include <linux/blkdev.h> #include <linux/cdrom.h> #include <linux/hdreg.h> -#include <linux/module.h> -#include <asm/tlbflush.h> - #include <scsi/scsi.h> #include <scsi/scsi_ioctl.h> -#include <xen/xenbus.h> -#include <xen/interface/io/blkif.h> - -#include <asm/xen/page.h> -#include <asm/xen/hypercall.h> - #include "blktap.h" -#include "../blkback/blkback-pagemap.h" - -struct blktap_grant_table { - int cnt; - struct gnttab_map_grant_ref grants[BLKIF_MAX_SEGMENTS_PER_REQUEST * 2]; -}; - int blktap_device_major; #define dev_to_blktap(_dev) container_of(_dev, struct blktap, device) @@ -119,526 +103,136 @@ static struct block_device_operations blktap_device_file_operations = { .getgeo = blktap_device_getgeo }; -static int -blktap_map_uaddr_fn(pte_t *ptep, struct page *pmd_page, - unsigned long addr, void *data) -{ - pte_t *pte = (pte_t *)data; - - BTDBG("ptep %p -> %012llx\n", ptep, pte_val(*pte)); - set_pte(ptep, *pte); - return 0; -} - -static int -blktap_map_uaddr(struct mm_struct *mm, unsigned long address, pte_t pte) -{ - return apply_to_page_range(mm, address, - PAGE_SIZE, blktap_map_uaddr_fn, &pte); -} - -static int -blktap_umap_uaddr_fn(pte_t *ptep, struct page *pmd_page, - unsigned long addr, void *data) -{ - struct mm_struct *mm = (struct mm_struct *)data; - - BTDBG("ptep %p\n", ptep); - pte_clear(mm, addr, ptep); - return 0; -} - -static int -blktap_umap_uaddr(struct mm_struct *mm, unsigned long address) -{ - return apply_to_page_range(mm, address, - PAGE_SIZE, blktap_umap_uaddr_fn, mm); -} - -static inline void -flush_tlb_kernel_page(unsigned long kvaddr) -{ - flush_tlb_kernel_range(kvaddr, kvaddr + PAGE_SIZE); -} - -static void -blktap_device_end_dequeued_request(struct blktap_device *dev, - struct request *req, int error) -{ - unsigned long flags; - int ret; - - //spin_lock_irq(&dev->lock); - spin_lock_irqsave(dev->gd->queue->queue_lock, flags); - ret = __blk_end_request(req, error, blk_rq_bytes(req)); - spin_unlock_irqrestore(dev->gd->queue->queue_lock, flags); - //spin_unlock_irq(&dev->lock); - - BUG_ON(ret); -} - -static void -blktap_device_fast_flush(struct blktap *tap, struct blktap_request *request) -{ - uint64_t ptep; - int ret, usr_idx; - unsigned int i, cnt; - struct page **map, *page; - struct blktap_ring *ring; - struct grant_handle_pair *khandle; - unsigned long kvaddr, uvaddr, offset; - struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST * 2]; - - cnt = 0; - ring = &tap->ring; - usr_idx = request->usr_idx; - map = ring->foreign_map.map; - - if (!ring->vma) - return; - - if (xen_feature(XENFEAT_auto_translated_physmap)) - zap_page_range(ring->vma, - MMAP_VADDR(ring->user_vstart, usr_idx, 0), - request->nr_pages << PAGE_SHIFT, NULL); - - for (i = 0; i < request->nr_pages; i++) { - kvaddr = request_to_kaddr(request, i); - uvaddr = MMAP_VADDR(ring->user_vstart, usr_idx, i); - - khandle = request->handles + i; - - if (khandle->kernel != INVALID_GRANT_HANDLE) { - gnttab_set_unmap_op(&unmap[cnt], kvaddr, - GNTMAP_host_map, khandle->kernel); - cnt++; - set_phys_to_machine(__pa(kvaddr) >> PAGE_SHIFT, - INVALID_P2M_ENTRY); - } - - if (khandle->user != INVALID_GRANT_HANDLE) { - BUG_ON(xen_feature(XENFEAT_auto_translated_physmap)); - if (create_lookup_pte_addr(ring->vma->vm_mm, - uvaddr, &ptep) != 0) { - BTERR("Couldn''t get a pte addr!\n"); - return; - } - - gnttab_set_unmap_op(&unmap[cnt], ptep, - GNTMAP_host_map - | GNTMAP_application_map - | GNTMAP_contains_pte, - khandle->user); - cnt++; - } - - offset = (uvaddr - ring->vma->vm_start) >> PAGE_SHIFT; - - BTDBG("offset: 0x%08lx, page: %p, request: %p, usr_idx: %d, " - "seg: %d, kvaddr: 0x%08lx, khandle: %u, uvaddr: " - "0x%08lx, handle: %u\n", offset, map[offset], request, - usr_idx, i, kvaddr, khandle->kernel, uvaddr, - khandle->user); - - page = map[offset]; - if (page && blkback_pagemap_contains_page(page)) - set_page_private(page, 0); - - map[offset] = NULL; - - khandle->kernel = INVALID_GRANT_HANDLE; - khandle->user = INVALID_GRANT_HANDLE; - } - - if (cnt) { - ret = HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, - unmap, cnt); - BUG_ON(ret); - } - - if (!xen_feature(XENFEAT_auto_translated_physmap)) - zap_page_range(ring->vma, - MMAP_VADDR(ring->user_vstart, usr_idx, 0), - request->nr_pages << PAGE_SHIFT, NULL); -} - -static void -blktap_unmap(struct blktap *tap, struct blktap_request *request) -{ - int i, usr_idx; - unsigned long kvaddr; - - usr_idx = request->usr_idx; - - for (i = 0; i < request->nr_pages; i++) { - kvaddr = request_to_kaddr(request, i); - BTDBG("request: %p, seg: %d, kvaddr: 0x%08lx, khandle: %u, " - "uvaddr: 0x%08lx, uhandle: %u\n", request, i, - kvaddr, request->handles[i].kernel, - MMAP_VADDR(tap->ring.user_vstart, usr_idx, i), - request->handles[i].user); - - if (request->handles[i].kernel == INVALID_GRANT_HANDLE) { - blktap_umap_uaddr(current->mm, kvaddr); - flush_tlb_kernel_page(kvaddr); - set_phys_to_machine(__pa(kvaddr) >> PAGE_SHIFT, - INVALID_P2M_ENTRY); - } - } - - blktap_device_fast_flush(tap, request); -} - void blktap_device_end_request(struct blktap *tap, struct blktap_request *request, int error) { struct blktap_device *tapdev = &tap->device; + struct request *rq = request->rq; + + blktap_ring_unmap_request(tap, request); + + blktap_ring_free_request(tap, request); - blktap_unmap(tap, request); + dev_dbg(&tapdev->gd->dev, + "end_request: op=%d error=%d bytes=%d\n", + rq_data_dir(rq), error, blk_rq_bytes(rq)); spin_lock_irq(&tapdev->lock); end_request(request->rq, !error); spin_unlock_irq(&tapdev->lock); - - blktap_request_free(tap, request); } -static int -blktap_prep_foreign(struct blktap *tap, - struct blktap_request *request, - struct blkif_request *blkif_req, - unsigned int seg, struct page *page, - struct blktap_grant_table *table) -{ - uint64_t ptep; - uint32_t flags; -#ifdef BLKTAP_CHAINED_BLKTAP - struct page *tap_page; -#endif - struct blktap_ring *ring; - struct blkback_pagemap map; - unsigned long uvaddr, kvaddr; - - ring = &tap->ring; - map = blkback_pagemap_read(page); - blkif_req->seg[seg].gref = map.gref; - - uvaddr = MMAP_VADDR(ring->user_vstart, request->usr_idx, seg); - kvaddr = request_to_kaddr(request, seg); - flags = GNTMAP_host_map | - (request->operation == BLKIF_OP_WRITE ? GNTMAP_readonly : 0); - - gnttab_set_map_op(&table->grants[table->cnt], - kvaddr, flags, map.gref, map.domid); - table->cnt++; - - -#ifdef BLKTAP_CHAINED_BLKTAP - /* enable chained tap devices */ - tap_page = request_to_page(request, seg); - set_page_private(tap_page, page_private(page)); - SetPageBlkback(tap_page); -#endif - - if (xen_feature(XENFEAT_auto_translated_physmap)) - return 0; - - if (create_lookup_pte_addr(ring->vma->vm_mm, uvaddr, &ptep)) { - BTERR("couldn''t get a pte addr!\n"); - return -1; - } - - flags |= GNTMAP_application_map | GNTMAP_contains_pte; - gnttab_set_map_op(&table->grants[table->cnt], - ptep, flags, map.gref, map.domid); - table->cnt++; - - return 0; -} - -static int -blktap_map_foreign(struct blktap *tap, - struct blktap_request *request, - struct blkif_request *blkif_req, - struct blktap_grant_table *table) +int +blktap_device_make_request(struct blktap *tap, struct request *rq) { - struct page *page; - int i, grant, err, usr_idx; - struct blktap_ring *ring; - unsigned long uvaddr, foreign_mfn; - - if (!table->cnt) - return 0; + struct blktap_device *tapdev = &tap->device; + struct blktap_request *request; + int write, nsegs; + int err; - err = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, - table->grants, table->cnt); - BUG_ON(err); + request = blktap_ring_make_request(tap); + if (IS_ERR(request)) { + err = PTR_ERR(request); + request = NULL; - grant = 0; - usr_idx = request->usr_idx; - ring = &tap->ring; + if (err == -ENOSPC || err == -ENOMEM) + goto stop; - for (i = 0; i < request->nr_pages; i++) { - if (!blkif_req->seg[i].gref) - continue; + goto fail; + } - uvaddr = MMAP_VADDR(ring->user_vstart, usr_idx, i); + write = rq_data_dir(rq) == WRITE; + nsegs = blk_rq_map_sg(rq->q, rq, request->sg_table); - if (unlikely(table->grants[grant].status)) { - BTERR("invalid kernel buffer: could not remap it\n"); - err |= 1; - table->grants[grant].handle = INVALID_GRANT_HANDLE; - } + dev_dbg(&tapdev->gd->dev, + "make_request: op=%c bytes=%d nsegs=%d\n", + write ? ''w'' : ''r'', blk_rq_bytes(rq), nsegs); - request->handles[i].kernel = table->grants[grant].handle; - foreign_mfn = table->grants[grant].dev_bus_addr >> PAGE_SHIFT; - grant++; + request->rq = rq; + request->operation = write ? BLKIF_OP_WRITE : BLKIF_OP_READ; - if (xen_feature(XENFEAT_auto_translated_physmap)) - goto done; - - if (unlikely(table->grants[grant].status)) { - BTERR("invalid user buffer: could not remap it\n"); - err |= 1; - table->grants[grant].handle = INVALID_GRANT_HANDLE; - } + err = blktap_request_get_pages(tap, request, nsegs); + if (err) + goto stop; - request->handles[i].user = table->grants[grant].handle; - grant++; + err = blktap_ring_map_request(tap, request); + if (err) + goto fail; - done: - if (err) - continue; + blktap_ring_submit_request(tap, request); - page = request_to_page(request, i); + return 0; - if (!xen_feature(XENFEAT_auto_translated_physmap)) - set_phys_to_machine(page_to_pfn(page), - FOREIGN_FRAME(foreign_mfn)); - else if (vm_insert_page(ring->vma, uvaddr, page)) - err |= 1; +stop: + tap->stats.st_oo_req++; + err = -EBUSY; - BTDBG("pending_req: %p, seg: %d, page: %p, " - "kvaddr: 0x%p, khandle: %u, uvaddr: 0x%08lx, " - "uhandle: %u\n", request, i, page, - pfn_to_kaddr(page_to_pfn(page)), - request->handles[i].kernel, - uvaddr, request->handles[i].user); - } +_out: + if (request) + blktap_ring_free_request(tap, request); return err; -} - -static void -blktap_map(struct blktap *tap, - struct blktap_request *request, - unsigned int seg, struct page *page) -{ - pte_t pte; - int usr_idx; - struct blktap_ring *ring; - unsigned long uvaddr, kvaddr; - - ring = &tap->ring; - usr_idx = request->usr_idx; - uvaddr = MMAP_VADDR(ring->user_vstart, usr_idx, seg); - kvaddr = request_to_kaddr(request, seg); - - pte = mk_pte(page, ring->vma->vm_page_prot); - blktap_map_uaddr(ring->vma->vm_mm, uvaddr, pte_mkwrite(pte)); - flush_tlb_page(ring->vma, uvaddr); - blktap_map_uaddr(ring->vma->vm_mm, kvaddr, mk_pte(page, PAGE_KERNEL)); - flush_tlb_kernel_page(kvaddr); - - set_phys_to_machine(__pa(kvaddr) >> PAGE_SHIFT, pte_mfn(pte)); - request->handles[seg].kernel = INVALID_GRANT_HANDLE; - request->handles[seg].user = INVALID_GRANT_HANDLE; - - BTDBG("pending_req: %p, seg: %d, page: %p, kvaddr: 0x%08lx, " - "uvaddr: 0x%08lx\n", request, seg, page, kvaddr, - uvaddr); -} - -static int -blktap_device_process_request(struct blktap *tap, - struct blktap_request *request, - struct request *req) -{ - struct page *page; - int i, usr_idx, err; - struct blktap_ring *ring; - struct scatterlist *sg; - struct blktap_grant_table table; - unsigned int fsect, lsect, nr_sects; - unsigned long offset, uvaddr; - struct blkif_request blkif_req, *target; - - err = -1; - memset(&table, 0, sizeof(table)); - - ring = &tap->ring; - usr_idx = request->usr_idx; - blkif_req.id = usr_idx; - blkif_req.sector_number = (blkif_sector_t)req->sector; - blkif_req.handle = 0; - blkif_req.operation = rq_data_dir(req) ? - BLKIF_OP_WRITE : BLKIF_OP_READ; - - request->rq = req; - request->operation = blkif_req.operation; - request->status = BLKTAP_REQUEST_PENDING; - do_gettimeofday(&request->time); - - nr_sects = 0; - request->nr_pages = 0; - blkif_req.nr_segments = blk_rq_map_sg(req->q, req, tap->sg); - BUG_ON(blkif_req.nr_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST); - for (i = 0; i < blkif_req.nr_segments; ++i) { - sg = tap->sg + i; - fsect = sg->offset >> 9; - lsect = fsect + (sg->length >> 9) - 1; - nr_sects += sg->length >> 9; - - blkif_req.seg[i] - (struct blkif_request_segment) { - .gref = 0, - .first_sect = fsect, - .last_sect = lsect }; - - if (blkback_pagemap_contains_page(sg_page(sg))) { - /* foreign page -- use xen */ - if (blktap_prep_foreign(tap, - request, - &blkif_req, - i, - sg_page(sg), - &table)) - goto out; - } else { - /* do it the old fashioned way */ - blktap_map(tap, - request, - i, - sg_page(sg)); - } - - uvaddr = MMAP_VADDR(ring->user_vstart, usr_idx, i); - offset = (uvaddr - ring->vma->vm_start) >> PAGE_SHIFT; - page = request_to_page(request, i); - ring->foreign_map.map[offset] = page; - SetPageReserved(page); - - BTDBG("mapped uaddr %08lx to page %p pfn 0x%lx\n", - uvaddr, page, page_to_pfn(page)); - BTDBG("offset: 0x%08lx, pending_req: %p, seg: %d, " - "page: %p, kvaddr: %p, uvaddr: 0x%08lx\n", - offset, request, i, - page, pfn_to_kaddr(page_to_pfn(page)), uvaddr); - - request->nr_pages++; - } - - if (blktap_map_foreign(tap, request, &blkif_req, &table)) - goto out; - - /* Finally, write the request message to the user ring. */ - target = RING_GET_REQUEST(&ring->ring, ring->ring.req_prod_pvt); - memcpy(target, &blkif_req, sizeof(blkif_req)); - target->id = request->usr_idx; - wmb(); /* blktap_poll() reads req_prod_pvt asynchronously */ - ring->ring.req_prod_pvt++; - - if (rq_data_dir(req)) { - tap->stats.st_wr_sect += nr_sects; - tap->stats.st_wr_req++; - } else { - tap->stats.st_rd_sect += nr_sects; - tap->stats.st_rd_req++; - } - - err = 0; - -out: - if (err) - blktap_device_fast_flush(tap, request); - return err; +fail: + if (printk_ratelimit()) + dev_warn(&tapdev->gd->dev, + "make request: %d, failing\n", err); + goto _out; } /* * called from tapdisk context */ -int +void blktap_device_run_queue(struct blktap *tap) { - int err, rv; - struct request_queue *rq; - struct request *req; - struct blktap_ring *ring; - struct blktap_device *dev; - struct blktap_request *request; - - ring = &tap->ring; - dev = &tap->device; - rq = dev->gd->queue; + struct blktap_device *tapdev = &tap->device; + struct request_queue *q; + struct request *rq; + int err; - BTDBG("running queue for %d\n", tap->minor); - spin_lock_irq(&dev->lock); - queue_flag_clear(QUEUE_FLAG_STOPPED, rq); + if (!tapdev->gd) + return; - while ((req = elv_next_request(rq)) != NULL) { - if (!blk_fs_request(req)) { - end_request(req, 0); - continue; - } + q = tapdev->gd->queue; - if (blk_empty_barrier_rq(req)) { - end_request(req, 1); - continue; - } + spin_lock_irq(&tapdev->lock); + queue_flag_clear(QUEUE_FLAG_STOPPED, q); - if (RING_FULL(&ring->ring)) { - wait: - /* Avoid pointless unplugs. */ - blk_stop_queue(rq); + do { + rq = elv_next_request(q); + if (!rq) break; + + if (!blk_fs_request(rq)) { + end_queued_request(rq, 0); + continue; } - request = blktap_request_alloc(tap); - if (!request) { - tap->stats.st_oo_req++; - goto wait; + if (blk_empty_barrier(rq)) { + end_queued_request(rq, 1); + continue; } - BTDBG("req %p: dev %d cmd %p, sec 0x%llx, (0x%x/0x%lx) " - "buffer:%p [%s], pending: %p\n", req, tap->minor, - req->cmd, (unsigned long long)req->sector, - req->current_nr_sectors, - req->nr_sectors, req->buffer, - rq_data_dir(req) ? "write" : "read", request); + spin_unlock_irq(&tapdev->lock); - blkdev_dequeue_request(req); + err = blktap_device_make_request(tap, rq); - spin_unlock_irq(&dev->lock); + spin_lock_irq(&tapdev->lock); - err = blktap_device_process_request(tap, request, req); - if (err) { - blktap_device_end_dequeued_request(dev, req, -EIO); - blktap_request_free(tap, request); + if (err == -EBUSY) { + blk_stop_queue(q); + break; } - spin_lock_irq(&dev->lock); - } - - spin_unlock_irq(&dev->lock); - - rv = ring->ring.req_prod_pvt - - ring->ring.sring->req_prod; + blkdev_dequeue_request(req); - RING_PUSH_REQUESTS(&ring->ring); + if (unlikely(err)) + end_request(rq, 0); + } while (1); - return rv; + spin_unlock_irq(&tapdev->lock); } static void diff --git a/drivers/xen/blktap/request.c b/drivers/xen/blktap/request.c index ca12442..9bef48c 100644 --- a/drivers/xen/blktap/request.c +++ b/drivers/xen/blktap/request.c @@ -3,7 +3,6 @@ #include <linux/mutex.h> #include <linux/sched.h> #include <linux/device.h> -#include <xen/balloon.h> #include "blktap.h" @@ -129,6 +128,25 @@ blktap_request_free(struct blktap *tap, __page_pool_wake(tap->pool); } +void +blktap_request_bounce(struct blktap *tap, + struct blktap_request *request, + int seg, int write) +{ + struct scatterlist *sg = &request->sg_table[seg]; + void *s, *p; + + BUG_ON(seg >= request->nr_pages); + + s = sg_virt(sg); + p = page_address(request->pages[seg]) + sg->offset; + + if (write) + memcpy(p, s, sg->length); + else + memcpy(s, p, sg->length); +} + static void blktap_request_ctor(void *obj) { diff --git a/drivers/xen/blktap/ring.c b/drivers/xen/blktap/ring.c index a72a1b3..38896e7 100644 --- a/drivers/xen/blktap/ring.c +++ b/drivers/xen/blktap/ring.c @@ -1,30 +1,15 @@ + #include <linux/device.h> #include <linux/signal.h> #include <linux/sched.h> #include <linux/poll.h> - -#include <asm/xen/page.h> -#include <asm/xen/hypercall.h> +#include <linux/blkdev.h> #include "blktap.h" -#ifdef CONFIG_XEN_BLKDEV_BACKEND -#include "../blkback/blkback-pagemap.h" -#else -#define blkback_pagemap_contains_page(page) 0 -#endif - int blktap_ring_major; static struct cdev blktap_ring_cdev; -static inline struct blktap * -vma_to_blktap(struct vm_area_struct *vma) -{ - struct vm_foreign_map *m = vma->vm_private_data; - struct blktap_ring *r = container_of(m, struct blktap_ring, foreign_map); - return container_of(r, struct blktap, ring); -} - /* * BLKTAP - immediately before the mmap area, * we have a bunch of pages reserved for shared memory rings. @@ -47,7 +32,7 @@ blktap_ring_read_response(struct blktap *tap, goto invalid; } - request = tap->pending_requests[usr_idx]; + request = ring->pending[usr_idx]; if (!request) { err = -ESRCH; @@ -110,90 +95,15 @@ static int blktap_ring_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; } -static pte_t -blktap_ring_clear_pte(struct vm_area_struct *vma, - unsigned long uvaddr, - pte_t *ptep, int is_fullmm) -{ - pte_t copy; - struct blktap *tap; - unsigned long kvaddr; - struct page **map, *page; - struct blktap_ring *ring; - struct blktap_request *request; - struct grant_handle_pair *khandle; - struct gnttab_unmap_grant_ref unmap[2]; - int offset, seg, usr_idx, count = 0; - - tap = vma_to_blktap(vma); - ring = &tap->ring; - map = ring->foreign_map.map; - BUG_ON(!map); /* TODO Should this be changed to if statement? */ - - /* - * Zap entry if the address is before the start of the grant - * mapped region. - */ - if (uvaddr < ring->user_vstart) - return ptep_get_and_clear_full(vma->vm_mm, uvaddr, - ptep, is_fullmm); - - offset = (int)((uvaddr - ring->user_vstart) >> PAGE_SHIFT); - usr_idx = offset / BLKIF_MAX_SEGMENTS_PER_REQUEST; - seg = offset % BLKIF_MAX_SEGMENTS_PER_REQUEST; - - offset = (int)((uvaddr - vma->vm_start) >> PAGE_SHIFT); - page = map[offset]; - if (page && blkback_pagemap_contains_page(page)) - set_page_private(page, 0); - map[offset] = NULL; - - request = tap->pending_requests[usr_idx]; - kvaddr = request_to_kaddr(request, seg); - khandle = request->handles + seg; - - if (khandle->kernel != INVALID_GRANT_HANDLE) { - gnttab_set_unmap_op(&unmap[count], kvaddr, - GNTMAP_host_map, khandle->kernel); - count++; - - set_phys_to_machine(__pa(kvaddr) >> PAGE_SHIFT, - INVALID_P2M_ENTRY); - } - - if (khandle->user != INVALID_GRANT_HANDLE) { - BUG_ON(xen_feature(XENFEAT_auto_translated_physmap)); - - copy = *ptep; - gnttab_set_unmap_op(&unmap[count], virt_to_machine(ptep).maddr, - GNTMAP_host_map - | GNTMAP_application_map - | GNTMAP_contains_pte, - khandle->user); - count++; - } else - copy = ptep_get_and_clear_full(vma->vm_mm, uvaddr, ptep, - is_fullmm); - - if (count) - if (HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, - unmap, count)) - BUG(); - - khandle->kernel = INVALID_GRANT_HANDLE; - khandle->user = INVALID_GRANT_HANDLE; - - return copy; -} - static void blktap_ring_fail_pending(struct blktap *tap) { + struct blktap_ring *ring = &tap->ring; struct blktap_request *request; int usr_idx; for (usr_idx = 0; usr_idx < MAX_PENDING_REQS; usr_idx++) { - request = tap->pending_requests[usr_idx]; + request = ring->pending[usr_idx]; if (!request) continue; @@ -204,15 +114,12 @@ blktap_ring_fail_pending(struct blktap *tap) static void blktap_ring_vm_close(struct vm_area_struct *vma) { - struct blktap *tap = vma_to_blktap(vma); + struct blktap *tap = vma->vm_private_data; struct blktap_ring *ring = &tap->ring; struct page *page = virt_to_page(ring->ring.sring); blktap_ring_fail_pending(tap); - kfree(ring->foreign_map.map); - ring->foreign_map.map = NULL; - zap_page_range(vma, vma->vm_start, PAGE_SIZE, NULL); ClearPageReserved(page); __free_page(page); @@ -226,9 +133,154 @@ blktap_ring_vm_close(struct vm_area_struct *vma) static struct vm_operations_struct blktap_ring_vm_operations = { .close = blktap_ring_vm_close, .fault = blktap_ring_fault, - .zap_pte = blktap_ring_clear_pte, }; +int +blktap_ring_map_segment(struct blktap *tap, + struct blktap_request *request, + int seg) +{ + struct blktap_ring *ring = &tap->ring; + unsigned long uaddr; + + uaddr = MMAP_VADDR(ring->user_vstart, request->usr_idx, seg); + return vm_insert_page(ring->vma, uaddr, request->pages[seg]); +} + +int +blktap_ring_map_request(struct blktap *tap, + struct blktap_request *request) +{ + int seg, err = 0; + int write; + + write = request->operation == BLKIF_OP_WRITE; + + for (seg = 0; seg < request->nr_pages; seg++) { + if (write) + blktap_request_bounce(tap, request, seg, write); + + err = blktap_ring_map_segment(tap, request, seg); + if (err) + break; + } + + if (err) + blktap_ring_unmap_request(tap, request); + + return err; +} + +void +blktap_ring_unmap_request(struct blktap *tap, + struct blktap_request *request) +{ + struct blktap_ring *ring = &tap->ring; + unsigned long uaddr; + unsigned size; + int seg, read; + + uaddr = MMAP_VADDR(ring->user_vstart, request->usr_idx, 0); + size = request->nr_pages << PAGE_SHIFT; + read = request->operation == BLKIF_OP_READ; + + if (read) + for (seg = 0; seg < request->nr_pages; seg++) + blktap_request_bounce(tap, request, seg, !read); + + zap_page_range(ring->vma, uaddr, size, NULL); +} + +void +blktap_ring_free_request(struct blktap *tap, + struct blktap_request *request) +{ + struct blktap_ring *ring = &tap->ring; + + ring->pending[request->usr_idx] = NULL; + ring->n_pending--; + + blktap_request_free(tap, request); +} + +struct blktap_request* +blktap_ring_make_request(struct blktap *tap) +{ + struct blktap_ring *ring = &tap->ring; + struct blktap_request *request; + int usr_idx; + + if (RING_FULL(&ring->ring)) + return ERR_PTR(-ENOSPC); + + request = blktap_request_alloc(tap); + if (!request) + return ERR_PTR(-ENOMEM); + + for (usr_idx = 0; usr_idx < BLK_RING_SIZE; usr_idx++) + if (!ring->pending[usr_idx]) + break; + + BUG_ON(usr_idx >= BLK_RING_SIZE); + + request->tap = tap; + request->usr_idx = usr_idx; + + ring->pending[usr_idx] = request; + ring->n_pending++; + + return request; +} + +void +blktap_ring_submit_request(struct blktap *tap, + struct blktap_request *request) +{ + struct blktap_ring *ring = &tap->ring; + struct blkif_request *breq; + struct scatterlist *sg; + int i, nsecs = 0; + + dev_dbg(ring->dev, + "request %d [%p] submit\n", request->usr_idx, request); + + breq = RING_GET_REQUEST(&ring->ring, ring->ring.req_prod_pvt); + + breq->id = request->usr_idx; + breq->sector_number = request->rq->sector; + breq->handle = 0; + breq->operation = request->operation; + breq->nr_segments = request->nr_pages; + + blktap_for_each_sg(sg, request, i) { + struct blkif_request_segment *seg = &breq->seg[i]; + int first, count; + + count = sg->length >> 9; + first = sg->offset >> 9; + + seg->first_sect = first; + seg->last_sect = first + count - 1; + + nsecs += count; + } + + ring->ring.req_prod_pvt++; + + do_gettimeofday(&request->time); + + + if (request->operation == BLKIF_OP_WRITE) { + tap->stats.st_wr_sect += nsecs; + tap->stats.st_wr_req++; + } + + if (request->operation == BLKIF_OP_READ) { + tap->stats.st_rd_sect += nsecs; + tap->stats.st_rd_req++; + } +} + static int blktap_ring_open(struct inode *inode, struct file *filp) { @@ -270,51 +322,21 @@ blktap_ring_release(struct inode *inode, struct file *filp) return 0; } -/* Note on mmap: - * We need to map pages to user space in a way that will allow the block - * subsystem set up direct IO to them. This couldn''t be done before, because - * there isn''t really a sane way to translate a user virtual address down to a - * physical address when the page belongs to another domain. - * - * My first approach was to map the page in to kernel memory, add an entry - * for it in the physical frame list (using alloc_lomem_region as in blkback) - * and then attempt to map that page up to user space. This is disallowed - * by xen though, which realizes that we don''t really own the machine frame - * underlying the physical page. - * - * The new approach is to provide explicit support for this in xen linux. - * The VMA now has a flag, VM_FOREIGN, to indicate that it contains pages - * mapped from other vms. vma->vm_private_data is set up as a mapping - * from pages to actual page structs. There is a new clause in get_user_pages - * that does the right thing for this sort of mapping. - */ static int blktap_ring_mmap(struct file *filp, struct vm_area_struct *vma) { struct blktap *tap = filp->private_data; struct blktap_ring *ring = &tap->ring; struct blkif_sring *sring; - struct page *page; - int size, err; - struct page **map; - - map = NULL; - sring = NULL; + struct page *page = NULL; + int err; if (ring->vma) return -EBUSY; - size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; - if (size != (MMAP_PAGES + RING_PAGES)) { - BTERR("you _must_ map exactly %lu pages!\n", - MMAP_PAGES + RING_PAGES); - return -EAGAIN; - } - - /* allocate the shared ring */ page = alloc_page(GFP_KERNEL|__GFP_ZERO); if (!page) - goto fail; + return -ENOMEM; SetPageReserved(page); @@ -329,22 +351,12 @@ blktap_ring_mmap(struct file *filp, struct vm_area_struct *vma) ring->ring_vstart = vma->vm_start; ring->user_vstart = ring->ring_vstart + PAGE_SIZE; - /* allocate the foreign map */ - map = kzalloc(size * sizeof(struct page *), GFP_KERNEL); - if (!map) - goto fail; + vma->vm_private_data = tap; - /* Mark this VM as containing foreign pages, and set up mappings. */ - ring->foreign_map.map = map; - vma->vm_private_data = &ring->foreign_map; - vma->vm_flags |= VM_FOREIGN; vma->vm_flags |= VM_DONTCOPY; vma->vm_flags |= VM_RESERVED; - vma->vm_ops = &blktap_ring_vm_operations; -#ifdef CONFIG_X86 - vma->vm_mm->context.has_foreign_mappings = 1; -#endif + vma->vm_ops = &blktap_ring_vm_operations; ring->vma = vma; return 0; @@ -356,10 +368,7 @@ fail: __free_page(page); } - if (map) - kfree(map); - - return -ENOMEM; + return err; } static int @@ -405,16 +414,19 @@ static unsigned int blktap_ring_poll(struct file *filp, poll_table *wait) { struct blktap *tap = filp->private_data; struct blktap_ring *ring = &tap->ring; - int work = 0; + int work; poll_wait(filp, &tap->pool->wait, wait); poll_wait(filp, &ring->poll_wait, wait); down_read(¤t->mm->mmap_sem); if (ring->vma && tap->device.gd) - work = blktap_device_run_queue(tap); + blktap_device_run_queue(tap); up_read(¤t->mm->mmap_sem); + work = ring->ring.req_prod_pvt - ring->ring.sring->req_prod; + RING_PUSH_REQUESTS(&ring->ring); + if (work || ring->ring.sring->private.tapif_user.msg || test_and_clear_bit(BLKTAP_DEVICE_CLOSED, &tap->dev_inuse)) @@ -463,18 +475,19 @@ blktap_ring_create(struct blktap *tap) size_t blktap_ring_debug(struct blktap *tap, char *buf, size_t size) { + struct blktap_ring *ring = &tap->ring; char *s = buf, *end = buf + size; int usr_idx; s += snprintf(s, end - s, - "begin pending:%d\n", tap->pending_cnt); + "begin pending:%d\n", ring->n_pending); for (usr_idx = 0; usr_idx < MAX_PENDING_REQS; usr_idx++) { struct blktap_request *request; struct timeval *time; int write; - request = tap->pending_requests[usr_idx]; + request = ring->pending[usr_idx]; if (!request) continue; -- 1.7.0.4 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-12 23:31 UTC
[Xen-devel] [PATCH 3/5] blktap: Add queue access macros.
Reduce request queue access in different contexts to a small set of macros. Also simplifies transitions between different kernel versions. Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> --- drivers/xen/blktap/device.c | 54 +++++++++++++++++++++++++++++++++++-------- 1 files changed, 44 insertions(+), 10 deletions(-) diff --git a/drivers/xen/blktap/device.c b/drivers/xen/blktap/device.c index 02e1fc8..4c4d682 100644 --- a/drivers/xen/blktap/device.c +++ b/drivers/xen/blktap/device.c @@ -103,6 +103,42 @@ static struct block_device_operations blktap_device_file_operations = { .getgeo = blktap_device_getgeo }; +/* NB. __blktap holding the queue lock; blktap where unlocked */ + +static inline struct request* +__blktap_next_queued_rq(struct request_queue *q) +{ + return elv_next_request(q); +} + +static inline void +__blktap_dequeue_rq(struct request *rq) +{ + blkdev_dequeue_request(rq); +} + +/* NB. err == 0 indicates success, failures < 0 */ + +static inline void +__blktap_end_queued_rq(struct request *rq, int err) +{ + __blk_end_request(rq, err, blk_rq_bytes(rq)); +} + +static inline void +__blktap_end_rq(struct request *rq, int err) +{ + __blk_end_request(rq, err, blk_rq_bytes(rq)); +} + +static inline void +blktap_end_rq(struct request *rq, int err) +{ + spin_lock_irq(rq->q->queue_lock); + __blktap_end_rq(rq, err); + spin_unlock_irq(rq->q->queue_lock); +} + void blktap_device_end_request(struct blktap *tap, struct blktap_request *request, @@ -119,9 +155,7 @@ blktap_device_end_request(struct blktap *tap, "end_request: op=%d error=%d bytes=%d\n", rq_data_dir(rq), error, blk_rq_bytes(rq)); - spin_lock_irq(&tapdev->lock); - end_request(request->rq, !error); - spin_unlock_irq(&tapdev->lock); + blktap_end_rq(rq, error); } int @@ -201,17 +235,17 @@ blktap_device_run_queue(struct blktap *tap) queue_flag_clear(QUEUE_FLAG_STOPPED, q); do { - rq = elv_next_request(q); + rq = __blktap_next_queued_rq(q); if (!rq) break; if (!blk_fs_request(rq)) { - end_queued_request(rq, 0); + __blktap_end_queued_rq(rq, -EOPNOTSUPP); continue; } if (blk_empty_barrier(rq)) { - end_queued_request(rq, 1); + __blktap_end_queued_rq(rq, 0); continue; } @@ -226,10 +260,10 @@ blktap_device_run_queue(struct blktap *tap) break; } - blkdev_dequeue_request(req); + __blktap_dequeue_rq(rq); if (unlikely(err)) - end_request(rq, 0); + __blktap_end_rq(rq, err); } while (1); spin_unlock_irq(&tapdev->lock); @@ -365,11 +399,11 @@ blktap_device_fail_queue(struct blktap *tap) queue_flag_clear(QUEUE_FLAG_STOPPED, q); do { - struct request *rq = elv_next_request(q); + struct request *rq = __blktap_next_queued_rq(q); if (!rq) break; - end_request(rq, -EIO); + __blktap_end_queued_rq(rq, -EIO); } while (1); spin_unlock_irq(&tapdev->lock); -- 1.7.0.4 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-12 23:31 UTC
[Xen-devel] [PATCH 4/5] blktap: Forward port to 2.6.32
Some dust off the topic branch. Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> --- drivers/xen/blktap/device.c | 22 +++++++++------------- drivers/xen/blktap/ring.c | 2 +- 2 files changed, 10 insertions(+), 14 deletions(-) diff --git a/drivers/xen/blktap/device.c b/drivers/xen/blktap/device.c index 4c4d682..c94a0e9 100644 --- a/drivers/xen/blktap/device.c +++ b/drivers/xen/blktap/device.c @@ -108,13 +108,13 @@ static struct block_device_operations blktap_device_file_operations = { static inline struct request* __blktap_next_queued_rq(struct request_queue *q) { - return elv_next_request(q); + return blk_peek_request(q); } static inline void __blktap_dequeue_rq(struct request *rq) { - blkdev_dequeue_request(rq); + blk_start_request(rq); } /* NB. err == 0 indicates success, failures < 0 */ @@ -122,6 +122,7 @@ __blktap_dequeue_rq(struct request *rq) static inline void __blktap_end_queued_rq(struct request *rq, int err) { + blk_start_request(rq); __blk_end_request(rq, err, blk_rq_bytes(rq)); } @@ -151,7 +152,7 @@ blktap_device_end_request(struct blktap *tap, blktap_ring_free_request(tap, request); - dev_dbg(&tapdev->gd->dev, + dev_dbg(disk_to_dev(tapdev->gd), "end_request: op=%d error=%d bytes=%d\n", rq_data_dir(rq), error, blk_rq_bytes(rq)); @@ -180,7 +181,7 @@ blktap_device_make_request(struct blktap *tap, struct request *rq) write = rq_data_dir(rq) == WRITE; nsegs = blk_rq_map_sg(rq->q, rq, request->sg_table); - dev_dbg(&tapdev->gd->dev, + dev_dbg(disk_to_dev(tapdev->gd), "make_request: op=%c bytes=%d nsegs=%d\n", write ? ''w'' : ''r'', blk_rq_bytes(rq), nsegs); @@ -210,7 +211,7 @@ _out: return err; fail: if (printk_ratelimit()) - dev_warn(&tapdev->gd->dev, + dev_warn(disk_to_dev(tapdev->gd), "make request: %d, failing\n", err); goto _out; } @@ -244,11 +245,6 @@ blktap_device_run_queue(struct blktap *tap) continue; } - if (blk_empty_barrier(rq)) { - __blktap_end_queued_rq(rq, 0); - continue; - } - spin_unlock_irq(&tapdev->lock); err = blktap_device_make_request(tap, rq); @@ -491,8 +487,8 @@ blktap_device_create(struct blktap *tap, struct blktap_params *params) set_bit(BLKTAP_DEVICE, &tap->dev_inuse); - dev_info(&gd->dev, "sector-size: %u capacity: %llu\n", - rq->hardsect_size, get_capacity(gd)); + dev_info(disk_to_dev(gd), "sector-size: %u capacity: %llu\n", + queue_logical_block_size(rq), get_capacity(gd)); return 0; @@ -520,7 +516,7 @@ blktap_device_debug(struct blktap *tap, char *buf, size_t size) s += snprintf(s, end - s, "disk capacity:%llu sector size:%u\n", - get_capacity(disk), queue_hardsect_size(q)); + get_capacity(disk), queue_logical_block_size(q)); s += snprintf(s, end - s, "queue flags:%#lx plugged:%d stopped:%d empty:%d\n", diff --git a/drivers/xen/blktap/ring.c b/drivers/xen/blktap/ring.c index 38896e7..6b86be5 100644 --- a/drivers/xen/blktap/ring.c +++ b/drivers/xen/blktap/ring.c @@ -247,7 +247,7 @@ blktap_ring_submit_request(struct blktap *tap, breq = RING_GET_REQUEST(&ring->ring, ring->ring.req_prod_pvt); breq->id = request->usr_idx; - breq->sector_number = request->rq->sector; + breq->sector_number = blk_rq_pos(request->rq); breq->handle = 0; breq->operation = request->operation; breq->nr_segments = request->nr_pages; -- 1.7.0.4 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-12 23:31 UTC
[Xen-devel] [PATCH 5/5] Fix compilation format warning in drivers/xen/blktap/device.c
From: Giam Teck Choon <giamteckchoon@gmail.com> drivers/xen/blktap/device.c: In function ‘blktap_device_create’: drivers/xen/blktap/device.c:869: warning: format ‘%llu’ expects type ‘long long unsigned int’, but argument 5 has type ‘sector_t’ drivers/xen/blktap/device.c: In function ‘blktap_device_debug’: drivers/xen/blktap/device.c:898: warning: format ‘%llu’ expects type ‘long long unsigned int’, but argument 4 has type ‘sector_t’ Signed-off-by: Giam Teck Choon <giamteckchoon@gmail.com> Acked-by: Daniel Stodden <daniel.stodden@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> --- drivers/xen/blktap/device.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/xen/blktap/device.c b/drivers/xen/blktap/device.c index c94a0e9..fce2769 100644 --- a/drivers/xen/blktap/device.c +++ b/drivers/xen/blktap/device.c @@ -488,7 +488,8 @@ blktap_device_create(struct blktap *tap, struct blktap_params *params) set_bit(BLKTAP_DEVICE, &tap->dev_inuse); dev_info(disk_to_dev(gd), "sector-size: %u capacity: %llu\n", - queue_logical_block_size(rq), get_capacity(gd)); + queue_logical_block_size(rq), + (unsigned long long)get_capacity(gd)); return 0; @@ -516,7 +517,8 @@ blktap_device_debug(struct blktap *tap, char *buf, size_t size) s += snprintf(s, end - s, "disk capacity:%llu sector size:%u\n", - get_capacity(disk), queue_logical_block_size(q)); + (unsigned long long)get_capacity(disk), + queue_logical_block_size(q)); s += snprintf(s, end - s, "queue flags:%#lx plugged:%d stopped:%d empty:%d\n", -- 1.7.0.4 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-13 00:50 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/12/2010 03:31 PM, Daniel Stodden wrote:> It''s fairly a big change in how I/O buffers are managed. Prior to this > series, we had zero-copy I/O down to userspace. Unfortunately, blktap2 > always had to jump through a couple of extra loops to do so. Present > state of that is that we dropped that, so all tapdev I/O is bounced > to/from a bunch of normal pages. Essentially replacing the old VMA > management with a couple insert/zap VM calls.Do you have any performance results comparing the two approaches?> One issue was that the kernel can''t cope with recursive > I/O. Submitting an iovec on a tapdev, passing it to userspace and then > reissuing the same vector via AIO apparently doesn''t fit well with the > lock protocol applied to those pages. This is the main reason why > blktap had to deal a lot with grant refs. About as much as blkback > already does before passing requests on. What happens there is that > it''s aliasing those granted pages under a different PFN, thereby in a > separate page struct. Not pretty, but it worked, so it''s not the > reason why we chose to drop that at some point. > > The more prevalent problem was network storage, especially anything > involving TCP. That includes VHD on both NFS and iSCSI. The problem > with those is that retransmits (by the transport) and I/O op > completion (on the application layer) are never synchronized. With > sufficiently bad timing and bit of jitter on the network, it''s > perfectly common for the kernel to complete an AIO request with a late > ack on the input queue just when retransmission timer is about to fire > underneath. The completion will unmap the granted frame, crashing any > uncanceled retransmission on an empty page frame. There are different > ways to deal with that. Page destructors might be one way, but as far > as I heard they are not particularly popular upstream. Issuing the > block I/O on dom0 memory is straightforward and avoids the hassle. One > could go argue that retransmits after DIO completion are still a > potential privacy problem (I did), but it''s not Xen''s problem after > all.Surely this can be dealt with by replacing the mapped granted page with a local copy if the refcount is elevated? Then that can catch any stray residual references while we can still return the granted page to its owner. And obviously, not reuse that pfn for grants until the refcount is zero...> If zero-copy becomes more attractive again, the plan would be to > rather use grantdev in userspace, such as a filter driver for tapdisk > instead. Until then, there''s presumably a notable difference in L2 > cache footprint. Then again, there''s also a whole number of cycles not > spent in redundant hypercalls now, to deal with the pseudophysical > map.Frankly, I think the idea of putting blkback+tapdisk entirely in usermode is all upside with no (obvious) downsides. It: 1. avoids having to upstream anything 2. avoids having to upstream anything 3. avoids having to upstream anything 4. gets us back zero-copy (if that''s important) 5. makes the IO path nice and straightforward 6. seems to address all the other problems you mentioned The only caveat is the stray unmapping problem, but I think gntdev can be modified to deal with that pretty easily. qemu has usermode blkback support already, and an actively improving block-IO infrastructure, so one approach might be to consider putting (parts of) tapdisk into qemu - and makes it pretty natural to reuse it with non-Xen guests via virtio-block, emulated devices, etc. But I''m not sold on that; having a standalone tapdisk w/ blkback makes sense to me as well. On the other hand, I don''t think we''re going to be able to get away with putting netback in usermode, so we still need to deal with that - but I think an all-copying version will be fine to get started with at least.> There are also benefits or non-issues. > > - This blktap is rather xen-independent. Certainly depends on the > common ring macros, but lacking grant stuff it compiles on bare > metal Linux with no CONFIG_XEN. Not consummated here, because > that''s going to move the source tree out of drivers/xen. But I''d > like to post a new branch proposing to do so. > > - Blktaps size in dom0 didn''t really change. Frames (now pages) were > always pooled. We used to balloon memory to claim space for > redundant grant mappings. Now we reserve, by default, the same > volume in normal memory. > > - The previous code would runs all I/O on a single pool. Typically > two rings worth of requests. Sufficient for a whole lot of systems, > especially with single storage backends, but not so nice when I/O > on a number of otherwise independent filers or volumes collides. > > Pools are refcounted kobjects in sysfs. Toolstacks using the new > code can thereby choose to elimitate bottlenecks by grouping taps > on different buffer pools. Pools can also be resized, to accomodate > greater queue depths. [Note that blkback still has the same issue, > so guests won''t take advantage of that before that''s resolved as > well.] > > - XCP started to make some use of stacking tapdevs. Think pointing > the image chain of a bunch of "leaf" taps to a shared parent > node. That works fairly well, but definitely takes independent > resource pools to avoid deadlock by parent starvation then. > > Please pull upstream/xen/dom0/backend/blktap2 from > git://xenbits.xensource.com/people/dstodden/linux.gitOK, I''ve pulled it, but I haven''t had a chance to test it yet. Thanks, J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-13 03:56 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
Hey. On Fri, 2010-11-12 at 19:50 -0500, Jeremy Fitzhardinge wrote:> On 11/12/2010 03:31 PM, Daniel Stodden wrote: > > It''s fairly a big change in how I/O buffers are managed. Prior to this > > series, we had zero-copy I/O down to userspace. Unfortunately, blktap2 > > always had to jump through a couple of extra loops to do so. Present > > state of that is that we dropped that, so all tapdev I/O is bounced > > to/from a bunch of normal pages. Essentially replacing the old VMA > > management with a couple insert/zap VM calls. > > Do you have any performance results comparing the two approaches?No. One could probably go try large ramdisks or an AIO backend in tmpfs. All the storage I''m concerned with here terminates either on the NIC or a local spindle. That''s hard to thwart in cache bandwidth.> > One issue was that the kernel can''t cope with recursive > > I/O. Submitting an iovec on a tapdev, passing it to userspace and then > > reissuing the same vector via AIO apparently doesn''t fit well with the > > lock protocol applied to those pages. This is the main reason why > > blktap had to deal a lot with grant refs. About as much as blkback > > already does before passing requests on. What happens there is that > > it''s aliasing those granted pages under a different PFN, thereby in a > > separate page struct. Not pretty, but it worked, so it''s not the > > reason why we chose to drop that at some point. > > > > The more prevalent problem was network storage, especially anything > > involving TCP. That includes VHD on both NFS and iSCSI. The problem > > with those is that retransmits (by the transport) and I/O op > > completion (on the application layer) are never synchronized. With > > sufficiently bad timing and bit of jitter on the network, it''s > > perfectly common for the kernel to complete an AIO request with a late > > ack on the input queue just when retransmission timer is about to fire > > underneath. The completion will unmap the granted frame, crashing any > > uncanceled retransmission on an empty page frame. There are different > > ways to deal with that. Page destructors might be one way, but as far > > as I heard they are not particularly popular upstream. Issuing the > > block I/O on dom0 memory is straightforward and avoids the hassle. One > > could go argue that retransmits after DIO completion are still a > > potential privacy problem (I did), but it''s not Xen''s problem after > > all. > > Surely this can be dealt with by replacing the mapped granted page with > a local copy if the refcount is elevated?Yeah. We briefly discussed this when the problem started to pop up (again). I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with a dummy page mapped in. You wouldn''t need a copy, a R/O zero map easily does the job. On UP that''d be just a matter of disabling interrupts for a while. I dropped it after it became clear that XS was moving to SMP, where one would end up with a full barrier to orchestrate the TLB flushes everywhere. Now, the skb runs prone to crash all run in softirq context, I wouldn''t exactly expect a huge performance win from syncing on that kind of thing across all nodes, compared to local memcpy. Nor would I want storage stuff to touch locks shared with TCP, that''s just not our business. Correct me if I''m mistaken. I''d like to see maybe stuff like node affinity on NUMA getting a bit more work. I think the patch presently just fills the queue node in, but that didn''t see much testing, and one would have to correlate that.> Then that can catch any stray > residual references while we can still return the granted page to its > owner. And obviously, not reuse that pfn for grants until the refcount > is zero...> > If zero-copy becomes more attractive again, the plan would be to > > rather use grantdev in userspace, such as a filter driver for tapdisk > > instead. Until then, there''s presumably a notable difference in L2 > > cache footprint. Then again, there''s also a whole number of cycles not > > spent in redundant hypercalls now, to deal with the pseudophysical > > map. > > Frankly, I think the idea of putting blkback+tapdisk entirely in > usermode is all upside with no (obvious) downsides. It: > > 1. avoids having to upstream anything > 2. avoids having to upstream anything > 3. avoids having to upstream anything > > 4. gets us back zero-copy (if that''s important)(No, unfortunately. DIO on a granted frame under blktap would be as vulnerable as DIO on a granted frame under a userland blkback, userland won''t buy us anthing as far as the zcopy side of things is concerned).> 5. makes the IO path nice and straightforward > 6. seems to address all the other problems you mentionedI''m not at all against a userland blkback. Just wouldn''t go as far as considering this a silver bullet. The main thing I''m scared of is ending up hacking cheesy stuff into the user ABI to take advantage of things immediately available to FSes on the bio layer, but harder (or at least slower) to get made available to userland. DISCARD support is one currently hot example, do you see that in AIO somewhere? Ok, probably a good thing for everybody anyway, so maybe patching that is useful work. But it''s extra work right now and probably no more fun to maintain than blktap is. The second issue I see is the XCP side of things. XenServer got a lot of benefit out of blktap2, and particularly because of the tapdevs. It promotes a fairly rigorous split between a blkback VBD, controlled by the agent, and tapdevs, controlled by XS''s storage manager. That doesn''t prevent blkback to go into userspace, but it better won''t share a process with some libblktap, which in turn would better not be controlled under the same xenstore path. So for XCP it''d be AIO on tapdevs for the time being, and with that whatever the syscall interface lets you do.> The only caveat is the stray unmapping problem, but I think gntdev can > be modified to deal with that pretty easily.Not easier than anything else in kernel space, but when dealing only with the refcounts, that''s as as good a place as anwhere else, yes.> qemu has usermode blkback support already, and an actively improving > block-IO infrastructure, so one approach might be to consider putting > (parts of) tapdisk into qemu - and makes it pretty natural to reuse it > with non-Xen guests via virtio-block, emulated devices, etc. But I''m > not sold on that; having a standalone tapdisk w/ blkback makes sense to > me as well. > > On the other hand, I don''t think we''re going to be able to get away with > putting netback in usermode, so we still need to deal with that - but I > think an all-copying version will be fine to get started with at least.> > Please pull upstream/xen/dom0/backend/blktap2 from > > git://xenbits.xensource.com/people/dstodden/linux.git > > OK, I''ve pulled it, but I haven''t had a chance to test it yet.Thanks. Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-15 18:27 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/12/2010 07:55 PM, Daniel Stodden wrote:>> Surely this can be dealt with by replacing the mapped granted page with >> a local copy if the refcount is elevated? > Yeah. We briefly discussed this when the problem started to pop up > (again). > > I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with > a dummy page mapped in. You wouldn''t need a copy, a R/O zero map easily > does the job.Hm, I''d be a bit concerned that that might cause problems if used generically. If the page is being used RO, then replacing with a copy shouldn''t be a problem, but getting a consistent snapshot of a mutable page is obviously going to be a problem.> On UP that''d be just a matter of disabling interrupts for > a while.Disable for what purpose? You mean to do an exchange of the mapping, or something else?> I dropped it after it became clear that XS was moving to SMP, where one > would end up with a full barrier to orchestrate the TLB flushes > everywhere. Now, the skb runs prone to crash all run in softirq context, > I wouldn''t exactly expect a huge performance win from syncing on that > kind of thing across all nodes, compared to local memcpy. Nor would I > want storage stuff to touch locks shared with TCP, that''s just not our > business. Correct me if I''m mistaken.I don''t follow what your high-level concern is here. If we update the pte to unmap the granted page, then its up to Xen to arrange for any TLB flushes to make sure the page is no longer accessible to the domain. We don''t need to do anything explicit, and its independent of whether we''re simply unmapping the page or replacing the mapping with another one (ie, any TLB flushing necessary is already going on). I was also under the impression that this is a relatively rare event rather than something that''s going to be necessary for every page, so the overhead should be minimal.>>> If zero-copy becomes more attractive again, the plan would be to >>> rather use grantdev in userspace, such as a filter driver for tapdisk >>> instead. Until then, there''s presumably a notable difference in L2 >>> cache footprint. Then again, there''s also a whole number of cycles not >>> spent in redundant hypercalls now, to deal with the pseudophysical >>> map. >> Frankly, I think the idea of putting blkback+tapdisk entirely in >> usermode is all upside with no (obvious) downsides. It: >> >> 1. avoids having to upstream anything >> 2. avoids having to upstream anything >> 3. avoids having to upstream anything >> >> 4. gets us back zero-copy (if that''s important) > (No, unfortunately. DIO on a granted frame under blktap would be as > vulnerable as DIO on a granted frame under a userland blkback, userland > won''t buy us anthing as far as the zcopy side of things is concerned).Why''s that? Are you talking about the stray page reference vulnerability, or something else? You''re right that it doesn''t really help with stray references because its still kernel code which ends up doing the dereference rather than usermode.>> 5. makes the IO path nice and straightforward >> 6. seems to address all the other problems you mentioned > I''m not at all against a userland blkback. Just wouldn''t go as far as > considering this a silver bullet. > > The main thing I''m scared of is ending up hacking cheesy stuff into the > user ABI to take advantage of things immediately available to FSes on > the bio layer, but harder (or at least slower) to get made available to > userland.But that hasn''t been a problem for tapdisk so far. Given that it is using DIO then any completed write has at least been submitted to a storage device, so then its just a question of how to make sure that any buffers are fully flushed, which would be fdatasync() I guess. Besides, our requirements are hardly unique; if there''s some clear need for a new API, then the course of action is to work with broader Linux community to work out what it should be and how it should be implemented. There''s no need for "hacking cheesy stuff" - there''s been enough of that already.> DISCARD support is one currently hot example, do you see that in AIO > somewhere? Ok, probably a good thing for everybody anyway, so maybe > patching that is useful work. But it''s extra work right now and probably > no more fun to maintain than blktap is.fallocate() is being extended to allow hole-punching in files. I don''t know what work is being done to do discard on random block devices, but that''s clearly a generically useful thing to have.> The second issue I see is the XCP side of things. XenServer got a lot of > benefit out of blktap2, and particularly because of the tapdevs. It > promotes a fairly rigorous split between a blkback VBD, controlled by > the agent, and tapdevs, controlled by XS''s storage manager. > > That doesn''t prevent blkback to go into userspace, but it better won''t > share a process with some libblktap, which in turn would better not be > controlled under the same xenstore path.Could you elaborate on this? What was the benefit?>> The only caveat is the stray unmapping problem, but I think gntdev can >> be modified to deal with that pretty easily. > Not easier than anything else in kernel space, but when dealing only > with the refcounts, that''s as as good a place as anwhere else, yes.I think the refcount test is pretty straightforward - if the refcount is 1, then we''re the sole owner of the page and we don''t need to worry about any other users. If its > 1, then somebody else has it, and we need to make sure it no longer refers to a granted page (which is just a matter of doing a set_pte_atomic() to remap from present to present). Then we''d have a set of frames whose lifetimes are being determined by some other subsystem. We can either maintain a list of them and poll waiting for them to become free, or just release them and let them be managed by the normal kernel lifetime rules (which requires that the memory attached to them be completely normal, of course). J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2010-Nov-15 19:19 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Mon, 2010-11-15 at 18:27 +0000, Jeremy Fitzhardinge wrote:> Then we''d have a set of frames whose lifetimes are being determined by > some other subsystem. We can either maintain a list of them and poll > waiting for them to become free, or just release them and let them be > managed by the normal kernel lifetime rules (which requires that the > memory attached to them be completely normal, of course).GNTTABOP_unmap_and_replace is probably the answer to this? This is what netback uses (via gnttab_copy_grant_page) when it wants to copy and skb which has remained in flight for too long. Maybe this doesn''t work since the memcpy and replace are not atomic and we don''t have a lock to use, since we don''t know which subsystem holds the reference. However it should work if we don''t really care about the contents of the replacement too much, which I don''t think we do in this case since the data could just as easily get clobbered by the userspace process which thinks the write has completed fully, in which case we just replace the grant mapping without bothering with the copy. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-15 19:34 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/15/2010 11:19 AM, Ian Campbell wrote:> On Mon, 2010-11-15 at 18:27 +0000, Jeremy Fitzhardinge wrote: >> Then we''d have a set of frames whose lifetimes are being determined by >> some other subsystem. We can either maintain a list of them and poll >> waiting for them to become free, or just release them and let them be >> managed by the normal kernel lifetime rules (which requires that the >> memory attached to them be completely normal, of course). > GNTTABOP_unmap_and_replace is probably the answer to this? This is what > netback uses (via gnttab_copy_grant_page) when it wants to copy and skb > which has remained in flight for too long. > > Maybe this doesn''t work since the memcpy and replace are not atomic and > we don''t have a lock to use, since we don''t know which subsystem holds > the reference. However it should work if we don''t really care about the > contents of the replacement too much, which I don''t think we do in this > case since the data could just as easily get clobbered by the userspace > process which thinks the write has completed fully, in which case we > just replace the grant mapping without bothering with the copy.Well, if atomicity is important (and it is a relatively rare event), I dare say Xen could de-schedule all the VCPUs a lot more efficiently than we could run stop_machine() from within the kernel... J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2010-Nov-15 20:07 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Mon, 2010-11-15 at 19:34 +0000, Jeremy Fitzhardinge wrote:> On 11/15/2010 11:19 AM, Ian Campbell wrote: > > On Mon, 2010-11-15 at 18:27 +0000, Jeremy Fitzhardinge wrote: > >> Then we''d have a set of frames whose lifetimes are being determined by > >> some other subsystem. We can either maintain a list of them and poll > >> waiting for them to become free, or just release them and let them be > >> managed by the normal kernel lifetime rules (which requires that the > >> memory attached to them be completely normal, of course). > > GNTTABOP_unmap_and_replace is probably the answer to this? This is what > > netback uses (via gnttab_copy_grant_page) when it wants to copy and skb > > which has remained in flight for too long. > > > > Maybe this doesn''t work since the memcpy and replace are not atomic and > > we don''t have a lock to use, since we don''t know which subsystem holds > > the reference. However it should work if we don''t really care about the > > contents of the replacement too much, which I don''t think we do in this > > case since the data could just as easily get clobbered by the userspace > > process which thinks the write has completed fully, in which case we > > just replace the grant mapping without bothering with the copy. > > Well, if atomicity is important (and it is a relatively rare event), I > dare say Xen could de-schedule all the VCPUs a lot more efficiently than > we could run stop_machine() from within the kernel...Absolutely. GNTTABOP_unmap_and_replace is itself atomic according to the comment in the header, it''s not clear what it is atomic WRT, presumably just other gnttab or mmu ops. In any case it doesn''t handle synchronising the content of the old and new pages so if we care about that we would need a new GNTAPOP_copy_unmap_and_replace or something like that. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-16 00:43 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Mon, 2010-11-15 at 15:07 -0500, Ian Campbell wrote:> On Mon, 2010-11-15 at 19:34 +0000, Jeremy Fitzhardinge wrote: > > On 11/15/2010 11:19 AM, Ian Campbell wrote: > > > On Mon, 2010-11-15 at 18:27 +0000, Jeremy Fitzhardinge wrote: > > >> Then we''d have a set of frames whose lifetimes are being determined by > > >> some other subsystem. We can either maintain a list of them and poll > > >> waiting for them to become free, or just release them and let them be > > >> managed by the normal kernel lifetime rules (which requires that the > > >> memory attached to them be completely normal, of course). > > > GNTTABOP_unmap_and_replace is probably the answer to this? This is what > > > netback uses (via gnttab_copy_grant_page) when it wants to copy and skb > > > which has remained in flight for too long. > > > > > > Maybe this doesn''t work since the memcpy and replace are not atomic and > > > we don''t have a lock to use, since we don''t know which subsystem holds > > > the reference. However it should work if we don''t really care about the > > > contents of the replacement too much, which I don''t think we do in this > > > case since the data could just as easily get clobbered by the userspace > > > process which thinks the write has completed fully, in which case we > > > just replace the grant mapping without bothering with the copy. > > > > Well, if atomicity is important (and it is a relatively rare event), I > > dare say Xen could de-schedule all the VCPUs a lot more efficiently than > > we could run stop_machine() from within the kernel... > > Absolutely. > > GNTTABOP_unmap_and_replace is itself atomic according to the comment in > the header, it''s not clear what it is atomic WRT, presumably just other > gnttab or mmu ops. In any case it doesn''t handle synchronising the > content of the old and new pages so if we care about that we would need > a new GNTAPOP_copy_unmap_and_replace or something like that.Ah, nice. First time I hear we have a dedicated call for that. So far I was assuming dom0 would have to unmap/remap and doing so atomically across cores would be left as an exercise to dom0. One would need a copy for protocols where it isn''t clear that the dup gets dropped on the receiving side. So I guess the main hazard would be writes on sth like NFS over UDP. In that case, I think blktap has a pretty safe spot to copy data between I/O completion from userland and replacing the page, without extra atomicity. Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-16 09:13 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:> On 11/12/2010 07:55 PM, Daniel Stodden wrote: > >> Surely this can be dealt with by replacing the mapped granted page with > >> a local copy if the refcount is elevated? > > Yeah. We briefly discussed this when the problem started to pop up > > (again). > > > > I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with > > a dummy page mapped in. You wouldn''t need a copy, a R/O zero map easily > > does the job. > > Hm, I''d be a bit concerned that that might cause problems if used > generically.Yeah. It wasn''t a problem because all the network backends are on TCP, where one can be rather sure that the dups are going to be properly dropped. Does this hold everywhere ..? -- As mentioned below, the problem is rather in AIO/DIO than being Xen-specific, so you can see the same behavior on bare metal kernels too. A userspace app seeing an AIO complete and then reusing that buffer elsewhere will occassionally resend garbage over the network. How often that would happen in practice probably depends on the popularity of DIO on NFS for normal apps. Not too many, so yeah, very generically one would maybe want to memcpy().> If the page is being used RO, then replacing with a copy > shouldn''t be a problem, but getting a consistent snapshot of a mutable > page is obviously going to be a problem.I think it''s safe to assume the problem is limited to writes, so the buffers aren''t really mutable.> > On UP that''d be just a matter of disabling interrupts for > > a while. > > Disable for what purpose? You mean to do an exchange of the mapping, or > something else?[I was referring to the prospect of doing a gref unmap/map sequence in dom0. Since we do have a swap operation in Xen, just asking flipping the PTE entry and flushing across all vcpus sounds sufficient to me, indeed.]> > I dropped it after it became clear that XS was moving to SMP, where one > > would end up with a full barrier to orchestrate the TLB flushes > > everywhere. Now, the skb runs prone to crash all run in softirq context, > > I wouldn''t exactly expect a huge performance win from syncing on that > > kind of thing across all nodes, compared to local memcpy. Nor would I > > want storage stuff to touch locks shared with TCP, that''s just not our > > business. Correct me if I''m mistaken. > > I don''t follow what your high-level concern is here. If we update the > pte to unmap the granted page, then its up to Xen to arrange for any TLB > flushes to make sure the page is no longer accessible to the domain. We > don''t need to do anything explicit, and its independent of whether we''re > simply unmapping the page or replacing the mapping with another one (ie, > any TLB flushing necessary is already going on). > > I was also under the impression that this is a relatively rare event > rather than something that''s going to be necessary for every page, so > the overhead should be minimal. > > >>> If zero-copy becomes more attractive again, the plan would be to > >>> rather use grantdev in userspace, such as a filter driver for tapdisk > >>> instead. Until then, there''s presumably a notable difference in L2 > >>> cache footprint. Then again, there''s also a whole number of cycles not > >>> spent in redundant hypercalls now, to deal with the pseudophysical > >>> map. > >> Frankly, I think the idea of putting blkback+tapdisk entirely in > >> usermode is all upside with no (obvious) downsides. It: > >> > >> 1. avoids having to upstream anything > >> 2. avoids having to upstream anything > >> 3. avoids having to upstream anything > >> > >> 4. gets us back zero-copy (if that''s important) > > (No, unfortunately. DIO on a granted frame under blktap would be as > > vulnerable as DIO on a granted frame under a userland blkback, userland > > won''t buy us anthing as far as the zcopy side of things is concerned). > > Why''s that? Are you talking about the stray page reference > vulnerability, or something else? You''re right that it doesn''t really > help with stray references because its still kernel code which ends up > doing the dereference rather than usermode.[Still was talking about the stray reference issues].> >> 5. makes the IO path nice and straightforward > >> 6. seems to address all the other problems you mentioned > > I''m not at all against a userland blkback. Just wouldn''t go as far as > > considering this a silver bullet. > > > > The main thing I''m scared of is ending up hacking cheesy stuff into the > > user ABI to take advantage of things immediately available to FSes on > > the bio layer, but harder (or at least slower) to get made available to > > userland. > > But that hasn''t been a problem for tapdisk so far. Given that it is > using DIO then any completed write has at least been submitted to a > storage device, so then its just a question of how to make sure that any > buffers are fully flushed, which would be fdatasync() I guess.> Besides, our requirements are hardly unique; if there''s some clear need > for a new API, then the course of action is to work with broader Linux > community to work out what it should be and how it should be > implemented. There''s no need for "hacking cheesy stuff" - there''s been > enough of that already.> > DISCARD support is one currently hot example, do you see that in AIO > > somewhere? Ok, probably a good thing for everybody anyway, so maybe > > patching that is useful work. But it''s extra work right now and probably > > no more fun to maintain than blktap is. > > fallocate() is being extended to allow hole-punching in files. I don''t > know what work is being done to do discard on random block devices, but > that''s clearly a generically useful thing to have.Hole punching for userland being in the works is actually great news. Just saw the patchset on ext4-devel flying by, great.> > The second issue I see is the XCP side of things. XenServer got a lot of > > benefit out of blktap2, and particularly because of the tapdevs. It > > promotes a fairly rigorous split between a blkback VBD, controlled by > > the agent, and tapdevs, controlled by XS''s storage manager. > > > > That doesn''t prevent blkback to go into userspace, but it better won''t > > share a process with some libblktap, which in turn would better not be > > controlled under the same xenstore path. > > > Could you elaborate on this? What was the benefit?It''s been mainly a matter of who controls what. Blktap1 was basically a VBD, controlled by the agent. Blktap2 is a VDI represented as a block device. Leaving management of that to XCP''s storage manager, which just hands that device node over to Xapi simplified many things. Before, the agent had to understand a lot about the type of storage, then talk to the right backend accordingly. Worse, in order to have storage management control a couple datapath features, you''d basically have to talk to Xapi, which would talk though xenstore to blktap, which was a bit tedious. :) Merging the VDI side of things with VBDs again just didn''t sound so attrative at first sight. But I''ve thinking a little longer about the issues in the meantime. I think it''s actually not so hard to do that without collateral damage. Let''s say we create an extension to tapdisk which speaks blkback''s datapath in userland. We''d basically put one of those tapdisks on every storage node, independent of the image type, such as a bare LUN or a VHD. We add a couple additional IPC calls to make it directly connect/disconnect to/from (ring-ref,event-channel) pairs. Means it doesn''t even need to talk xenstore, the control plane could all be left to some single daemon, which knows how to instruct the right tapdev (via libblktapctl) by looking at the physical-device node. I guess getting the control stuff out of the kernel is always a good idea. There are some important parts which would go missing. Such as ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to gntmap 352 pages simultaneously isn''t so good, so there still needs to be some bridge arbitrating them. I''d rather keep that in kernel space, okay to cram stuff like that into gntdev? It''d be much more straightforward than IPC. Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. Can''t find it now, what happened? Without, there''s presently still no zero-copy. Once the issues were solved, it''d be kinda nice. Simplifies stuff like memshr for blktap, which depends on getting hold of original grefs. We''d presumably still need the tapdev nodes, for qemu, etc. But those can stay non-xen aware then.> >> The only caveat is the stray unmapping problem, but I think gntdev can > >> be modified to deal with that pretty easily. > > Not easier than anything else in kernel space, but when dealing only > > with the refcounts, that''s as as good a place as anwhere else, yes. > > I think the refcount test is pretty straightforward - if the refcount is > 1, then we''re the sole owner of the page and we don''t need to worry > about any other users. If its > 1, then somebody else has it, and we > need to make sure it no longer refers to a granted page (which is just a > matter of doing a set_pte_atomic() to remap from present to present).[set_pte_atomic over grant ptes doesn''t work, or does it?]> Then we''d have a set of frames whose lifetimes are being determined by > some other subsystem. We can either maintain a list of them and poll > waiting for them to become free, or just release them and let them be > managed by the normal kernel lifetime rules (which requires that the > memory attached to them be completely normal, of course).The latter sounds like a good alternative to polling. So an unmap_and_replace, and giving up ownership thereafter. Next run of the dispatcher thread can can just refill the foreign pfn range via alloc_empty_pages(), to rebalance. Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Stefano Stabellini
2010-Nov-16 12:17 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Tue, 16 Nov 2010, Daniel Stodden wrote:> Let''s say we create an extension to tapdisk which speaks blkback''s > datapath in userland. We''d basically put one of those tapdisks on every > storage node, independent of the image type, such as a bare LUN or a > VHD. We add a couple additional IPC calls to make it directly > connect/disconnect to/from (ring-ref,event-channel) pairs. > > Means it doesn''t even need to talk xenstore, the control plane could all > be left to some single daemon, which knows how to instruct the right > tapdev (via libblktapctl) by looking at the physical-device node. I > guess getting the control stuff out of the kernel is always a good idea. > > There are some important parts which would go missing. Such as > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to > gntmap 352 pages simultaneously isn''t so good, so there still needs to > be some bridge arbitrating them. I''d rather keep that in kernel space, > okay to cram stuff like that into gntdev? It''d be much more > straightforward than IPC. > > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > Can''t find it now, what happened? Without, there''s presently still no > zero-copy. > > Once the issues were solved, it''d be kinda nice. Simplifies stuff like > memshr for blktap, which depends on getting hold of original grefs. > > We''d presumably still need the tapdev nodes, for qemu, etc. But those > can stay non-xen aware then. >Considering that there is a blkback implementation in qemu already, why don''t use it? I don''t certainly feel the need of yet another blkback implementation. A lot of people are working on qemu nowadays and this would let us exploit some of that work and contribute to it ourselves. We would only need to write a vhd block driver in qemu (even though a "vdi" driver is already present, I assume is not actually compatible?) and everything else is already there. We could reuse their qcow and qcow2 drivers that honestly are better maintained than ours (we receive a bug report per week about qcow/qcow2 not working properly). Finally qemu needs to be able to do I/O anyway because of the IDE emulation, so it has to be in the picture in a way or another. One day not far from now when we make virtio work on Xen, even the fast PV data path might go through qemu, so we might as well optimize it. After talking to the xapi guys to better understand their requirements, I am pretty sure that the new upstream qemu with QMP support would be able to satisfy them without issues. Of all the possible solutions, this is certainly the one that requires less lines of code and would allow us to reuse more resource that otherwise would just remain untapped. I backported the upstream xen_disk implementation to qemu-xen and run a test on the upstream 2.6.37rc1 kernel as dom0: VMs boot fine and performances seem to be interesting. For the moment I am thinking about enabling the qemu blkback implementation as a fallback in case blktap2 is not present in the system (ie: 2.6.37 kernels). _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dave Scott
2010-Nov-16 13:00 UTC
RE: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
Hi, Re: XCP's use of blktap2:> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: > > On 11/12/2010 07:55 PM, Daniel Stodden wrote: > > > The second issue I see is the XCP side of things. XenServer got a > lot of > > > benefit out of blktap2, and particularly because of the tapdevs. It > > > promotes a fairly rigorous split between a blkback VBD, controlled > by > > > the agent, and tapdevs, controlled by XS's storage manager. > > > > > > That doesn't prevent blkback to go into userspace, but it better > won't > > > share a process with some libblktap, which in turn would better not > be > > > controlled under the same xenstore path. > > > > > > Could you elaborate on this? What was the benefit? > > It's been mainly a matter of who controls what. Blktap1 was basically a > VBD, controlled by the agent. Blktap2 is a VDI represented as a block > device. Leaving management of that to XCP's storage manager, which just > hands that device node over to Xapi simplified many things. Before, the > agent had to understand a lot about the type of storage, then talk to > the right backend accordingly. Worse, in order to have storage > management control a couple datapath features, you'd basically have to > talk to Xapi, which would talk though xenstore to blktap, which was a > bit tedious. :)As Daniel says, XCP currently separates domain management (setting up, rebooting VMs) from storage management (attaching disks, snapshot, coalesce). In the current design the storage layer handles the storage control-path (instigating snapshots, clones, coalesce, dedup in future) through a storage API ("SMAPI") and provides a uniform interface to qemu, blkback for the data-path (currently in the form of a dom0 block device). In a VM start, xapi will first ask the storage control-path to make a disk available, and then pass this information to blkback/qemu. One of the trickiest things XCP handles is vhd "coalesce": merging a vhd file into its "parent". This comes up because vhds are arranged in a tree structure where the leaves are separate independent VM disks and the nodes represent shared common blocks, the result of (eg) cloning a single VM lots of times. When guest disks are deleted and the vhd leaves are removed, it sometimes becomes possible to save space by merging nodes together. The tricky bit is doing this while I/O is still being performed in parallel against logically separate (but related by parentage/history) disks on different hosts. It's necessary for the thing doing the coalescing to know where all the I/O is going on (eg to be able to find the host and pid where the related tapdisks (or qemus) live) and it's necessary for it to be able to signal to these processes when they need to re-read the vhd tree metadata. In the bad old blktap1 days, the storage control-path didn't know enough about the data-path to reliably signal the active tapdisks: IIRC the tapdisks were spawned by blktapctrl as a side-effect of the domain manager writing to xenstore. In the much better blktap2 days :) the storage control-path sets up (registers?) the data-path (currently via tap-ctl and a dom0 block device) and so it knows who to talk to in order to co-ordinate a coalesce. So I think the critical thing is to be able to have the storage control-path able to do something to "register" a data-path, enabling it to find later and signal any processes using that data-path. There are a bunch of different possibilities the storage control-path could use instead of using tap-ctl to create a block device, including: 1. directly spawn a tapdisk2 userspace process. Some identifier (pid, unix domain socket) could be passed to qemu allowing it to perform I/O. The block backend could be either in the tapdisk2 directly or in qemu? 2. return a (path to vhd file, callback unix domain socket). This could be passed to qemu (or something else) and qemu could use the callback socket to register its intention to use the data-path (and hence that it needs to be signaled if something changes) I'm sure there are lots of possibilities :-) Cheers, Dave _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Stefano Stabellini
2010-Nov-16 14:48 UTC
RE: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Tue, 16 Nov 2010, Dave Scott wrote:> Hi, > > Re: XCP''s use of blktap2: > > > On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: > > > On 11/12/2010 07:55 PM, Daniel Stodden wrote: > > > > The second issue I see is the XCP side of things. XenServer got a > > lot of > > > > benefit out of blktap2, and particularly because of the tapdevs. It > > > > promotes a fairly rigorous split between a blkback VBD, controlled > > by > > > > the agent, and tapdevs, controlled by XS''s storage manager. > > > > > > > > That doesn''t prevent blkback to go into userspace, but it better > > won''t > > > > share a process with some libblktap, which in turn would better not > > be > > > > controlled under the same xenstore path. > > > > > > > > > Could you elaborate on this? What was the benefit? > > > > It''s been mainly a matter of who controls what. Blktap1 was basically a > > VBD, controlled by the agent. Blktap2 is a VDI represented as a block > > device. Leaving management of that to XCP''s storage manager, which just > > hands that device node over to Xapi simplified many things. Before, the > > agent had to understand a lot about the type of storage, then talk to > > the right backend accordingly. Worse, in order to have storage > > management control a couple datapath features, you''d basically have to > > talk to Xapi, which would talk though xenstore to blktap, which was a > > bit tedious. :) > > As Daniel says, XCP currently separates domain management (setting up, rebooting VMs) from storage management (attaching disks, snapshot, coalesce). In the current design the storage layer handles the storage control-path (instigating snapshots, clones, coalesce, dedup in future) through a storage API ("SMAPI") and provides a uniform interface to qemu, blkback for the data-path (currently in the form of a dom0 block device). In a VM start, xapi will first ask the storage control-path to make a disk available, and then pass this information to blkback/qemu. > > One of the trickiest things XCP handles is vhd "coalesce": merging a vhd file into its "parent". This comes up because vhds are arranged in a tree structure where the leaves are separate independent VM disks and the nodes represent shared common blocks, the result of (eg) cloning a single VM lots of times. When guest disks are deleted and the vhd leaves are removed, it sometimes becomes possible to save space by merging nodes together. The tricky bit is doing this while I/O is still being performed in parallel against logically separate (but related by parentage/history) disks on different hosts. It''s necessary for the thing doing the coalescing to know where all the I/O is going on (eg to be able to find the host and pid where the related tapdisks (or qemus) live) and it''s necessary for it to be able to signal to these processes when they need to re-read the vhd tree metadata. > > In the bad old blktap1 days, the storage control-path didn''t know enough about the data-path to reliably signal the active tapdisks: IIRC the tapdisks were spawned by blktapctrl as a side-effect of the domain manager writing to xenstore. In the much better blktap2 days :) the storage control-path sets up (registers?) the data-path (currently via tap-ctl and a dom0 block device) and so it knows who to talk to in order to co-ordinate a coalesce. > > So I think the critical thing is to be able to have the storage control-path able to do something to "register" a data-path, enabling it to find later and signal any processes using that data-path. There are a bunch of different possibilities the storage control-path could use instead of using tap-ctl to create a block device, including: >Qemu could be spawned directly (even before the VM) and QMP could be use to communicate with it. The qemu pid and/or the socket to issue QMP commands could be used as identifiers.> > I''m sure there are lots of possibilities :-)Indeed. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2010-Nov-16 16:11 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
> I backported the upstream xen_disk implementation to qemu-xen > and run a test on the upstream 2.6.37rc1 kernel as dom0: VMs boot fine > and performances seem to be interesting. For the moment I am thinkingCan you quantify "interesting"? As in sucks-but-I-can-live-with-it or wow-it-is-as-fast-as-kernel-blkback? Or should ask this question later when you had a chance to run proper benchmarks? _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Stefano Stabellini
2010-Nov-16 16:16 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Tue, 16 Nov 2010, Konrad Rzeszutek Wilk wrote:> > I backported the upstream xen_disk implementation to qemu-xen > > and run a test on the upstream 2.6.37rc1 kernel as dom0: VMs boot fine > > and performances seem to be interesting. For the moment I am thinking > > Can you quantify "interesting"? As in sucks-but-I-can-live-with-it or > wow-it-is-as-fast-as-kernel-blkback?wow-it-is-as-fast-as-kernel-blkback> Or should ask this question later > when you had a chance to run proper benchmarks?that would be better :) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-16 17:56 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/16/2010 01:13 AM, Daniel Stodden wrote:> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: >> On 11/12/2010 07:55 PM, Daniel Stodden wrote: >>>> Surely this can be dealt with by replacing the mapped granted page with >>>> a local copy if the refcount is elevated? >>> Yeah. We briefly discussed this when the problem started to pop up >>> (again). >>> >>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with >>> a dummy page mapped in. You wouldn''t need a copy, a R/O zero map easily >>> does the job. >> Hm, I''d be a bit concerned that that might cause problems if used >> generically. > Yeah. It wasn''t a problem because all the network backends are on TCP, > where one can be rather sure that the dups are going to be properly > dropped. > > Does this hold everywhere ..? -- As mentioned below, the problem is > rather in AIO/DIO than being Xen-specific, so you can see the same > behavior on bare metal kernels too. A userspace app seeing an AIO > complete and then reusing that buffer elsewhere will occassionally > resend garbage over the network.Yeah, that sounds like a generic security problem. I presume the protocol will just discard the excess retransmit data, but it might mean a usermode program ends up transmitting secrets it never intended to...> There are some important parts which would go missing. Such as > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to > gntmap 352 pages simultaneously isn''t so good, so there still needs to > be some bridge arbitrating them. I''d rather keep that in kernel space, > okay to cram stuff like that into gntdev? It''d be much more > straightforward than IPC.What''s the problem? If you do nothing then it will appear to the kernel as a bunch of processes doing memory allocations, and they''ll get blocked/rate-limited accordingly if memory is getting short. There''s plenty of existing mechanisms to control that sort of thing (cgroups, etc) without adding anything new to the kernel. Or are you talking about something other than simple memory pressure? And there''s plenty of existing IPC mechanisms if you want them to explicitly coordinate with each other, but I''d tend to thing that''s premature unless you have something specific in mind.> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > Can''t find it now, what happened? Without, there''s presently still no > zero-copy.gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively new-ish) mmu notifier infrastructure which is intended to allow a device to sync an external MMU with usermode mappings. We''re not using it in precisely that way, but it allows us to wrangle grant mappings before the generic code tries to do normal pte ops on them.> Once the issues were solved, it''d be kinda nice. Simplifies stuff like > memshr for blktap, which depends on getting hold of original grefs. > > We''d presumably still need the tapdev nodes, for qemu, etc. But those > can stay non-xen aware then. > >>>> The only caveat is the stray unmapping problem, but I think gntdev can >>>> be modified to deal with that pretty easily. >>> Not easier than anything else in kernel space, but when dealing only >>> with the refcounts, that''s as as good a place as anwhere else, yes. >> I think the refcount test is pretty straightforward - if the refcount is >> 1, then we''re the sole owner of the page and we don''t need to worry >> about any other users. If its > 1, then somebody else has it, and we >> need to make sure it no longer refers to a granted page (which is just a >> matter of doing a set_pte_atomic() to remap from present to present). > [set_pte_atomic over grant ptes doesn''t work, or does it?]No, I forgot about grant ptes magic properties. But there is the hypercall.>> Then we''d have a set of frames whose lifetimes are being determined by >> some other subsystem. We can either maintain a list of them and poll >> waiting for them to become free, or just release them and let them be >> managed by the normal kernel lifetime rules (which requires that the >> memory attached to them be completely normal, of course). > The latter sounds like a good alternative to polling. So an > unmap_and_replace, and giving up ownership thereafter. Next run of the > dispatcher thread can can just refill the foreign pfn range via > alloc_empty_pages(), to rebalance.Do we actually need a "foreign page range"? Won''t any pfn do? If we start with a specific range of foreign pfns and then start freeing those pfns back to the kernel, we won''t have one for long... J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-16 21:28 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote:> On 11/16/2010 01:13 AM, Daniel Stodden wrote: > > On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: > >> On 11/12/2010 07:55 PM, Daniel Stodden wrote: > >>>> Surely this can be dealt with by replacing the mapped granted page with > >>>> a local copy if the refcount is elevated? > >>> Yeah. We briefly discussed this when the problem started to pop up > >>> (again). > >>> > >>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with > >>> a dummy page mapped in. You wouldn''t need a copy, a R/O zero map easily > >>> does the job. > >> Hm, I''d be a bit concerned that that might cause problems if used > >> generically. > > Yeah. It wasn''t a problem because all the network backends are on TCP, > > where one can be rather sure that the dups are going to be properly > > dropped. > > > > Does this hold everywhere ..? -- As mentioned below, the problem is > > rather in AIO/DIO than being Xen-specific, so you can see the same > > behavior on bare metal kernels too. A userspace app seeing an AIO > > complete and then reusing that buffer elsewhere will occassionally > > resend garbage over the network. > > Yeah, that sounds like a generic security problem. I presume the > protocol will just discard the excess retransmit data, but it might mean > a usermode program ends up transmitting secrets it never intended to... > > > There are some important parts which would go missing. Such as > > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to > > gntmap 352 pages simultaneously isn''t so good, so there still needs to > > be some bridge arbitrating them. I''d rather keep that in kernel space, > > okay to cram stuff like that into gntdev? It''d be much more > > straightforward than IPC. > > What''s the problem? If you do nothing then it will appear to the kernel > as a bunch of processes doing memory allocations, and they''ll get > blocked/rate-limited accordingly if memory is getting short.The problem is that just letting the page allocator work through allocations isn''t going to scale anywhere. The worst case memory requested under load is <number-of-disks> * (32 * 11 pages). As a (conservative) rule of thumb, N will be 200 or rather better. The number of I/O actually in-flight at any point, in contrast, is derived from the queue/sg sizes of the physical device. For a simple disk, that''s about a ring or two.> There''s > plenty of existing mechanisms to control that sort of thing (cgroups, > etc) without adding anything new to the kernel. Or are you talking > about something other than simple memory pressure? > > And there''s plenty of existing IPC mechanisms if you want them to > explicitly coordinate with each other, but I''d tend to thing that''s > premature unless you have something specific in mind. > > > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > > Can''t find it now, what happened? Without, there''s presently still no > > zero-copy. > > gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively > new-ish) mmu notifier infrastructure which is intended to allow a device > to sync an external MMU with usermode mappings. We''re not using it in > precisely that way, but it allows us to wrangle grant mappings before > the generic code tries to do normal pte ops on them.The mmu notifiers were for safe teardown only. They are not sufficient for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll need to back those VMAs with page structs. Or bounce again (gulp, just mentioning it). As with the blktap2 patches, note there is no difference in the dom0 memory bill, it takes page frames. This is pretty much exactly the pooling stuff in current drivers/blktap. The interface could look as follows ([] designates users). * [toolstack] Calling some ctls to create/destroy ctls pools of frames. (Blktap currently does this in sysfs.) * [toolstack] Optionally resize them, according to the physical queue depth [estimate] of the underlying storage. * [tapdisk] A backend instance, when starting up, opens a gntdev, then uses a ctl to bind its gntdev handle to a frame pool. * [tapdisk] The .mmap call now will allocate frames to back the VMA. This operation can fail/block under congestion. Neither is desirable, so we need a .poll. * [tapdisk] To integrate grant mappings with a single-threaded event loop, use .poll. The handle fires as soon as a request can be mapped. Under congestion, the .poll code will queue waiting disks and wake them round-robin, once VMAs are released. (A [tapdisk] doesn''t mean to dismiss a potential [qemu].) Still confident we want that? (Seriously asking). A lot of the code to do so has been written for blktap, it wouldn''t be hard to bend into a gntdev extension.> > Once the issues were solved, it''d be kinda nice. Simplifies stuff like > > memshr for blktap, which depends on getting hold of original grefs. > > > > We''d presumably still need the tapdev nodes, for qemu, etc. But those > > can stay non-xen aware then. > > > >>>> The only caveat is the stray unmapping problem, but I think gntdev can > >>>> be modified to deal with that pretty easily. > >>> Not easier than anything else in kernel space, but when dealing only > >>> with the refcounts, that''s as as good a place as anwhere else, yes. > >> I think the refcount test is pretty straightforward - if the refcount is > >> 1, then we''re the sole owner of the page and we don''t need to worry > >> about any other users. If its > 1, then somebody else has it, and we > >> need to make sure it no longer refers to a granted page (which is just a > >> matter of doing a set_pte_atomic() to remap from present to present). > > [set_pte_atomic over grant ptes doesn''t work, or does it?] > > No, I forgot about grant ptes magic properties. But there is the hypercall.Yup.> >> Then we''d have a set of frames whose lifetimes are being determined by > >> some other subsystem. We can either maintain a list of them and poll > >> waiting for them to become free, or just release them and let them be > >> managed by the normal kernel lifetime rules (which requires that the > >> memory attached to them be completely normal, of course). > > The latter sounds like a good alternative to polling. So an > > unmap_and_replace, and giving up ownership thereafter. Next run of the > > dispatcher thread can can just refill the foreign pfn range via > > alloc_empty_pages(), to rebalance. > > Do we actually need a "foreign page range"? Won''t any pfn do? If we > start with a specific range of foreign pfns and then start freeing those > pfns back to the kernel, we won''t have one for long...I guess we''ve been meaning the same thing here, unless I''m misunderstanding you. Any pfn does, and the balloon pagevec allocations default to order 0 entries indeed. Sorry, you''re right, that''s not a ''range''. With a pending re-xmit, the backend can find a couple (or all) of the request frames have count>1. It can flip and abandon those as normal memory. But it will need those lost memory slots back, straight away or next time it''s running out of frames. As order-0 allocations. Foreign memory is deliberately short. Blkback still defaults to 2 rings worth of address space, iirc, globally. That''s what that mempool sysfs stuff in the later blktap2 patches aimed at -- making the size configurable where queue length matters, and isolate throughput between physical backends, where the toolstack wants to care. Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-17 02:40 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Tue, 2010-11-16 at 07:17 -0500, Stefano Stabellini wrote:> On Tue, 16 Nov 2010, Daniel Stodden wrote: > > Let''s say we create an extension to tapdisk which speaks blkback''s > > datapath in userland. We''d basically put one of those tapdisks on every > > storage node, independent of the image type, such as a bare LUN or a > > VHD. We add a couple additional IPC calls to make it directly > > connect/disconnect to/from (ring-ref,event-channel) pairs. > > > > Means it doesn''t even need to talk xenstore, the control plane could all > > be left to some single daemon, which knows how to instruct the right > > tapdev (via libblktapctl) by looking at the physical-device node. I > > guess getting the control stuff out of the kernel is always a good idea. > > > > There are some important parts which would go missing. Such as > > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to > > gntmap 352 pages simultaneously isn''t so good, so there still needs to > > be some bridge arbitrating them. I''d rather keep that in kernel space, > > okay to cram stuff like that into gntdev? It''d be much more > > straightforward than IPC. > > > > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > > Can''t find it now, what happened? Without, there''s presently still no > > zero-copy. > > > > Once the issues were solved, it''d be kinda nice. Simplifies stuff like > > memshr for blktap, which depends on getting hold of original grefs. > > > > We''d presumably still need the tapdev nodes, for qemu, etc. But those > > can stay non-xen aware then. > > > > Considering that there is a blkback implementation in qemu already, why > don''t use it? I don''t certainly feel the need of yet another blkback > implementation. > A lot of people are working on qemu nowadays and this would let us > exploit some of that work and contribute to it ourselves. > We would only need to write a vhd block driver in qemu (even though a > "vdi" driver is already present, I assume is not actually compatible?) > and everything else is already there. > We could reuse their qcow and qcow2 drivers that honestly are better > maintained than ours (we receive a bug report per week about qcow/qcow2 > not working properly). > Finally qemu needs to be able to do I/O anyway because of the IDE > emulation, so it has to be in the picture in a way or another. One day > not far from now when we make virtio work on Xen, even the fast PV > data path might go through qemu, so we might as well optimize it. > After talking to the xapi guys to better understand their requirements, > I am pretty sure that the new upstream qemu with QMP support would be > able to satisfy them without issues. > Of all the possible solutions, this is certainly the one that requires > less lines of code and would allow us to reuse more resource that > otherwise would just remain untapped. > > I backported the upstream xen_disk implementation to qemu-xen > and run a test on the upstream 2.6.37rc1 kernel as dom0: VMs boot fine > and performances seem to be interesting. For the moment I am thinking > about enabling the qemu blkback implementation as a fallback in case > blktap2 is not present in the system (ie: 2.6.37 kernels).I''m not against reducing code and effort. But in order to switch to a different base we would need a drop-in match for VHD and at least a good match for all the control machinery on which xen-sm presently depends. There''s also a lot of investment in filter drivers etc. Then there is SM control, stuff like pause/unpause to get guests off the storage nodes for snapshot/coalesce, more recently calls for statistics and monitoring, tweaking some physical I/O details, etc. Used to be a bitch, nowadays it''s somewhat simpler, but that''s all stuff we completely depend on. Moving blkback out of kernel space, into tapdisk, is predictable in size and complexity. Replacing tapdisks altogether would be quite a different story. The remainder below isn''t fully qualified, just random bits coming to my mind, assuming you''re not talking about sharing code/libs and frameworks, but actual processes. 1st, what''s the rationale with fully PV''d guests on xen? (That argument might not count if just taking qemu as the container process and stripping emulation for those.) Related, there''s the question of memory footprint. Kernel blkback is extremely lightweight. Moving the datapath into userland can create headaches, especially on 32bit dom0s with lot of guests and disks on backends which used to be bare LUNs under blkback. That''s a problem tapdisk has to face too, just wondering about the size of the issue in qemu. Related, Xapi depends a lot on dom0 plugs, where the datapath can be somewhat hairy when it comes to blocking I/O and resource allocation. Then there is sharing. Storage activation normally doesn''t operate in a specific VM context. It presently doesn''t even relate to a particular VBD, much less a VM. For qemu alone, putting storage virtualization into the same address space is an obvious choice. For Xen, enforcing that sounds like a step backward.>From the shared-framework perspective, and the amount of code involved:The ring path alone is too small to consider, and the more difficult parts on top of that like state machines for write ordering and syncing etc are hard to share because the depend on the queue implementation and image driver interface. Control might be a different story. As far as frontend/backend IPC via xenstore goes, right now I still feel like those backends could be managed by a single daemon, similar to what blktapctrl did (let''s just make it stateless/restartable this time). I guess qemu processes run their xenstore trees already fine, but internally? Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Stefano Stabellini
2010-Nov-17 12:35 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Wed, 17 Nov 2010, Daniel Stodden wrote:> I''m not against reducing code and effort. But in order to switch to a > different base we would need a drop-in match for VHD and at least a good > match for all the control machinery on which xen-sm presently depends. > There''s also a lot of investment in filter drivers etc. >I am hoping we don''t actually have to rewrite all that code but that we would be able to refactor it to fit the qemu driver APIs.> Then there is SM control, stuff like pause/unpause to get guests off the > storage nodes for snapshot/coalesce, more recently calls for statistics > and monitoring, tweaking some physical I/O details, etc. Used to be a > bitch, nowadays it''s somewhat simpler, but that''s all stuff we > completely depend on. >Upstream qemu has an RPC interface called QMP, that can be used to issue commands and retrieve informations. For example snapshots are already supported by this interface. We need to support QMP one way or another, even only for the pure qemu emulation use case, so we might as well exploit it.> Moving blkback out of kernel space, into tapdisk, is predictable in size > and complexity. Replacing tapdisks altogether would be quite a different > story. >This is not a one day project, we have time for this. I was thinking about starting to make use of blkback qemu (with the current qemu-xen) as a fallback when blkback2 is not present, mainly to allow developers to work with upstream 2.6.37. Meanwhile in the next few months upstream qemu should apply our patches and therefore we could start using upstream qemu for development with xen-unstable. Upstream qemu will offer much better aio support and QMP. At that point we could start adding a VHD driver to upstream qemu and slowly everything else we need. By the time 2.6.38/39 is out we could have a proper backend with VHD support. What do you think?> The remainder below isn''t fully qualified, just random bits coming to my > mind, assuming you''re not talking about sharing code/libs and > frameworks, but actual processes. > > 1st, what''s the rationale with fully PV''d guests on xen? (That argument > might not count if just taking qemu as the container process and > stripping emulation for those.) >PV guest needs qemu already for the console and the framebuffer backends, so it just fits in the current picture without modifications.> Related, there''s the question of memory footprint. Kernel blkback is > extremely lightweight. Moving the datapath into userland can create > headaches, especially on 32bit dom0s with lot of guests and disks on > backends which used to be bare LUNs under blkback. That''s a problem > tapdisk has to face too, just wondering about the size of the issue in > qemu.The PV qemu (the qemu run for PV guests) is very different from the HVM qemu: it does very little, only runs the backends. I expect its memory footprint to be really small.> > Related, Xapi depends a lot on dom0 plugs, where the datapath can be > somewhat hairy when it comes to blocking I/O and resource allocation. >I am not sure what do you mean here.> Then there is sharing. Storage activation normally doesn''t operate in a > specific VM context. It presently doesn''t even relate to a particular > VBD, much less a VM. For qemu alone, putting storage virtualization into > the same address space is an obvious choice. For Xen, enforcing that > sounds like a step backward.Having the backend in qemu and running it in a VM context are two different things. Qemu is very flexible in this regard, with one line change (or maybe just different command line options) you can have qemu doing hardware emulation, PV backends, both, or only some PV backends. You could have: - 1 qemu doing hardware emulation and PV backend; - 1 qemu doing hardware emulation and another one doing PV backends; - 1 qemu doing hardware emulation and 1 qemu doing some PV backends and 1 qemu doing the other PV backends; and so on. The only thing that cannot be easily split at the moment is the hardware emulation, but the backends are completely modular.> > >From the shared-framework perspective, and the amount of code involved: > The ring path alone is too small to consider, and the more difficult > parts on top of that like state machines for write ordering and syncing > etc are hard to share because the depend on the queue implementation and > image driver interface. >Yeah, if you are thinking about refactoring blktap2 in libraries and use them in both the stand alone tapdisk case and the qemu case, it is probably not worth it. In any case in qemu they do not depend on the image driver interface because it is generic.> Control might be a different story. As far as frontend/backend IPC via > xenstore goes, right now I still feel like those backends could be > managed by a single daemon, similar to what blktapctrl did (let''s just > make it stateless/restartable this time). I guess qemu processes run > their xenstore trees already fine, but internally?Yes, they do. There is a generic xen_backend interface that adds support for Xen frontend/backend pairs. Using xen_backend each qemu instance listens to its own xenstore backend path. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jonathan Ludlam
2010-Nov-17 15:34 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 17 Nov 2010, at 12:35, Stefano Stabellini wrote: On Wed, 17 Nov 2010, Daniel Stodden wrote: Related, Xapi depends a lot on dom0 plugs, where the datapath can be somewhat hairy when it comes to blocking I/O and resource allocation. I am not sure what do you mean here. We would very much like to move away from dom0 block attaches. Currently they''re required for (probably non exhaustive list): - pygrub - VM import/export - vdi import/export - suspend/resume - Demo Linux VM installation (XS only) Dave is already working on moving the export code to a domU. The VDI stuff is similar, and suspend resume could be done via plugging into e.g. qemu stub-domains, and pygrub could be eventually replaced by PVgrub. So it''s not inconceivable that XCP might no longer require dom0 block attaches in a similar timescale to that being discussed. Jon _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andres Lagar-Cavilla
2010-Nov-17 16:36 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
I''ll throw an idea there and you educate me why it''s lame. Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn''s with fake page structs that make get_user_pages happy, quirky grant PTEs, etc. Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out. The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out: 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them. 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn''s is swapped with the domU mfn. Pfn''s and page struct''s on both ends remain untouched. 3. For blkfront reads, call swap when stuffing the response back into the ring 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don''t have a weirdo granted PTE, or work with a frame from other domain. It''s your page all along, dom0 5. One assumption for domU is that pages allocated as blkfront buffers won''t be touched by anybody, so a) it''s safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle) 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap''s. I.e. all these hypercalls won''t happen as often and at the granularity as skbuff''s demanded for GNTTABOP_transfer 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it''s not really zero-copy. But the approach opens a window to async memcpy . From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don''t know about how practical that is though. Problems at first glance: 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback. 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren''t all these hypercalls ultimately more expensive than memcpy for GNTABOP_transfer for netback? 3. Managing the pool of backend reserved pages may be a problem? So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I''d like to hear other problems people may find with this approach Andres> Message: 3 > Date: Tue, 16 Nov 2010 13:28:51 -0800 > From: Daniel Stodden <daniel.stodden@citrix.com> > Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy. > To: Jeremy Fitzhardinge <jeremy@goop.org> > Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com> > Message-ID: <1289942932.11102.802.camel@agari.van.xensource.com> > Content-Type: text/plain; charset="UTF-8" > > On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote: >> On 11/16/2010 01:13 AM, Daniel Stodden wrote: >>> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote: >>>> On 11/12/2010 07:55 PM, Daniel Stodden wrote: >>>>>> Surely this can be dealt with by replacing the mapped granted page with >>>>>> a local copy if the refcount is elevated? >>>>> Yeah. We briefly discussed this when the problem started to pop up >>>>> (again). >>>>> >>>>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with >>>>> a dummy page mapped in. You wouldn''t need a copy, a R/O zero map easily >>>>> does the job. >>>> Hm, I''d be a bit concerned that that might cause problems if used >>>> generically. >>> Yeah. It wasn''t a problem because all the network backends are on TCP, >>> where one can be rather sure that the dups are going to be properly >>> dropped. >>> >>> Does this hold everywhere ..? -- As mentioned below, the problem is >>> rather in AIO/DIO than being Xen-specific, so you can see the same >>> behavior on bare metal kernels too. A userspace app seeing an AIO >>> complete and then reusing that buffer elsewhere will occassionally >>> resend garbage over the network. >> >> Yeah, that sounds like a generic security problem. I presume the >> protocol will just discard the excess retransmit data, but it might mean >> a usermode program ends up transmitting secrets it never intended to... >> >>> There are some important parts which would go missing. Such as >>> ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to >>> gntmap 352 pages simultaneously isn''t so good, so there still needs to >>> be some bridge arbitrating them. I''d rather keep that in kernel space, >>> okay to cram stuff like that into gntdev? It''d be much more >>> straightforward than IPC. >> >> What''s the problem? If you do nothing then it will appear to the kernel >> as a bunch of processes doing memory allocations, and they''ll get >> blocked/rate-limited accordingly if memory is getting short. > > The problem is that just letting the page allocator work through > allocations isn''t going to scale anywhere. > > The worst case memory requested under load is <number-of-disks> * (32 * > 11 pages). As a (conservative) rule of thumb, N will be 200 or rather > better. > > The number of I/O actually in-flight at any point, in contrast, is > derived from the queue/sg sizes of the physical device. For a simple > disk, that''s about a ring or two. > >> There''s >> plenty of existing mechanisms to control that sort of thing (cgroups, >> etc) without adding anything new to the kernel. Or are you talking >> about something other than simple memory pressure? >> >> And there''s plenty of existing IPC mechanisms if you want them to >> explicitly coordinate with each other, but I''d tend to thing that''s >> premature unless you have something specific in mind. >> >>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. >>> Can''t find it now, what happened? Without, there''s presently still no >>> zero-copy. >> >> gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively >> new-ish) mmu notifier infrastructure which is intended to allow a device >> to sync an external MMU with usermode mappings. We''re not using it in >> precisely that way, but it allows us to wrangle grant mappings before >> the generic code tries to do normal pte ops on them. > > The mmu notifiers were for safe teardown only. They are not sufficient > for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll > need to back those VMAs with page structs. Or bounce again (gulp, just > mentioning it). As with the blktap2 patches, note there is no difference > in the dom0 memory bill, it takes page frames. > > This is pretty much exactly the pooling stuff in current drivers/blktap. > The interface could look as follows ([] designates users). > > * [toolstack] > Calling some ctls to create/destroy ctls pools of frames. > (Blktap currently does this in sysfs.) > > * [toolstack] > Optionally resize them, according to the physical queue > depth [estimate] of the underlying storage. > > * [tapdisk] > A backend instance, when starting up, opens a gntdev, then > uses a ctl to bind its gntdev handle to a frame pool. > > * [tapdisk] > The .mmap call now will allocate frames to back the VMA. > This operation can fail/block under congestion. Neither > is desirable, so we need a .poll. > > * [tapdisk] > To integrate grant mappings with a single-threaded event loop, > use .poll. The handle fires as soon as a request can be mapped. > > Under congestion, the .poll code will queue waiting disks and wake > them round-robin, once VMAs are released. > > (A [tapdisk] doesn''t mean to dismiss a potential [qemu].) > > Still confident we want that? (Seriously asking). A lot of the code to > do so has been written for blktap, it wouldn''t be hard to bend into a > gntdev extension. > >>> Once the issues were solved, it''d be kinda nice. Simplifies stuff like >>> memshr for blktap, which depends on getting hold of original grefs. >>> >>> We''d presumably still need the tapdev nodes, for qemu, etc. But those >>> can stay non-xen aware then. >>> >>>>>> The only caveat is the stray unmapping problem, but I think gntdev can >>>>>> be modified to deal with that pretty easily. >>>>> Not easier than anything else in kernel space, but when dealing only >>>>> with the refcounts, that''s as as good a place as anwhere else, yes. >>>> I think the refcount test is pretty straightforward - if the refcount is >>>> 1, then we''re the sole owner of the page and we don''t need to worry >>>> about any other users. If its > 1, then somebody else has it, and we >>>> need to make sure it no longer refers to a granted page (which is just a >>>> matter of doing a set_pte_atomic() to remap from present to present). >>> [set_pte_atomic over grant ptes doesn''t work, or does it?] >> >> No, I forgot about grant ptes magic properties. But there is the hypercall. > > Yup. > >>>> Then we''d have a set of frames whose lifetimes are being determined by >>>> some other subsystem. We can either maintain a list of them and poll >>>> waiting for them to become free, or just release them and let them be >>>> managed by the normal kernel lifetime rules (which requires that the >>>> memory attached to them be completely normal, of course). >>> The latter sounds like a good alternative to polling. So an >>> unmap_and_replace, and giving up ownership thereafter. Next run of the >>> dispatcher thread can can just refill the foreign pfn range via >>> alloc_empty_pages(), to rebalance. >> >> Do we actually need a "foreign page range"? Won''t any pfn do? If we >> start with a specific range of foreign pfns and then start freeing those >> pfns back to the kernel, we won''t have one for long... > > I guess we''ve been meaning the same thing here, unless I''m > misunderstanding you. Any pfn does, and the balloon pagevec allocations > default to order 0 entries indeed. Sorry, you''re right, that''s not a > ''range''. With a pending re-xmit, the backend can find a couple (or all) > of the request frames have count>1. It can flip and abandon those as > normal memory. But it will need those lost memory slots back, straight > away or next time it''s running out of frames. As order-0 allocations. > > Foreign memory is deliberately short. Blkback still defaults to 2 rings > worth of address space, iirc, globally. That''s what that mempool sysfs > stuff in the later blktap2 patches aimed at -- making the size > configurable where queue length matters, and isolate throughput between > physical backends, where the toolstack wants to care. > > Daniel > > > > > ------------------------------ > > Message: 4 > Date: Tue, 16 Nov 2010 13:42:54 -0800 (PST) > From: Boris Derzhavets <bderzhavets@yahoo.com> > Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to > handle kernel paging request > To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: Jeremy Fitzhardinge <jeremy@goop.org>, > xen-devel@lists.xensource.com, Bruce Edge <bruce.edge@gmail.com> > Message-ID: <923132.8834.qm@web56101.mail.re3.yahoo.com> > Content-Type: text/plain; charset="us-ascii" > >> So what I think you are saying is that you keep on getting the bug in DomU? >> Is the stack-trace the same as in rc1? > > Yes. > When i want to get 1-2 hr of stable work :- > > # service network restart > # service nfs restart > > at Dom0. > > I also believe that presence of xen-pcifront.fix.patch is making things much more stable > on F14. > > Boris. > > --- On Tue, 11/16/10, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to handle kernel paging request > To: "Boris Derzhavets" <bderzhavets@yahoo.com> > Cc: "Jeremy Fitzhardinge" <jeremy@goop.org>, xen-devel@lists.xensource.com, "Bruce Edge" <bruce.edge@gmail.com> > Date: Tuesday, November 16, 2010, 4:15 PM > > On Tue, Nov 16, 2010 at 12:43:28PM -0800, Boris Derzhavets wrote: >>> Huh. I .. what? I am confused. I thought we established that the issue >>> was not related to Xen PCI front? You also seem to uncomment the >>> upstream.core.patches and the xen.pvhvm.patch - why? >> >> I cannot uncomment upstream.core.patches and the xen.pvhvm.patch >> it gives failed HUNKs > > Uhh.. I am even more confused. >> >>> Ok, they are.. v2.6.37-rc2 which came out today has the fixes >> >> I am pretty sure rc2 doesn''t contain everything from xen.next-2.6.37.patch, >> gntdev''s stuff for sure. I''ve built 2.6.37-rc2 kernel rpms and loaded >> kernel-2.6.27-rc2.git0.xendom0.x86_64 under Xen 4.0.1. >> Device /dev/xen/gntdev has not been created. I understand that it''s >> unrelated to DomU ( related to Dom0) , but once again with rc2 in DomU i cannot >> get 3.2 GB copied over to DomU from NFS share at Dom0. > > So what I think you are saying is that you keep on getting the bug in DomU? > Is the stack-trace the same as in rc1? > > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: http://lists.xensource.com/archives/html/xen-devel/attachments/20101116/015048ae/attachment.html > > ------------------------------ > > Message: 5 > Date: Tue, 16 Nov 2010 13:49:14 -0800 (PST) > From: Boris Derzhavets <bderzhavets@yahoo.com> > Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to > handle kernel paging request > To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: Jeremy Fitzhardinge <jeremy@goop.org>, > xen-devel@lists.xensource.com, Bruce Edge <bruce.edge@gmail.com> > Message-ID: <228566.47308.qm@web56106.mail.re3.yahoo.com> > Content-Type: text/plain; charset="iso-8859-1" > > Yes, here we are > > [ 186.975228] ------------[ cut here ]------------ > [ 186.975245] kernel BUG at mm/mmap.c:2399! > [ 186.975254] invalid opcode: 0000 [#1] SMP > [ 186.975269] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map > [ 186.975284] CPU 0 > [ 186.975290] Modules linked in: nfs fscache deflate zlib_deflate ctr camellia cast5 rmd160 crypto_null ccm serpent blowfish twofish_generic twofish_x86_64 twofish_common ecb xcbc cbc sha256_generic sha512_generic des_generic cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 uinput xen_netfront microcode xen_blkfront [last unloaded: scsi_wait_scan] > [ 186.975507] > [ 186.975515] Pid: 1562, comm: ls Not tainted 2.6.37-0.1.rc1.git8.xendom0.fc14.x86_64 #1 / > [ 186.975529] RIP: e030:[<ffffffff8110ada1>] [<ffffffff8110ada1>] exit_mmap+0x10c/0x119 > [ 186.975550] RSP: e02b:ffff8800781bde18 EFLAGS: 00010202 > [ 186.975560] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > [ 186.975573] RDX: 00000000914a9149 RSI: 0000000000000001 RDI: ffffea00000c0280 > [ 186.975585] RBP: ffff8800781bde48 R08: ffffea00000c0280 R09: 0000000000000001 > [ 186.975598] R10: ffffffff8100750f R11: ffffea0000967778 R12: ffff880076c68b00 > [ 186.975610] R13: ffff88007f83f1e0 R14: ffff880076c68b68 R15: 0000000000000001 > [ 186.975625] FS: 00007f8e471d97c0(0000) GS:ffff88007f831000(0000) knlGS:0000000000000000 > [ 186.975639] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 186.975650] CR2: 00007f8e464a9940 CR3: 0000000001a03000 CR4: 0000000000002660 > [ 186.975663] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 186.976012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 186.976012] Process ls (pid: 1562, threadinfo ffff8800781bc000, task ffff8800788223e0) > [ 186.976012] Stack: > [ 186.976012] 000000000000006b ffff88007f83f1e0 ffff8800781bde38 ffff880076c68b00 > [ 186.976012] ffff880076c68c40 ffff8800788229d0 ffff8800781bde68 ffffffff810505fc > [ 186.976012] ffff8800788223e0 ffff880076c68b00 ffff8800781bdeb8 ffffffff81056747 > [ 186.976012] Call Trace: > [ 186.976012] [<ffffffff810505fc>] mmput+0x65/0xd8 > [ 186.976012] [<ffffffff81056747>] exit_mm+0x13e/0x14b > [ 186.976012] [<ffffffff81056976>] do_exit+0x222/0x7c6 > [ 186.976012] [<ffffffff8100750f>] ? xen_restore_fl_direct_end+0x0/0x1 > [ 186.976012] [<ffffffff8107ea7c>] ? arch_local_irq_restore+0xb/0xd > [ 186.976012] [<ffffffff814b3949>] ? lockdep_sys_exit_thunk+0x35/0x67 > [ 186.976012] [<ffffffff810571b0>] do_group_exit+0x88/0xb6 > [ 186.976012] [<ffffffff810571f5>] sys_exit_group+0x17/0x1b > [ 186.976012] [<ffffffff8100acf2>] system_call_fastpath+0x16/0x1b > [ 186.976012] Code: 8d 7d 18 e8 c3 8a 00 00 41 c7 45 08 00 00 00 00 48 89 df e8 0d e9 ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 98 01 00 00 00 74 02 <0f> 0b 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 54 53 48 > [ 186.976012] RIP [<ffffffff8110ada1>] exit_mmap+0x10c/0x119 > [ 186.976012] RSP <ffff8800781bde18> > [ 186.976012] ---[ end trace c0f4eff4054a67e4 ]--- > [ 186.976012] Fixing recursive fault but reboot is needed! > > Message from syslogd@fedora14 at Nov 17 00:47:40 ... > kernel:[ 186.975228] ------------[ cut here ]------------ > > Message from syslogd@fedora14 at Nov 17 00:47:40 ... > kernel:[ 186.975254] invalid opcode: 0000 [#1] SMP > > Message from syslogd@fedora14 at Nov 17 00:47:40 ... > kernel:[ 186.975269] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map > > Message from syslogd@fedora14 at Nov 17 00:47:40 ... > kernel:[ 186.976012] Stack: > > Message from syslogd@fedora14 at Nov 17 00:47:40 ... > kernel:[ 186.976012] Call Trace: > > Message from syslogd@fedora14 at Nov 17 00:47:40 ... > kernel:[ 186.976012] Code: 8d 7d 18 e8 c3 8a 00 00 41 c7 45 08 00 00 00 00 48 89 df e8 0d e9 ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 98 01 00 00 00 74 02 <0f> 0b 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 54 53 48 > > --- On Tue, 11/16/10, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to handle kernel paging request > To: "Boris Derzhavets" <bderzhavets@yahoo.com> > Cc: "Jeremy Fitzhardinge" <jeremy@goop.org>, xen-devel@lists.xensource.com, "Bruce Edge" <bruce.edge@gmail.com> > Date: Tuesday, November 16, 2010, 4:15 PM > > On Tue, Nov 16, 2010 at 12:43:28PM -0800, Boris Derzhavets wrote: >>> Huh. I .. what? I am confused. I thought we established that the issue >>> was not related to Xen PCI front? You also seem to uncomment the >>> upstream.core.patches and the xen.pvhvm.patch - why? >> >> I cannot uncomment upstream.core.patches and the xen.pvhvm.patch >> it gives failed HUNKs > > Uhh.. I am even more confused. >> >>> Ok, they are.. v2.6.37-rc2 which came out today has the fixes >> >> I am pretty sure rc2 doesn''t contain everything from xen.next-2.6.37.patch, >> gntdev''s stuff for sure. I''ve built 2.6.37-rc2 kernel rpms and loaded >> kernel-2.6.27-rc2.git0.xendom0.x86_64 under Xen 4.0.1. >> Device /dev/xen/gntdev has not been created. I understand that it''s >> unrelated to DomU ( related to Dom0) , but once again with rc2 in DomU i cannot >> get 3.2 GB copied over to DomU from NFS share at Dom0. > > So what I think you are saying is that you keep on getting the bug in DomU? > Is the stack-trace the same as in rc1? > > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: http://lists.xensource.com/archives/html/xen-devel/attachments/20101116/84bccfd3/attachment.html > > ------------------------------ > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel > > > End of Xen-devel Digest, Vol 69, Issue 218 > ******************************************_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2010-Nov-17 17:04 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Tue, 2010-11-16 at 21:28 +0000, Daniel Stodden wrote:> > > > Also, I was absolutely certain I once saw VM_FOREIGN support in > gntdev.. > > > Can''t find it now, what happened? Without, there''s presently still > no > > > zero-copy. > > > > gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively > > new-ish) mmu notifier infrastructure which is intended to allow a > device > > to sync an external MMU with usermode mappings. We''re not using it > in > > precisely that way, but it allows us to wrangle grant mappings > before > > the generic code tries to do normal pte ops on them. > > The mmu notifiers were for safe teardown only. They are not sufficient > for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll > need to back those VMAs with page structs. Or bounce again (gulp, > just mentioning it). As with the blktap2 patches, note there is no > difference in the dom0 memory bill, it takes page frames.I though the VM_FOREIGN stuff which blktap[12] hooks into gup() was there in order to avoid some deadlock arising from having a single struct page taking part in nested block I/O. i.e. one I/O created by blkback and the second from the tapdisk process deadlock against each other. I though that in order to work around this blktap creates a second range of struct pages (the struct vm_foreign_map array in vma->vm_private_data) which aliases the pages under I/O and then it hooks gup() to do the switcheroo when necessary. If the blkback interface is running in userspace (whether in a tapdisk or qemu process) then there is no nesting of I/O and the only I/O is from process context and therefore this particular issue is no longer a problem because we can use a properly struct page backed page without needing a magic VM_FOREIGN shadow of it? Have I misunderstood something about the reason for VM_FOREIGN? Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-17 17:52 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote:> I''ll throw an idea there and you educate me why it''s lame. > > Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn''s with fake page structs that make get_user_pages happy, quirky grant PTEs, etc. > > Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out. > > The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out: > > 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them. > > 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn''s is swapped with the domU mfn. Pfn''s and page struct''s on both ends remain untouched.Would GNTTABOP_swap also require the domU to have already unmapped the page from its own pagetables? Presumably it would fail if it didn''t, otherwise you''d end up with a domU mapping the same mfn as a dom0-private page.> 3. For blkfront reads, call swap when stuffing the response back into the ring > > 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don''t have a weirdo granted PTE, or work with a frame from other domain. It''s your page all along, dom0 > > 5. One assumption for domU is that pages allocated as blkfront buffers won''t be touched by anybody, so a) it''s safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle) > > 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap''s. I.e. all these hypercalls won''t happen as often and at the granularity as skbuff''s demanded for GNTTABOP_transfer > > 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it''s not really zero-copy. But the approach opens a window to async memcpy . From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don''t know about how practical that is though.I think that will be the common case - the kernel will always attempt to write dirty pagecache pages to make clean ones, and it will still want them around to access. So it can''t really give up the page altogether; if it hands it over to dom0, it needs to make a local copy first.> Problems at first glance: > 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback. > 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren''t all these hypercalls ultimately more expensive than memcpy for GNTABOP_transfer for netback? > 3. Managing the pool of backend reserved pages may be a problem? > > So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I''d like to hear other problems people may find with this approachIt''s not clear to me that its any improvement over just directly copying the data up front. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-17 18:00 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/16/2010 01:28 PM, Daniel Stodden wrote:>> What''s the problem? If you do nothing then it will appear to the kernel >> as a bunch of processes doing memory allocations, and they''ll get >> blocked/rate-limited accordingly if memory is getting short. > The problem is that just letting the page allocator work through > allocations isn''t going to scale anywhere. > > The worst case memory requested under load is <number-of-disks> * (32 * > 11 pages). As a (conservative) rule of thumb, N will be 200 or rather > better.Under what circumstances would you end up needing to allocate that many pages?> The number of I/O actually in-flight at any point, in contrast, is > derived from the queue/sg sizes of the physical device. For a simple > disk, that''s about a ring or two.Wouldn''t that be the worst case?>> There''s >> plenty of existing mechanisms to control that sort of thing (cgroups, >> etc) without adding anything new to the kernel. Or are you talking >> about something other than simple memory pressure? >> >> And there''s plenty of existing IPC mechanisms if you want them to >> explicitly coordinate with each other, but I''d tend to thing that''s >> premature unless you have something specific in mind. >> >>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. >>> Can''t find it now, what happened? Without, there''s presently still no >>> zero-copy. >> gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively >> new-ish) mmu notifier infrastructure which is intended to allow a device >> to sync an external MMU with usermode mappings. We''re not using it in >> precisely that way, but it allows us to wrangle grant mappings before >> the generic code tries to do normal pte ops on them. > The mmu notifiers were for safe teardown only. They are not sufficient > for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll > need to back those VMAs with page structs.The pages will have struct page, because they''re normal kernel pages which happen to be backed by mapped granted pages. Are you talking about the #ifdef CONFIG_XEN code in the middle of __get_user_pages()? Isn''t that just there to cope with the nested-IO-on-the-same-page problem that the current blktap architecture provokes? If there''s only a single IO on each page - the one initiated by usermode - then it shouldn''t be necessary, right?> Or bounce again (gulp, just > mentioning it). As with the blktap2 patches, note there is no difference > in the dom0 memory bill, it takes page frames.(And perhaps actual pages to substitute for the granted pages.)> I guess we''ve been meaning the same thing here, unless I''m > misunderstanding you. Any pfn does, and the balloon pagevec allocations > default to order 0 entries indeed. Sorry, you''re right, that''s not a > ''range''. With a pending re-xmit, the backend can find a couple (or all) > of the request frames have count>1. It can flip and abandon those as > normal memory. But it will need those lost memory slots back, straight > away or next time it''s running out of frames. As order-0 allocations.Right. GFP_KERNEL order 0 allocations are pretty reliable; they only fail if the system is under extreme memory pressure. And it has the nice property that if those allocations block or fail it rate limits IO ingress from domains rather than being crushed by memory pressure at the backend (ie, the problem with trying to allocate memory in the writeout path). Also the cgroup mechanism looks like an extremely powerful way to control the allocations for a process or group of processes to stop them from dominating the whole machine. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-17 19:27 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Wed, 2010-11-17 at 12:04 -0500, Ian Campbell wrote:> On Tue, 2010-11-16 at 21:28 +0000, Daniel Stodden wrote: > > > > > > Also, I was absolutely certain I once saw VM_FOREIGN support in > > gntdev.. > > > > Can''t find it now, what happened? Without, there''s presently still > > no > > > > zero-copy. > > > > > > gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively > > > new-ish) mmu notifier infrastructure which is intended to allow a > > device > > > to sync an external MMU with usermode mappings. We''re not using it > > in > > > precisely that way, but it allows us to wrangle grant mappings > > before > > > the generic code tries to do normal pte ops on them. > > > > The mmu notifiers were for safe teardown only. They are not sufficient > > for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll > > need to back those VMAs with page structs. Or bounce again (gulp, > > just mentioning it). As with the blktap2 patches, note there is no > > difference in the dom0 memory bill, it takes page frames. > > I though the VM_FOREIGN stuff which blktap[12] hooks into gup() was > there in order to avoid some deadlock arising from having a single > struct page taking part in nested block I/O. i.e. one I/O created by > blkback and the second from the tapdisk process deadlock against each > other.> I though that in order to work around this blktap creates a second range > of struct pages (the struct vm_foreign_map array in > vma->vm_private_data) which aliases the pages under I/O and then it > hooks gup() to do the switcheroo when necessary. > > If the blkback interface is running in userspace (whether in a tapdisk > or qemu process) then there is no nesting of I/O and the only I/O is > from process context and therefore this particular issue is no longer a > problem because we can use a properly struct page backed page without > needing a magic VM_FOREIGN shadow of it? > > Have I misunderstood something about the reason for VM_FOREIGN?VM_FOREIGN is a special case of grant mapping frames into a user''s VMA. This stuff is primarily there to make gup() grab a pagevec from the VMA struct, instead of relying on follow_page in order to map user-virtual to pseudophysical, which is what gup normally does. In brief, it''s hacking gup() to keep DIO working. In other words: Userland can''t I/O on some VM_PFNMAP or VM_IO or similar. If you ask for DMA, the kernel quite urgently wants this to look like normal memory. Your description of the aliasing is correct, but that''s yet another thing. It implies redoing the grantmap and p2m entries privately. To make this clearer: If the aliasing weren''t necessary, blktap2 would just have had to grab blkback''s prepared page struct from the request SG vector, and .mmap that to userland, but still with VM_FOREIGN and some pagevec pointer in the VMA. Instead, there''s blkback-pagemap, specifically to map that SG page entry *back* to the original gref from a table, and *redo* the entire gntmap + p2m thing another time, privately. Twisted, agreed. :) Cheers, Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andres Lagar-Cavilla
2010-Nov-17 19:47 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
So, swapping mfns for write requests is a definite no-no. One could still live with copying write buffers and swapping read buffers by the end of the request. That still yields some benefit. As for kernel mappings, I though a solution would be to provide the hypervisor with both pte pointers. After all pte pointers are already provided for mapping grants in user-space. But that''s a little too much to handle for the current interface. Thanks for the feedback Andres On Nov 17, 2010, at 12:52 PM, Jeremy Fitzhardinge wrote:> On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote: >> I''ll throw an idea there and you educate me why it''s lame. >> >> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn''s with fake page structs that make get_user_pages happy, quirky grant PTEs, etc. >> >> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out. >> >> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out: >> >> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them. >> >> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn''s is swapped with the domU mfn. Pfn''s and page struct''s on both ends remain untouched. > > Would GNTTABOP_swap also require the domU to have already unmapped the > page from its own pagetables? Presumably it would fail if it didn''t, > otherwise you''d end up with a domU mapping the same mfn as a > dom0-private page. > >> 3. For blkfront reads, call swap when stuffing the response back into the ring >> >> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don''t have a weirdo granted PTE, or work with a frame from other domain. It''s your page all along, dom0 >> >> 5. One assumption for domU is that pages allocated as blkfront buffers won''t be touched by anybody, so a) it''s safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle) >> >> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap''s. I.e. all these hypercalls won''t happen as often and at the granularity as skbuff''s demanded for GNTTABOP_transfer >> >> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it''s not really zero-copy. But the approach opens a window to async memcpy . From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don''t know about how practical that is though. > > I think that will be the common case - the kernel will always attempt to > write dirty pagecache pages to make clean ones, and it will still want > them around to access. So it can''t really give up the page altogether; > if it hands it over to dom0, it needs to make a local copy first. > >> Problems at first glance: >> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback. >> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren''t all these hypercalls ultimately more expensive than memcpy for GNTABOP_transfer for netback? >> 3. Managing the pool of backend reserved pages may be a problem? >> >> So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I''d like to hear other problems people may find with this approach > > It''s not clear to me that its any improvement over just directly copying > the data up front. > > J_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-17 20:21 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Wed, 2010-11-17 at 13:00 -0500, Jeremy Fitzhardinge wrote:> On 11/16/2010 01:28 PM, Daniel Stodden wrote: > >> What''s the problem? If you do nothing then it will appear to the kernel > >> as a bunch of processes doing memory allocations, and they''ll get > >> blocked/rate-limited accordingly if memory is getting short. > > The problem is that just letting the page allocator work through > > allocations isn''t going to scale anywhere. > > > > The worst case memory requested under load is <number-of-disks> * (32 * > > 11 pages). As a (conservative) rule of thumb, N will be 200 or rather > > better. > > Under what circumstances would you end up needing to allocate that many > pages?I don''t. Independently running tapdisks would do, on behalf of guests queuing I/O. That''s why one wouldn''t just let them run and allocate their own memory. The memory space set aside for I/O should be a shared resource.> > The number of I/O actually in-flight at any point, in contrast, is > > derived from the queue/sg sizes of the physical device. For a simple > > disk, that''s about a ring or two. > > Wouldn''t that be the worst case?Yes. It''s quite small. A 2 or 3 megs per physical backend are usually sufficient.> >> There''s > >> plenty of existing mechanisms to control that sort of thing (cgroups, > >> etc) without adding anything new to the kernel. Or are you talking > >> about something other than simple memory pressure? > >> > >> And there''s plenty of existing IPC mechanisms if you want them to > >> explicitly coordinate with each other, but I''d tend to thing that''s > >> premature unless you have something specific in mind. > >> > >>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev.. > >>> Can''t find it now, what happened? Without, there''s presently still no > >>> zero-copy. > >> gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively > >> new-ish) mmu notifier infrastructure which is intended to allow a device > >> to sync an external MMU with usermode mappings. We''re not using it in > >> precisely that way, but it allows us to wrangle grant mappings before > >> the generic code tries to do normal pte ops on them. > > The mmu notifiers were for safe teardown only. They are not sufficient > > for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll > > need to back those VMAs with page structs. > > The pages will have struct page, because they''re normal kernel pages > which happen to be backed by mapped granted pages.And, like all granted frames, not owning them implies they are not resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup() without the VM_FOREIGN hack. Correct me if I''m mistaken. I used to be quicker looking up stuff on arch-xen kernels, but I think fundamental constants of the Xen universe didn''t change since last time.> Are you talking > about the #ifdef CONFIG_XEN code in the middle of __get_user_pages()? > Isn''t that just there to cope with the nested-IO-on-the-same-page > problem that the current blktap architecture provokes? If there''s only > a single IO on each page - the one initiated by usermode - then it > shouldn''t be necessary, right?No. Jake brought the aliasing in specifically to get blktap2 working with zero-copy. VM_FOREIGN is much older. Only blktap2 does the recursive thing, because it''s a blkdev above some physical dev. Blktap1 went from the guest ring straight down to userland. As would be the case with a gntdev-based blkback.> > Or bounce again (gulp, just > > mentioning it). As with the blktap2 patches, note there is no difference > > in the dom0 memory bill, it takes page frames. > > (And perhaps actual pages to substitute for the granted pages.)Well yes, that''s right. Still fine as long there''s some relatively small constant boundary on the worst case. O(n) for large systems would go in the hundreds of megs. Given that the *reasonable* amount of memory used simultaneously is pretty small in any case, even going through the memory allocator can be skipped. [ Part of the reason why blktap *never* frees those pages, apart from being slightly greedy, are deadlock hazards when writing those nodes in dom0 through the pagecache, as dom0 might. You need memory pools on the datapath to guarantee progress under pressure. That got pretty ugly after 2.6.27, btw. ] In any case, let''s skip trying what happens if a thundering herd of several hundred userspace disks tries gfp()ing their grant slots out of dom0 without without arbitration.> > I guess we''ve been meaning the same thing here, unless I''m > > misunderstanding you. Any pfn does, and the balloon pagevec allocations > > default to order 0 entries indeed. Sorry, you''re right, that''s not a > > ''range''. With a pending re-xmit, the backend can find a couple (or all) > > of the request frames have count>1. It can flip and abandon those as > > normal memory. But it will need those lost memory slots back, straight > > away or next time it''s running out of frames. As order-0 allocations. > > Right. GFP_KERNEL order 0 allocations are pretty reliable; they only > fail if the system is under extreme memory pressure. And it has the > nice property that if those allocations block or fail it rate limits IO > ingress from domains rather than being crushed by memory pressure at the > backend (ie, the problem with trying to allocate memory in the writeout > path). > > Also the cgroup mechanism looks like an extremely powerful way to > control the allocations for a process or group of processes to stop them > from dominating the whole machine.Ah. In case it can be put to work to bind processes allocating pagecache entries for dirtying to some boundary, I''d be really interested. I think I came across it once but didn''t take the time to read the docs thoroughly. Can it? Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-17 21:02 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/17/2010 12:21 PM, Daniel Stodden wrote:> And, like all granted frames, not owning them implies they are not > resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup() > without the VM_FOREIGN hack.Hm, I see. Well, I wonder if using _PAGE_SPECIAL would help (it is put on usermode ptes which don''t have a backing struct page). After all, there''s no fundamental reason why it would need a pfn; the mfn in the pte is what''s actually needed to ultimately generate a DMA descriptor.> Correct me if I''m mistaken. I used to be quicker looking up stuff on > arch-xen kernels, but I think fundamental constants of the Xen universe > didn''t change since last time.No, but Linux has.> [ > Part of the reason why blktap *never* frees those pages, apart from > being slightly greedy, are deadlock hazards when writing those nodes in > dom0 through the pagecache, as dom0 might. You need memory pools on the > datapath to guarantee progress under pressure. That got pretty ugly > after 2.6.27, btw. > ]That''s what mempools are intended to solve.> In any case, let''s skip trying what happens if a thundering herd of > several hundred userspace disks tries gfp()ing their grant slots out of > dom0 without without arbitration.I''m not against arbitration, but I don''t think that''s something that should be implemented as part of a Xen driver.>>> I guess we''ve been meaning the same thing here, unless I''m >>> misunderstanding you. Any pfn does, and the balloon pagevec allocations >>> default to order 0 entries indeed. Sorry, you''re right, that''s not a >>> ''range''. With a pending re-xmit, the backend can find a couple (or all) >>> of the request frames have count>1. It can flip and abandon those as >>> normal memory. But it will need those lost memory slots back, straight >>> away or next time it''s running out of frames. As order-0 allocations. >> Right. GFP_KERNEL order 0 allocations are pretty reliable; they only >> fail if the system is under extreme memory pressure. And it has the >> nice property that if those allocations block or fail it rate limits IO >> ingress from domains rather than being crushed by memory pressure at the >> backend (ie, the problem with trying to allocate memory in the writeout >> path). >> >> Also the cgroup mechanism looks like an extremely powerful way to >> control the allocations for a process or group of processes to stop them >> from dominating the whole machine. > Ah. In case it can be put to work to bind processes allocating pagecache > entries for dirtying to some boundary, I''d be really interested. I think > I came across it once but didn''t take the time to read the docs > thoroughly. Can it?I''m not sure about dirtyness - it seems like something that should be within its remit, even if it doesn''t currently have it. The cgroup mechanism is extremely powerful, now that I look at it. You can do everything from setting block IO priorities and QoS parameters to CPU limits. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-17 21:57 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote:> On 11/17/2010 12:21 PM, Daniel Stodden wrote: > > And, like all granted frames, not owning them implies they are not > > resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup() > > without the VM_FOREIGN hack. > > Hm, I see. Well, I wonder if using _PAGE_SPECIAL would help (it is put > on usermode ptes which don''t have a backing struct page). After all, > there''s no fundamental reason why it would need a pfn; the mfn in the > pte is what''s actually needed to ultimately generate a DMA descriptor.The kernel needs the page structs at least for locking and refcounting. There''s also a some trickier stuff in there. Like redirtying disk-backed user memory after read completion, in case it''s been laundered. (So that an AIO on unpinned user memory doesn''t subsequently get flashed back when cycling through swap, if I understood that thing correctly.) Doesn''t apply for blktap (it''s all reserved pages). All I mean is: I wouldn''t exactly see some innocent little dio hack or so shape up in there. Kernel allowing to DMA into a bare pfnmap -- From the platform POV, I''d agree. E.g. there''s a concept of devices DMA-ing into arbitrary I/O memory space, not host memory, on some bus architectures. PCI would come to my mind (the old shared medium stuff, unsure about those newfangled P-t-P topologies). But not in Linux, so I presently don''t see anybody upstream bothering to make block-I/O request addressing more forgiving than it is. PAGE_SPECIAL -- to the kernel, that means the opposite: page structs which aren''t backed by ''real'' memory, so gup(), for example, is told to fail (how nasty). In contrast, VM_FOREIGN is non-memory backed by page structs.> > Correct me if I''m mistaken. I used to be quicker looking up stuff on > > arch-xen kernels, but I think fundamental constants of the Xen universe > > didn''t change since last time. > > No, but Linux has.Not in that respect. There''s certainly a way to get VM_FOREIGN out of the mainline code. It would involve an unlikely() branch in .pte_val(=xen_pte_val) to fall back into a private local m2p hash lookup. Assuming that kind of thing gets nowhere inlined. Not nice, but still more upstreamable than VM_FOREIGN.> > [ > > Part of the reason why blktap *never* frees those pages, apart from > > being slightly greedy, are deadlock hazards when writing those nodes in > > dom0 through the pagecache, as dom0 might. You need memory pools on the > > datapath to guarantee progress under pressure. That got pretty ugly > > after 2.6.27, btw. > > ] > > That''s what mempools are intended to solve.That''s why the blktap frame pool is now a mempool, indeed.> > In any case, let''s skip trying what happens if a thundering herd of > > several hundred userspace disks tries gfp()ing their grant slots out of > > dom0 without without arbitration. > > I''m not against arbitration, but I don''t think that''s something that > should be implemented as part of a Xen driver.Uhm, maybe I''m misunderstanding you, isn''t the whole thing a Xen driver? What do you suggest?> >>> I guess we''ve been meaning the same thing here, unless I''m > >>> misunderstanding you. Any pfn does, and the balloon pagevec allocations > >>> default to order 0 entries indeed. Sorry, you''re right, that''s not a > >>> ''range''. With a pending re-xmit, the backend can find a couple (or all) > >>> of the request frames have count>1. It can flip and abandon those as > >>> normal memory. But it will need those lost memory slots back, straight > >>> away or next time it''s running out of frames. As order-0 allocations. > >> Right. GFP_KERNEL order 0 allocations are pretty reliable; they only > >> fail if the system is under extreme memory pressure. And it has the > >> nice property that if those allocations block or fail it rate limits IO > >> ingress from domains rather than being crushed by memory pressure at the > >> backend (ie, the problem with trying to allocate memory in the writeout > >> path). > >> > >> Also the cgroup mechanism looks like an extremely powerful way to > >> control the allocations for a process or group of processes to stop them > >> from dominating the whole machine. > > Ah. In case it can be put to work to bind processes allocating pagecache > > entries for dirtying to some boundary, I''d be really interested. I think > > I came across it once but didn''t take the time to read the docs > > thoroughly. Can it? > > I''m not sure about dirtyness - it seems like something that should be > within its remit, even if it doesn''t currently have it. > > The cgroup mechanism is extremely powerful, now that I look at it. You > can do everything from setting block IO priorities and QoS parameters to > CPU limits.Thanks. I''ll keep it under my pillow then. Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-17 22:14 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/17/2010 01:57 PM, Daniel Stodden wrote:> On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote: >> On 11/17/2010 12:21 PM, Daniel Stodden wrote: >>> And, like all granted frames, not owning them implies they are not >>> resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup() >>> without the VM_FOREIGN hack. >> Hm, I see. Well, I wonder if using _PAGE_SPECIAL would help (it is put >> on usermode ptes which don''t have a backing struct page). After all, >> there''s no fundamental reason why it would need a pfn; the mfn in the >> pte is what''s actually needed to ultimately generate a DMA descriptor. > The kernel needs the page structs at least for locking and refcounting.Yeah.> There''s also a some trickier stuff in there. Like redirtying disk-backed > user memory after read completion, in case it''s been laundered. (So that > an AIO on unpinned user memory doesn''t subsequently get flashed back > when cycling through swap, if I understood that thing correctly.) > > Doesn''t apply for blktap (it''s all reserved pages). All I mean is: I > wouldn''t exactly see some innocent little dio hack or so shape up in > there. > > Kernel allowing to DMA into a bare pfnmap -- From the platform POV, I''d > agree. E.g. there''s a concept of devices DMA-ing into arbitrary I/O > memory space, not host memory, on some bus architectures. PCI would come > to my mind (the old shared medium stuff, unsure about those newfangled > P-t-P topologies). But not in Linux, so I presently don''t see anybody > upstream bothering to make block-I/O request addressing more forgiving > than it is. > > PAGE_SPECIAL -- to the kernel, that means the opposite: page structs > which aren''t backed by ''real'' memory, so gup(), for example, is told to > fail (how nasty).It''s pfns with no corresponding struct page - it''s the pte level equivalent of VM_PFNMAP at the VMA level. But you''re right that we can''t do without struct pages. So we''re back to needing a way of mapping from a random mfn to a pfn so we can find the corresponding struct page. I would be tempted to put a layer over m2p to allow local m2p mappings to override the global m2p table.> In contrast, VM_FOREIGN is non-memory backed by page > structs.Yep. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-17 23:32 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Wed, 2010-11-17 at 16:57 -0500, Daniel Stodden wrote:> > > In any case, let''s skip trying what happens if a thundering herd of > > > several hundred userspace disks tries gfp()ing their grant slots out of > > > dom0 without without arbitration. > > > > I''m not against arbitration, but I don''t think that''s something that > > should be implemented as part of a Xen driver. > > Uhm, maybe I''m misunderstanding you, isn''t the whole thing a Xen driver? > What do you suggest?Just misread you, sorry. You mean arbitration via IPC. Somewhat, but ugly. Just counting pages with cmpxchg in some shm segment isnt''t a big deal, agreed. But userspace fixing the counter after detecting some process crash would already start to complicate that. Next, someone has to do the ballooning, and you need gntmap to understand the VMA type anyway. From there on, going the rest of the way and let the kernel driver round-robin the frame pool altogether is much smaller and cleaner. Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-17 23:42 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Wed, 2010-11-17 at 11:36 -0500, Andres Lagar-Cavilla wrote:> I''ll throw an idea there and you educate me why it''s lame. > > Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn''s with fake page structs that make get_user_pages happy, quirky grant PTEs, etc. > > Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out. > > The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out: > > 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them. > > 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn''s is swapped with the domU mfn. Pfn''s and page struct''s on both ends remain untouched. > > 3. For blkfront reads, call swap when stuffing the response back into the ring > > 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don''t have a weirdo granted PTE, or work with a frame from other domain. It''s your page all along, dom0 > > 5. One assumption for domU is that pages allocated as blkfront buffers won''t be touched by anybody, so a) it''s safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle) > > 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap''s. I.e. all these hypercalls won''t happen as often and at the granularity as skbuff''s demanded for GNTTABOP_transfer > > 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it''s not really zero-copy. But the approach opens a window to async memcpy . From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don''t know about how practical that is though. > > Problems at first glance: > 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback. > 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren''t all these hypercalls ultimately more expensive than memcpy for GNTABOP_transfer for netback? > 3. Managing the pool of backend reserved pages may be a problem?I guess GNT_transfer for network I/O died because of the double-ended TLB fallout? Still liked the general direction, nice shot. Cheers, Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Nov-18 02:29 UTC
[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On 11/17/2010 04:41 PM, Daniel Stodden wrote:> On Wed, 2010-11-17 at 18:49 -0500, Jeremy Fitzhardinge wrote: >> On 11/17/2010 03:06 PM, Daniel Stodden wrote: >>>> So we''re back to needing a way of mapping from a random mfn to a pfn so >>>> we can find the corresponding struct page. I would be tempted to put a >>>> layer over m2p to allow local m2p mappings to override the global m2p table. >>> I think a local m2p lookup on a slow path is a superior option, iff you >>> do think it''s doable. Without e.g. risking to bloat some inline stuff, I >>> mean. >>> >>> Where do you think in the call stack down into pvops code would the >>> lookup go? Asking because I''d expect the kernel to potentially learn >>> more tricks with that. >> I don''t think m2p is all that performance critical. p2m is used way more. >> >> p2m is already a radix tree; > Yes, but pfns are dense plug holes, aren''t they?Yes.>> I think m2p could be done somewhat >> similarly, where undefined entries fall through to the global m2p. The >> main problem is probably making sure the memory allocation for new m2p >> entries can be done in a blocking context, so we don''t have to rely on >> GFP_ATOMIC. > Whatever the index is going to be, all the backends I''m aware of run > their rings on a kthread. Sounds to me like GFP_WAIT followed by an > rwlock is perfectly sufficient. Only the reversal commonly ends up in > interrupt context. > >> That particular m2p lookup would be in xen_pte_val(), but I think that''s >> the callsite for pretty much all m2p lookups. >> >>> A mfn->gref mapping would obsolete blkback-pagemap. Well, iff the >>> kernel-blktap zerocopy stuff wants to be resurrected. >>> >>> It would also be a cheap way to implement current blktap to do >>> virt->gref lookups for tapdisks. Some tapdisk filters want this, present >>> major example is the memshr patch, and it''s sort of nicer than a ring >>> message hack. >>> >>> Wouldn''t storing the handle allow unmapping grant ptes on the normal >>> user PT teardown path? I think we always was this .zap_pte vma-operation >>> in blktap, iirc. MMU notifier replacement? Maybe not a good one. >> I think mmu notifiers are fine; this is exactly what they''re for after >> all. Very few address spaces have granted pages mapped into them, so >> keeping the normal pte operations as fast as possible and using more >> expensive notifiers for the afflicted mms seems like the right tradeoff. >> >> Hm. Before Gerd highlighted mmu notifiers as the right way to deal with >> granted pages in gntdev, I had the idea of allocating a shadow page for >> each pte page and storing grant refs there, where the shadow is hung of >> the pte page''s struct page, where set_pte could use it to do the special >> thing if needed. >> >> I wonder if we could do something similar here, where we store the pfn >> for a given pte in the shadow? But how would it get used? There''s no >> "read pte" pvop, and pte_val gets the pte value, not its address, so >> there''d be no way to find the shadow. So I guess that doesn''t work. > I''m not sure about the radix variant. All the backends do order-0 > allocations, as discussed above. Depending on the driver pair behavior, > the mfn ranges can get arbitrarily sparse. The real M2P makes completely > different assumptions in density and size, or not?Well, there''s two use-cases for the local m2p idea. One is for granted pages, which are going to be all over the place, but the other is for hardware mfns, which are more likely to be densely clustered. A radix tree for grant mfns is likely to be pretty inefficient - but the worst case is one radix page per mfn, which isn''t too bad, since we''re not expecting that many granted mfns to be around. But perhaps a hash or rbtree would be a better fit. Or we could insist on making the mfns contiguous.> Well, maybe one could shadow (cow, actually) just that? Saves the index. > Likely a dumb idea. :)I guess Xen won''t let us map over the m2p, but maybe we could alias it. But that''s going to waste lots of address space in a 32b dom0.> What numbers of grant refs do we run? I remember times when the tables > were bumped up massively, for stuff like pvfb, a long time ago. I guess > rest remained rather conservative.We only really need to worry about mfns which are actually going to be mapped into userspace and guped. We could even propose a (new?) mmu notifier to prepare for gup so that it can be deferred as late as possible.> The shadow idea sounds pretty cool, mainly because the vaddr spaces are > often contiguous. At least for userland pages. > > Thing which would bother me is that settling on a single shadow page > already limits map entries to sizeof(pte), so ideas like additionally > mapping to grefs/handles/flag already go out of the window. Or grow > bigger leaf, but on average that''s also more space overhead.At least we can always fit a pointer into sizeof(pte_t), so there''s some scope for having more storage if needed. But I don''t see how it can help for gup... J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2010-Nov-18 13:56 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Wed, 2010-11-17 at 19:27 +0000, Daniel Stodden wrote:> On Wed, 2010-11-17 at 12:04 -0500, Ian Campbell wrote: > > On Tue, 2010-11-16 at 21:28 +0000, Daniel Stodden wrote: > > > > > > > > Also, I was absolutely certain I once saw VM_FOREIGN support in > > > gntdev.. > > > > > Can''t find it now, what happened? Without, there''s presently still > > > no > > > > > zero-copy. > > > > > > > > gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively > > > > new-ish) mmu notifier infrastructure which is intended to allow a > > > device > > > > to sync an external MMU with usermode mappings. We''re not using it > > > in > > > > precisely that way, but it allows us to wrangle grant mappings > > > before > > > > the generic code tries to do normal pte ops on them. > > > > > > The mmu notifiers were for safe teardown only. They are not sufficient > > > for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll > > > need to back those VMAs with page structs. Or bounce again (gulp, > > > just mentioning it). As with the blktap2 patches, note there is no > > > difference in the dom0 memory bill, it takes page frames. > > > > I though the VM_FOREIGN stuff which blktap[12] hooks into gup() was > > there in order to avoid some deadlock arising from having a single > > struct page taking part in nested block I/O. i.e. one I/O created by > > blkback and the second from the tapdisk process deadlock against each > > other. > > > > > I though that in order to work around this blktap creates a second range > > of struct pages (the struct vm_foreign_map array in > > vma->vm_private_data) which aliases the pages under I/O and then it > > hooks gup() to do the switcheroo when necessary. > > > > If the blkback interface is running in userspace (whether in a tapdisk > > or qemu process) then there is no nesting of I/O and the only I/O is > > from process context and therefore this particular issue is no longer a > > problem because we can use a properly struct page backed page without > > needing a magic VM_FOREIGN shadow of it? > > > > Have I misunderstood something about the reason for VM_FOREIGN? > > VM_FOREIGN is a special case of grant mapping frames into a user''s VMA. > > This stuff is primarily there to make gup() grab a pagevec from the VMA > struct, instead of relying on follow_page in order to map user-virtual > to pseudophysical, which is what gup normally does. > > In brief, it''s hacking gup() to keep DIO working. > > In other words: Userland can''t I/O on some VM_PFNMAP or VM_IO or > similar. If you ask for DMA, the kernel quite urgently wants this to > look like normal memory. > > Your description of the aliasing is correct, but that''s yet another > thing. It implies redoing the grantmap and p2m entries privately. > > To make this clearer: If the aliasing weren''t necessary, blktap2 would > just have had to grab blkback''s prepared page struct from the request SG > vector, and .mmap that to userland, but still with VM_FOREIGN and some > pagevec pointer in the VMA. > > Instead, there''s blkback-pagemap, specifically to map that SG page entry > *back* to the original gref from a table, and *redo* the entire gntmap + > p2m thing another time, privately.In the userland blkback case do you need to redo the mapping? Isn''t the original mapping (including pte update) of the granted mfn into the struct page associated with the user address sufficient? Is the reason gup() doesn''t work is that it tries to go from a pte_entry to a struct page (via a pfn, e.g. in vm_normal_page) which involves a m2p translation of the pte value, and this is not valid for a foreign page (since the m2p entry is for the remote domain)? Or is the issue that we try and dynamically fault in those addresses and that doesn''t work because you need a special set_pte variant (aka the gnttab map hypercall) to map a granted page? Maybe both...> > Twisted, agreed. :) > > Cheers, > Daniel > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2010-Nov-18 19:37 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Thu, 2010-11-18 at 08:56 -0500, Ian Campbell wrote:> On Wed, 2010-11-17 at 19:27 +0000, Daniel Stodden wrote: > > On Wed, 2010-11-17 at 12:04 -0500, Ian Campbell wrote: > > > On Tue, 2010-11-16 at 21:28 +0000, Daniel Stodden wrote: > > > > > > > > > > Also, I was absolutely certain I once saw VM_FOREIGN support in > > > > gntdev.. > > > > > > Can''t find it now, what happened? Without, there''s presently still > > > > no > > > > > > zero-copy. > > > > > > > > > > gntdev doesn''t need VM_FOREIGN any more - it uses the (relatively > > > > > new-ish) mmu notifier infrastructure which is intended to allow a > > > > device > > > > > to sync an external MMU with usermode mappings. We''re not using it > > > > in > > > > > precisely that way, but it allows us to wrangle grant mappings > > > > before > > > > > the generic code tries to do normal pte ops on them. > > > > > > > > The mmu notifiers were for safe teardown only. They are not sufficient > > > > for DIO, which wants gup() to work. If you want zcopy on gntdev, we''ll > > > > need to back those VMAs with page structs. Or bounce again (gulp, > > > > just mentioning it). As with the blktap2 patches, note there is no > > > > difference in the dom0 memory bill, it takes page frames. > > > > > > I though the VM_FOREIGN stuff which blktap[12] hooks into gup() was > > > there in order to avoid some deadlock arising from having a single > > > struct page taking part in nested block I/O. i.e. one I/O created by > > > blkback and the second from the tapdisk process deadlock against each > > > other. > > > > > > > > > I though that in order to work around this blktap creates a second range > > > of struct pages (the struct vm_foreign_map array in > > > vma->vm_private_data) which aliases the pages under I/O and then it > > > hooks gup() to do the switcheroo when necessary. > > > > > > If the blkback interface is running in userspace (whether in a tapdisk > > > or qemu process) then there is no nesting of I/O and the only I/O is > > > from process context and therefore this particular issue is no longer a > > > problem because we can use a properly struct page backed page without > > > needing a magic VM_FOREIGN shadow of it? > > > > > > Have I misunderstood something about the reason for VM_FOREIGN? > > > > VM_FOREIGN is a special case of grant mapping frames into a user''s VMA. > > > > This stuff is primarily there to make gup() grab a pagevec from the VMA > > struct, instead of relying on follow_page in order to map user-virtual > > to pseudophysical, which is what gup normally does. > > > > In brief, it''s hacking gup() to keep DIO working. > > > > In other words: Userland can''t I/O on some VM_PFNMAP or VM_IO or > > similar. If you ask for DMA, the kernel quite urgently wants this to > > look like normal memory. > > > > Your description of the aliasing is correct, but that''s yet another > > thing. It implies redoing the grantmap and p2m entries privately. > > > > To make this clearer: If the aliasing weren''t necessary, blktap2 would > > just have had to grab blkback''s prepared page struct from the request SG > > vector, and .mmap that to userland, but still with VM_FOREIGN and some > > pagevec pointer in the VMA. > > > > Instead, there''s blkback-pagemap, specifically to map that SG page entry > > *back* to the original gref from a table, and *redo* the entire gntmap + > > p2m thing another time, privately. > > In the userland blkback case do you need to redo the mapping? Isn''t the > original mapping (including pte update) of the granted mfn into the > struct page associated with the user address sufficient?It''s completely sufficient. Running the sring in tapdisk straight away implies mapping once and for all. The present aliasing happens because of the two arrows in the following request submission chain: blkback -> tapdev -> disk. Each a designating I/O submissing to a blk request queue. It''s just for the stacking.> Is the reason gup() doesn''t work is that it tries to go from a pte_entry > to a struct page (via a pfn, e.g. in vm_normal_page) which involves a > m2p translation of the pte value, and this is not valid for a foreign > page (since the m2p entry is for the remote domain)?Exactly. So either it''s VM_FOREIGN, or we maybe go for a somewhat cleaner solution by teaching mfn_to_pfn new tricks, such as going through a private mfn->gntpfn lookup on the m2p failure path. See the related thread with Jeremy. I guess you''d have additional thoughts on that.> Or is the issue that we try and dynamically fault in those addresses and > that doesn''t work because you need a special set_pte variant (aka the > gnttab map hypercall) to map a granted page?Not a real issue. Blktap presently vm_insert()s, but demand paging makes no difference: gup() completely relies on follow_page. Means that a pte miss falls through to handle_mm_fault, then retries. Thx, Daniel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2010-Nov-19 10:57 UTC
Re: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
On Thu, 2010-11-18 at 19:37 +0000, Daniel Stodden wrote:> On Thu, 2010-11-18 at 08:56 -0500, Ian Campbell wrote: > > On Wed, 2010-11-17 at 19:27 +0000, Daniel Stodden wrote: > > > On Wed, 2010-11-17 at 12:04 -0500, Ian Campbell wrote: > > > > On Tue, 2010-11-16 at 21:28 +0000, Daniel Stodden wrote:> > > Instead, there''s blkback-pagemap, specifically to map that SG page entry > > > *back* to the original gref from a table, and *redo* the entire gntmap + > > > p2m thing another time, privately. > > > > In the userland blkback case do you need to redo the mapping? Isn''t the > > original mapping (including pte update) of the granted mfn into the > > struct page associated with the user address sufficient? > > It''s completely sufficient. Running the sring in tapdisk straight away > implies mapping once and for all. The present aliasing happens because > of the two arrows in the following request submission chain: blkback -> > tapdev -> disk. Each a designating I/O submissing to a blk request > queue. It''s just for the stacking.Great, just wanted to check I''d understood correctly.> > Is the reason gup() doesn''t work is that it tries to go from a pte_entry > > to a struct page (via a pfn, e.g. in vm_normal_page) which involves a > > m2p translation of the pte value, and this is not valid for a foreign > > page (since the m2p entry is for the remote domain)? > > Exactly. So either it''s VM_FOREIGN, or we maybe go for a somewhat > cleaner solution by teaching mfn_to_pfn new tricks, such as going > through a private mfn->gntpfn lookup on the m2p failure path. See the > related thread with Jeremy. I guess you''d have additional thoughts on > that.I saw the thread but don''t have any particular thoughts, it seems like an eminently plausible idea... Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel