This patchset reimplements snapshot deletion with the help of the readahead framework, for which callbacks are added to the framework. The main idea is to traverse many snapshots at once and to read many branches at once. This way readahead gets many requests at once (currently about 50000), giving it the chance to order disk accesses properly. On a single disk, the effect is currently spoiled by sync operations that still take place, mainly checksum deletion. The biggest benefit comes with multiple devices, as all devices can be fully utilized; it scales quite well with the number of devices. For more details see the commit messages of the individual patches and the source code comments.

How it was tested: I created a test volume using David Sterba's stress-subvol-git-aging.sh. It checks out random versions of the kernel git tree, creating a snapshot of it from time to time, checking out other versions there, and so on. In the end the fs had 80 subvolumes with various degrees of sharing between them. The following tests were conducted on it:

 - delete a subvolume using droptree and check the fs with btrfsck afterwards for consistency
 - delete all subvolumes and verify with btrfs-debug-tree that the extent allocation tree is clean
 - delete 70 subvolumes and, in parallel, empty the other 10 with rm -rf to put good pressure on locking
 - add various degrees of memory pressure to the previous test to make pages expire early from the page cache
 - enable all relevant kernel debugging options during all tests

The performance gain on a single drive was about 20%, on 8 drives about 600%. It depends heavily on the maximum parallelism of the readahead, which is currently hardcoded to about 50000. This number is subject to two factors, the available RAM and the size of the state that has to be saved for a commit. As the full state has to be saved on commit, a large parallelism leads to a large state.

Building on this, I'll see if I can add delayed checksum deletions and run the delayed refs via readahead, to gain a maximum ordering of I/O ops.

This patchset is also available at
git://git.kernel.org/pub/scm/linux/kernel/git/arne/linux-btrfs.git droptree

Arne Jansen (5):
  btrfs: extend readahead interface
  btrfs: add droptree inode
  btrfs: droptree structures and initialization
  btrfs: droptree implementation
  btrfs: use droptree for snapshot deletion

 fs/btrfs/Makefile           |    2 +-
 fs/btrfs/btrfs_inode.h      |    4 +
 fs/btrfs/ctree.h            |   78 ++-
 fs/btrfs/disk-io.c          |   19 +
 fs/btrfs/droptree.c         | 1916 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-cache.c |  131 +++-
 fs/btrfs/free-space-cache.h |   32 +
 fs/btrfs/inode.c            |    3 +-
 fs/btrfs/reada.c            |  494 +++++++++---
 fs/btrfs/scrub.c            |   29 +-
 fs/btrfs/transaction.c      |   35 +-
 11 files changed, 2592 insertions(+), 151 deletions(-)
 create mode 100644 fs/btrfs/droptree.c

--
1.7.3.4
This extends the readahead interface with callbacks. The old readahead behaviour is now moved into a callback that is used by default if no other callback is given. For a detailed description of the callbacks see the inline comments in reada.c. It also fixes some cases where the hook has not been called. This is not a problem with the default callback, as it just cut some branches from readahead. With the callback mechanism, we want a guaranteed delivery. This patch also makes readaheads hierarchical. A readahead can have sub-readaheads. The idea is that the content of one tree can trigger readaheads to other trees. Also added is a function to cancel all outstanding requests for a given readahead and all its sub-readas. As the interface changes slightly, scrub has been edited to reflect the changes. Signed-off-by: Arne Jansen <sensille@gmx.net> --- fs/btrfs/ctree.h | 37 ++++- fs/btrfs/reada.c | 481 ++++++++++++++++++++++++++++++++++++++++++------------ fs/btrfs/scrub.c | 29 ++-- 3 files changed, 420 insertions(+), 127 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8e4457e..52b8a91 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3020,6 +3020,13 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 devid, struct btrfs_scrub_progress *progress); /* reada.c */ +#undef READA_DEBUG +struct reada_extctl; +struct reada_control; +typedef void (*reada_cb_t)(struct btrfs_root *root, struct reada_control *rc, + u64 wanted_generation, struct extent_buffer *eb, + u64 start, int err, struct btrfs_key *top, + void *ctx); struct reada_control { struct btrfs_root *root; /* tree to prefetch */ struct btrfs_key key_start; @@ -3027,12 +3034,34 @@ struct reada_control { atomic_t elems; struct kref refcnt; wait_queue_head_t wait; + struct reada_control *parent; + reada_cb_t callback; +#ifdef READA_DEBUG + int not_first; +#endif }; -struct reada_control *btrfs_reada_add(struct btrfs_root *root, - struct btrfs_key *start, struct btrfs_key *end); -int btrfs_reada_wait(void *handle); +struct reada_control *btrfs_reada_alloc(struct reada_control *parent, + struct btrfs_root *root, + struct btrfs_key *key_start, struct btrfs_key *key_end, + reada_cb_t callback); +int btrfs_reada_add(struct reada_control *parent, + struct btrfs_root *root, + struct btrfs_key *key_start, struct btrfs_key *key_end, + reada_cb_t callback, void *ctx, + struct reada_control **rcp); +int btrfs_reada_wait(struct reada_control *handle); void btrfs_reada_detach(void *handle); int btree_readahead_hook(struct btrfs_root *root, struct extent_buffer *eb, u64 start, int err); - +int reada_add_block(struct reada_control *rc, u64 logical, + struct btrfs_key *top, int level, u64 generation, void *ctx); +void reada_control_elem_get(struct reada_control *rc); +void reada_control_elem_put(struct reada_control *rc); +void reada_start_machine(struct btrfs_fs_info *fs_info); +int btrfs_reada_abort(struct btrfs_fs_info *fs_info, struct reada_control *rc); + +/* droptree.c */ +int btrfs_droptree_pause(struct btrfs_fs_info *fs_info); +void btrfs_droptree_continue(struct btrfs_fs_info *fs_info); +void droptree_drop_list(struct btrfs_fs_info *fs_info, struct list_head *list); #endif diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c index 2373b39..0d88163 100644 --- a/fs/btrfs/reada.c +++ b/fs/btrfs/reada.c @@ -27,18 +27,18 @@ #include "volumes.h" #include "disk-io.h" #include "transaction.h" - -#undef DEBUG +#include "locking.h" /* * This is the implementation for the generic read ahead framework. 
* * To trigger a readahead, btrfs_reada_add must be called. It will start - * a read ahead for the given range [start, end) on tree root. The returned + * a readahead for the given range [start, end) on tree root. The returned * handle can either be used to wait on the readahead to finish * (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach). + * If no return pointer is given, the readahead is started in the background. * - * The read ahead works as follows: + * The readahead works as follows: * On btrfs_reada_add, the root of the tree is inserted into a radix_tree. * reada_start_machine will then search for extents to prefetch and trigger * some reads. When a read finishes for a node, all contained node/leaf @@ -52,6 +52,27 @@ * Any number of readaheads can be started in parallel. The read order will be * determined globally, i.e. 2 parallel readaheads will normally finish faster * than the 2 started one after another. + * + * In addition to the default behaviour, a callback can be passed to reada_add. + * This callback will be called for each completed read, in an unspecified + * order. This callback can then enqueue further reada requests via + * reada_add_block or create sub-readaheads with btrfs_reada_add (detached). + * The rules for custom callbacks are: + * - The elem count must never go to zero unless the reada is completed. So + * either enqueue further blocks or create sub-readaheads with itself as + * parent. Each sub-readahead will add one to the parent''s element count. + * If you need to defer some work, keep the count from dropping to zero + * by calling reada_control_elem_get(). When finished, return it with + * reada_control_elem_put(). This might also free the rc. + * - The extent buffer passed to the callback will be read locked, spinning. + * - The callback is called in the context of the checksum workers + * - The callback is also called if the read failed. This is signaled via + * the err parameter. In this case the eb might be NULL. Make sure to + * properly update your data structures even in error cases to not leave + * refs anywhere. + * + * If no callback is given, the default callback is used giving the initially + * described behaviour. */ #define MAX_MIRRORS 2 @@ -60,6 +81,7 @@ struct reada_extctl { struct list_head list; struct reada_control *rc; + void *ctx; u64 generation; }; @@ -97,30 +119,87 @@ struct reada_machine_work { static void reada_extent_put(struct btrfs_fs_info *, struct reada_extent *); static void reada_control_release(struct kref *kref); static void reada_zone_release(struct kref *kref); -static void reada_start_machine(struct btrfs_fs_info *fs_info); static void __reada_start_machine(struct btrfs_fs_info *fs_info); -static int reada_add_block(struct reada_control *rc, u64 logical, - struct btrfs_key *top, int level, u64 generation); +/* + * this is the default callback for readahead. It just descends into the + * tree within the range given at creation. if an error occurs, just cut + * this part of the tree + */ +static void readahead_descend(struct btrfs_root *root, struct reada_control *rc, + u64 wanted_generation, struct extent_buffer *eb, + u64 start, int err, struct btrfs_key *top, + void *ctx) +{ + int nritems; + u64 generation; + int level; + int i; + + BUG_ON(err == -EAGAIN); /* FIXME: not yet implemented, don''t cancel + * readahead with default callback */ + + if (err || eb == NULL) { + /* + * this is the error case, the extent buffer has not been + * read correctly. 
We won''t access anything from it and + * just cleanup our data structures. Effectively this will + * cut the branch below this node from read ahead. + */ + return; + } + + level = btrfs_header_level(eb); + if (level == 0) { + /* + * if this is a leaf, ignore the content. + */ + return; + } + + nritems = btrfs_header_nritems(eb); + generation = btrfs_header_generation(eb); + + /* + * if the generation doesn''t match, just ignore this node. + * This will cut off a branch from prefetch. Alternatively one could + * start a new (sub-) prefetch for this branch, starting again from + * root. + */ + if (wanted_generation != generation) + return; + + for (i = 0; i < nritems; i++) { + u64 n_gen; + struct btrfs_key key; + struct btrfs_key next_key; + u64 bytenr; + + btrfs_node_key_to_cpu(eb, &key, i); + if (i + 1 < nritems) + btrfs_node_key_to_cpu(eb, &next_key, i + 1); + else + next_key = *top; + bytenr = btrfs_node_blockptr(eb, i); + n_gen = btrfs_node_ptr_generation(eb, i); + + if (btrfs_comp_cpu_keys(&key, &rc->key_end) < 0 && + btrfs_comp_cpu_keys(&next_key, &rc->key_start) > 0) + reada_add_block(rc, bytenr, &next_key, + level - 1, n_gen, ctx); + } +} -/* recurses */ /* in case of err, eb might be NULL */ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb, u64 start, int err) { - int level = 0; - int nritems; - int i; - u64 bytenr; - u64 generation; struct reada_extent *re; struct btrfs_fs_info *fs_info = root->fs_info; struct list_head list; unsigned long index = start >> PAGE_CACHE_SHIFT; struct btrfs_device *for_dev; - - if (eb) - level = btrfs_header_level(eb); + struct reada_extctl *rec; /* find extent */ spin_lock(&fs_info->reada_lock); @@ -142,65 +221,21 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb, re->scheduled_for = NULL; spin_unlock(&re->lock); - if (err == 0) { - nritems = level ? btrfs_header_nritems(eb) : 0; - generation = btrfs_header_generation(eb); - /* - * FIXME: currently we just set nritems to 0 if this is a leaf, - * effectively ignoring the content. In a next step we could - * trigger more readahead depending from the content, e.g. - * fetch the checksums for the extents in the leaf. - */ - } else { + /* + * call hooks for all registered readaheads + */ + list_for_each_entry(rec, &list, list) { + btrfs_tree_read_lock(eb); /* - * this is the error case, the extent buffer has not been - * read correctly. We won''t access anything from it and - * just cleanup our data structures. Effectively this will - * cut the branch below this node from read ahead. + * we set the lock to blocking, as the callback might want to + * sleep on allocations. */ - nritems = 0; - generation = 0; + btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); + rec->rc->callback(root, rec->rc, rec->generation, eb, start, + err, &re->top, rec->ctx); + btrfs_tree_read_unlock_blocking(eb); } - for (i = 0; i < nritems; i++) { - struct reada_extctl *rec; - u64 n_gen; - struct btrfs_key key; - struct btrfs_key next_key; - - btrfs_node_key_to_cpu(eb, &key, i); - if (i + 1 < nritems) - btrfs_node_key_to_cpu(eb, &next_key, i + 1); - else - next_key = re->top; - bytenr = btrfs_node_blockptr(eb, i); - n_gen = btrfs_node_ptr_generation(eb, i); - - list_for_each_entry(rec, &list, list) { - struct reada_control *rc = rec->rc; - - /* - * if the generation doesn''t match, just ignore this - * extctl. This will probably cut off a branch from - * prefetch. Alternatively one could start a new (sub-) - * prefetch for this branch, starting again from root. 
- * FIXME: move the generation check out of this loop - */ -#ifdef DEBUG - if (rec->generation != generation) { - printk(KERN_DEBUG "generation mismatch for " - "(%llu,%d,%llu) %llu != %llu\n", - key.objectid, key.type, key.offset, - rec->generation, generation); - } -#endif - if (rec->generation == generation && - btrfs_comp_cpu_keys(&key, &rc->key_end) < 0 && - btrfs_comp_cpu_keys(&next_key, &rc->key_start) > 0) - reada_add_block(rc, bytenr, &next_key, - level - 1, n_gen); - } - } /* * free extctl records */ @@ -213,12 +248,7 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb, rc = rec->rc; kfree(rec); - kref_get(&rc->refcnt); - if (atomic_dec_and_test(&rc->elems)) { - kref_put(&rc->refcnt, reada_control_release); - wake_up(&rc->wait); - } - kref_put(&rc->refcnt, reada_control_release); + reada_control_elem_put(rc); reada_extent_put(fs_info, re); /* one ref for each entry */ } @@ -352,7 +382,8 @@ again: blocksize = btrfs_level_size(root, level); re->logical = logical; re->blocksize = blocksize; - re->top = *top; + if (top) + re->top = *top; INIT_LIST_HEAD(&re->extctl); spin_lock_init(&re->lock); kref_init(&re->refcnt); @@ -503,6 +534,47 @@ static void reada_extent_put(struct btrfs_fs_info *fs_info, kfree(re); } +void reada_control_elem_get(struct reada_control *rc) +{ +#ifndef READA_DEBUG + atomic_inc(&rc->elems); +#else + int new = atomic_inc_return(&rc->elems); + + if (rc->not_first && new == 1) { + /* + * warn if we try to get an elem although it + * was already down to zero + */ + WARN_ON(1); + } + rc->not_first = 1; +#endif +} + +void reada_control_elem_put(struct reada_control *rc) +{ + struct reada_control *next_rc; + + do { + next_rc = NULL; + kref_get(&rc->refcnt); + if (atomic_dec_and_test(&rc->elems)) { + /* + * when the last elem is finished, wake all + * waiters. Also, if we have a parent, remove + * our element from there and wake the waiters. + * Walk up the chain of parents as long as + * we finish the last elem. Drop our ref. + */ + kref_put(&rc->refcnt, reada_control_release); + wake_up(&rc->wait); + next_rc = rc->parent; + } + kref_put(&rc->refcnt, reada_control_release); + } while ((rc = next_rc)); +} + static void reada_zone_release(struct kref *kref) { struct reada_zone *zone = container_of(kref, struct reada_zone, refcnt); @@ -521,12 +593,87 @@ static void reada_control_release(struct kref *kref) kfree(rc); } -static int reada_add_block(struct reada_control *rc, u64 logical, - struct btrfs_key *top, int level, u64 generation) +/* + * context to pass from reada_add_block to worker in case the extent is + * already uptodate in memory + */ +struct reada_uptodate_ctx { + struct btrfs_key top; + struct extent_buffer *eb; + struct reada_control *rc; + u64 logical; + u64 generation; + void *ctx; + struct btrfs_work work; +}; + +/* worker for immediate processing of uptodate blocks */ +static void reada_add_block_uptodate(struct btrfs_work *work) +{ + struct reada_uptodate_ctx *ruc; + + ruc = container_of(work, struct reada_uptodate_ctx, work); + + btrfs_tree_read_lock(ruc->eb); + /* + * we set the lock to blocking, as the callback might want to sleep + * on allocations. 
+ */ + btrfs_set_lock_blocking_rw(ruc->eb, BTRFS_READ_LOCK); + ruc->rc->callback(ruc->rc->root, ruc->rc, ruc->generation, ruc->eb, + ruc->logical, 0, &ruc->top, ruc->ctx); + btrfs_tree_read_unlock_blocking(ruc->eb); + + reada_control_elem_put(ruc->rc); + free_extent_buffer(ruc->eb); + kfree(ruc); +} + +int reada_add_block(struct reada_control *rc, u64 logical, + struct btrfs_key *top, int level, u64 generation, + void *ctx) { struct btrfs_root *root = rc->root; struct reada_extent *re; struct reada_extctl *rec; + struct extent_buffer *eb; + struct inode *btree_inode; + + /* + * first check if the buffer is already uptodate in memory. In this + * case it wouldn''t make much sense to go through the reada dance. + * Instead process it as soon as possible, but in worker context to + * prevent recursion. + */ + eb = btrfs_find_tree_block(root, logical, + btrfs_level_size(root, level)); + btree_inode = eb->first_page->mapping->host; + + if (eb && btrfs_buffer_uptodate(eb, generation)) { + struct reada_uptodate_ctx *ruc; + + ruc = kzalloc(sizeof(*ruc), GFP_NOFS); + if (!ruc) { + free_extent_buffer(eb); + return -1; + } + ruc->rc = rc; + ruc->ctx = ctx; + ruc->generation = generation; + ruc->logical = logical; + ruc->eb = eb; + if (top) + ruc->top = *top; + ruc->work.func = reada_add_block_uptodate; + reada_control_elem_get(rc); + + btrfs_queue_worker(&root->fs_info->readahead_workers, + &ruc->work); + + return 0; + } + if (eb) + free_extent_buffer(eb); re = reada_find_extent(root, logical, top, level); /* takes one ref */ if (!re) @@ -539,14 +686,17 @@ static int reada_add_block(struct reada_control *rc, u64 logical, } rec->rc = rc; + rec->ctx = ctx; rec->generation = generation; - atomic_inc(&rc->elems); + reada_control_elem_get(rc); spin_lock(&re->lock); list_add_tail(&rec->list, &re->extctl); spin_unlock(&re->lock); - /* leave the ref on the extent */ + reada_start_machine(root->fs_info); + + /* leave the ref on re */ return 0; } @@ -750,10 +900,14 @@ static void __reada_start_machine(struct btrfs_fs_info *fs_info) reada_start_machine(fs_info); } -static void reada_start_machine(struct btrfs_fs_info *fs_info) +void reada_start_machine(struct btrfs_fs_info *fs_info) { struct reada_machine_work *rmw; + /* + * FIXME if there are still requests in flight, we don''t need to + * kick a worker. Add a check to prevent unnecessary work + */ rmw = kzalloc(sizeof(*rmw), GFP_NOFS); if (!rmw) { /* FIXME we cannot handle this properly right now */ @@ -765,7 +919,7 @@ static void reada_start_machine(struct btrfs_fs_info *fs_info) btrfs_queue_worker(&fs_info->readahead_workers, &rmw->work); } -#ifdef DEBUG +#ifdef READA_DEBUG static void dump_devs(struct btrfs_fs_info *fs_info, int all) { struct btrfs_device *device; @@ -870,15 +1024,49 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int all) #endif /* - * interface + * if parent is given, the caller has to hold a ref on parent */ -struct reada_control *btrfs_reada_add(struct btrfs_root *root, - struct btrfs_key *key_start, struct btrfs_key *key_end) +struct reada_control *btrfs_reada_alloc(struct reada_control *parent, + struct btrfs_root *root, + struct btrfs_key *key_start, struct btrfs_key *key_end, + reada_cb_t callback) +{ + struct reada_control *rc; + + rc = kzalloc(sizeof(*rc), GFP_NOFS); + if (!rc) + return ERR_PTR(-ENOMEM); + + rc->root = root; + rc->parent = parent; + rc->callback = callback ? 
callback : readahead_descend; + if (key_start) + rc->key_start = *key_start; + if (key_end) + rc->key_end = *key_end; + atomic_set(&rc->elems, 0); + init_waitqueue_head(&rc->wait); + kref_init(&rc->refcnt); + if (parent) { + /* + * we just add one element to the parent as long as we''re + * not finished + */ + reada_control_elem_get(parent); + } + + return rc; +} + +int btrfs_reada_add(struct reada_control *parent, struct btrfs_root *root, + struct btrfs_key *key_start, struct btrfs_key *key_end, + reada_cb_t callback, void *ctx, struct reada_control **rcp) { struct reada_control *rc; u64 start; u64 generation; int level; + int ret; struct extent_buffer *node; static struct btrfs_key max_key = { .objectid = (u64)-1, @@ -886,17 +1074,18 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root, .offset = (u64)-1 }; - rc = kzalloc(sizeof(*rc), GFP_NOFS); + rc = btrfs_reada_alloc(parent, root, key_start, key_end, callback); if (!rc) - return ERR_PTR(-ENOMEM); + return -ENOMEM; - rc->root = root; - rc->key_start = *key_start; - rc->key_end = *key_end; - atomic_set(&rc->elems, 0); - init_waitqueue_head(&rc->wait); - kref_init(&rc->refcnt); - kref_get(&rc->refcnt); /* one ref for having elements */ + if (rcp) { + *rcp = rc; + /* + * as we return the rc, get an addition ref on it for + * the caller + */ + kref_get(&rc->refcnt); + } node = btrfs_root_node(root); start = node->start; @@ -904,35 +1093,36 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root, generation = btrfs_header_generation(node); free_extent_buffer(node); - reada_add_block(rc, start, &max_key, level, generation); + ret = reada_add_block(rc, start, &max_key, level, generation, ctx); reada_start_machine(root->fs_info); - return rc; + return ret; } -#ifdef DEBUG -int btrfs_reada_wait(void *handle) +#ifdef READA_DEBUG +int btrfs_reada_wait(struct reada_control *rc) { - struct reada_control *rc = handle; + struct btrfs_fs_info *fs_info = rc->root->fs_info; + int i; while (atomic_read(&rc->elems)) { wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0, - 5 * HZ); - dump_devs(rc->root->fs_info, rc->elems < 10 ? 1 : 0); + 1 * HZ); + dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0); + printk(KERN_DEBUG "reada_wait on %p: %d elems\n", rc, + atomic_read(&rc->elems)); } - dump_devs(rc->root->fs_info, rc->elems < 10 ? 1 : 0); + dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0); kref_put(&rc->refcnt, reada_control_release); return 0; } #else -int btrfs_reada_wait(void *handle) +int btrfs_reada_wait(struct reada_control *rc) { - struct reada_control *rc = handle; - while (atomic_read(&rc->elems)) { wait_event(rc->wait, atomic_read(&rc->elems) == 0); } @@ -949,3 +1139,80 @@ void btrfs_reada_detach(void *handle) kref_put(&rc->refcnt, reada_control_release); } + +/* + * abort all readahead for a specific reada_control + * this function does not wait for outstanding requests to finish, so + * when it returns, the abort is not fully complete. This function will + * cancel all currently enqueued readaheads for the given rc and all children + * of it. 
+ */ +int btrfs_reada_abort(struct btrfs_fs_info *fs_info, struct reada_control *rc) +{ + struct reada_extent *re = NULL; + struct list_head list; + int ret; + u64 logical = 0; + struct reada_extctl *rec; + struct reada_extctl *tmp; + + INIT_LIST_HEAD(&list); + + while (1) { + spin_lock(&fs_info->reada_lock); + ret = radix_tree_gang_lookup(&fs_info->reada_tree, (void **)&re, + logical >> PAGE_CACHE_SHIFT, 1); + if (ret == 1) + kref_get(&re->refcnt); + spin_unlock(&fs_info->reada_lock); + + if (ret != 1) + break; + + /* + * take out all extctls that should get deleted into another + * list + */ + spin_lock(&re->lock); + if (re->scheduled_for) { + spin_unlock(&re->lock); + goto next; + } + + list_for_each_entry_safe(rec, tmp, &re->extctl, list) { + struct reada_control *it; + + for (it = rec->rc; it; it = it->parent) { + if (it == rc) { + list_move(&rec->list, &list); + break; + } + } + } + spin_unlock(&re->lock); + + /* + * now cancel all extctls in the list + */ + while (!list_empty(&list)) { + struct reada_control *tmp_rc; + + rec = list_first_entry(&list, struct reada_extctl, + list); + rec->rc->callback(rec->rc->root, rec->rc, 0, NULL, + re->logical, + -EAGAIN, &re->top, rec->ctx); + list_del(&rec->list); + tmp_rc = rec->rc; + kfree(rec); + + reada_control_elem_put(tmp_rc); + reada_extent_put(fs_info, re); + } +next: + logical = re->logical + PAGE_CACHE_SIZE; + reada_extent_put(fs_info, re); /* our ref */ + } + + return 1; +} diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index abc0fbf..80140a8 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1136,7 +1136,6 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev, u64 generation; int mirror_num; struct reada_control *reada1; - struct reada_control *reada2; struct btrfs_key key_start; struct btrfs_key key_end; @@ -1189,23 +1188,21 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev, key_start.objectid = logical; key_start.type = BTRFS_EXTENT_ITEM_KEY; key_start.offset = (u64)0; + key_end = key_start; key_end.objectid = base + offset + nstripes * increment; - key_end.type = BTRFS_EXTENT_ITEM_KEY; - key_end.offset = (u64)0; - reada1 = btrfs_reada_add(root, &key_start, &key_end); - - key_start.objectid = BTRFS_EXTENT_CSUM_OBJECTID; - key_start.type = BTRFS_EXTENT_CSUM_KEY; - key_start.offset = logical; - key_end.objectid = BTRFS_EXTENT_CSUM_OBJECTID; - key_end.type = BTRFS_EXTENT_CSUM_KEY; - key_end.offset = base + offset + nstripes * increment; - reada2 = btrfs_reada_add(csum_root, &key_start, &key_end); - - if (!IS_ERR(reada1)) + ret = btrfs_reada_add(NULL, root, &key_start, &key_end, + NULL, NULL, &reada1); + /* if readahead fails, we just go ahead without it */ + if (ret == 0) { + key_start.objectid = BTRFS_EXTENT_CSUM_OBJECTID; + key_start.type = BTRFS_EXTENT_CSUM_KEY; + key_start.offset = logical; + key_end = key_start; + key_end.offset = base + offset + nstripes * increment; + ret = btrfs_reada_add(reada1, csum_root, &key_start, + &key_end, NULL, NULL, NULL); btrfs_reada_wait(reada1); - if (!IS_ERR(reada2)) - btrfs_reada_wait(reada2); + } mutex_lock(&fs_info->scrub_lock); while (atomic_read(&fs_info->scrub_pause_req)) { -- 1.7.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
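For illustration, a minimal kernel-style sketch of how a caller could use the extended interface might look like the following (my_walk_cb and my_prefetch_tree are made-up names; the signatures follow the declarations added to ctree.h above, it only compiles inside the btrfs tree). The callback descends the whole tree and enqueues every child block, keeping the element count above zero until the bottom is reached:

/* called for every completed read; eb is read-locked, may be NULL on error */
static void my_walk_cb(struct btrfs_root *root, struct reada_control *rc,
                       u64 wanted_generation, struct extent_buffer *eb,
                       u64 start, int err, struct btrfs_key *top, void *ctx)
{
        int i;
        int nritems;

        if (err || !eb)
                return;         /* cut this branch on error */

        if (btrfs_header_level(eb) == 0)
                return;         /* leaf reached, nothing more to enqueue */

        if (btrfs_header_generation(eb) != wanted_generation)
                return;         /* stale branch, ignore it */

        nritems = btrfs_header_nritems(eb);
        for (i = 0; i < nritems; i++) {
                struct btrfs_key next_key;

                if (i + 1 < nritems)
                        btrfs_node_key_to_cpu(eb, &next_key, i + 1);
                else
                        next_key = *top;
                /* enqueue the child; this keeps rc->elems from dropping to 0 */
                reada_add_block(rc, btrfs_node_blockptr(eb, i), &next_key,
                                btrfs_header_level(eb) - 1,
                                btrfs_node_ptr_generation(eb, i), ctx);
        }
}

/* start a readahead over the whole tree and wait for it to finish */
static int my_prefetch_tree(struct btrfs_root *root)
{
        struct reada_control *rc;
        struct btrfs_key start = { 0 };
        struct btrfs_key end = { .objectid = (u64)-1, .type = (u8)-1,
                                 .offset = (u64)-1 };
        int ret;

        ret = btrfs_reada_add(NULL, root, &start, &end, my_walk_cb, NULL, &rc);
        if (ret)
                return ret;
        return btrfs_reada_wait(rc);
}

The default callback added above does essentially the same, plus the key-range check against rc->key_start/rc->key_end before enqueueing a child.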
This adds a new special inode, the droptree inode. It is placed in the tree root and is used to store the state of snapshot deletion. Even if multiple snapshots are deleted at once, the full state is stored within this one inode. After snapshot deletion completes, the inode is left in place, but truncated to zero. This patch also exports free_space_cache''s io_ctl functions to droptree and adds functions to store and read u8, u16, u32 and u64 values, as well as byte arrays of arbitrary length. Signed-off-by: Arne Jansen <sensille@gmx.net> --- fs/btrfs/btrfs_inode.h | 4 + fs/btrfs/ctree.h | 6 ++ fs/btrfs/free-space-cache.c | 131 ++++++++++++++++++++++++++++++++++++------- fs/btrfs/free-space-cache.h | 32 +++++++++++ fs/btrfs/inode.c | 3 +- 5 files changed, 155 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 9b9b15f..8abbed4 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -196,6 +196,10 @@ static inline void btrfs_i_size_write(struct inode *inode, u64 size) static inline bool btrfs_is_free_space_inode(struct btrfs_root *root, struct inode *inode) { + if (BTRFS_I(inode)->location.objectid == BTRFS_DROPTREE_INO_OBJECTID) + /* it also lives in the tree_root, but is no free space + * inode */ + return false; if (root == root->fs_info->tree_root || BTRFS_I(inode)->location.objectid == BTRFS_FREE_INO_OBJECTID) return true; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 52b8a91..e187ab9 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -116,6 +116,12 @@ struct btrfs_ordered_sum; */ #define BTRFS_FREE_INO_OBJECTID -12ULL +/* + * The inode number assigned to the special inode for storing + * snapshot deletion progress + */ +#define BTRFS_DROPTREE_INO_OBJECTID -13ULL + /* dummy objectid represents multiple objectids */ #define BTRFS_MULTIPLE_OBJECTIDS -255ULL diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index b30242f..7e993b0 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -259,19 +259,8 @@ static int readahead_cache(struct inode *inode) return 0; } -struct io_ctl { - void *cur, *orig; - struct page *page; - struct page **pages; - struct btrfs_root *root; - unsigned long size; - int index; - int num_pages; - unsigned check_crcs:1; -}; - -static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode, - struct btrfs_root *root) +int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode, + struct btrfs_root *root) { memset(io_ctl, 0, sizeof(struct io_ctl)); io_ctl->num_pages = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> @@ -286,12 +275,12 @@ static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode, return 0; } -static void io_ctl_free(struct io_ctl *io_ctl) +void io_ctl_free(struct io_ctl *io_ctl) { kfree(io_ctl->pages); } -static void io_ctl_unmap_page(struct io_ctl *io_ctl) +void io_ctl_unmap_page(struct io_ctl *io_ctl) { if (io_ctl->cur) { kunmap(io_ctl->page); @@ -300,7 +289,7 @@ static void io_ctl_unmap_page(struct io_ctl *io_ctl) } } -static void io_ctl_map_page(struct io_ctl *io_ctl, int clear) +void io_ctl_map_page(struct io_ctl *io_ctl, int clear) { WARN_ON(io_ctl->cur); BUG_ON(io_ctl->index >= io_ctl->num_pages); @@ -312,7 +301,7 @@ static void io_ctl_map_page(struct io_ctl *io_ctl, int clear) memset(io_ctl->cur, 0, PAGE_CACHE_SIZE); } -static void io_ctl_drop_pages(struct io_ctl *io_ctl) +void io_ctl_drop_pages(struct io_ctl *io_ctl) { int i; @@ -327,8 +316,8 @@ static void io_ctl_drop_pages(struct io_ctl *io_ctl) } } -static int 
io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode, - int uptodate) +int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode, + int uptodate) { struct page *page; gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping); @@ -361,6 +350,108 @@ static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode, return 0; } +void io_ctl_set_bytes(struct io_ctl *io_ctl, void *data, unsigned long len) +{ + unsigned long l; + + while (len) { + if (io_ctl->cur == NULL) + io_ctl_map_page(io_ctl, 1); + l = min(len, io_ctl->size); + memcpy(io_ctl->cur, data, l); + if (len != l) { + io_ctl_unmap_page(io_ctl); + } else { + io_ctl->cur += l; + io_ctl->size -= l; + } + data += l; + len -= l; + } +} + +void io_ctl_get_bytes(struct io_ctl *io_ctl, void *data, unsigned long len) +{ + unsigned long l; + + while (len) { + if (io_ctl->cur == NULL) + io_ctl_map_page(io_ctl, 0); + l = min(len, io_ctl->size); + memcpy(data, io_ctl->cur, l); + if (len != l) { + io_ctl_unmap_page(io_ctl); + } else { + io_ctl->cur += l; + io_ctl->size -= l; + } + data += l; + len -= l; + } +} + +void io_ctl_set_u64(struct io_ctl *io_ctl, u64 val) +{ + u64 v = cpu_to_le64(val); + + io_ctl_set_bytes(io_ctl, &v, sizeof(u64)); +} + +u64 io_ctl_get_u64(struct io_ctl *io_ctl) +{ + u64 v; + + io_ctl_get_bytes(io_ctl, &v, sizeof(u64)); + + return le64_to_cpu(v); +} + +void io_ctl_set_u32(struct io_ctl *io_ctl, u32 val) +{ + u32 v = cpu_to_le32(val); + + io_ctl_set_bytes(io_ctl, &v, sizeof(u32)); +} + +u32 io_ctl_get_u32(struct io_ctl *io_ctl) +{ + u32 v; + + io_ctl_get_bytes(io_ctl, &v, sizeof(u32)); + + return le32_to_cpu(v); +} + +void io_ctl_set_u16(struct io_ctl *io_ctl, u16 val) +{ + u16 v = cpu_to_le16(val); + + io_ctl_set_bytes(io_ctl, &v, sizeof(u16)); +} + +u16 io_ctl_get_u16(struct io_ctl *io_ctl) +{ + u16 v; + + io_ctl_get_bytes(io_ctl, &v, sizeof(u16)); + + return le16_to_cpu(v); +} + +void io_ctl_set_u8(struct io_ctl *io_ctl, u8 val) +{ + io_ctl_set_bytes(io_ctl, &val, sizeof(u8)); +} + +u8 io_ctl_get_u8(struct io_ctl *io_ctl) +{ + u8 v; + + io_ctl_get_bytes(io_ctl, &v, sizeof(u8)); + + return v; +} + static void io_ctl_set_generation(struct io_ctl *io_ctl, u64 generation) { u64 *val; @@ -523,7 +614,7 @@ static int io_ctl_add_bitmap(struct io_ctl *io_ctl, void *bitmap) return 0; } -static void io_ctl_zero_remaining_pages(struct io_ctl *io_ctl) +void io_ctl_zero_remaining_pages(struct io_ctl *io_ctl) { /* * If we''re not on the boundary we know we''ve modified the page and we diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h index 8f2613f..7fdb2c3 100644 --- a/fs/btrfs/free-space-cache.h +++ b/fs/btrfs/free-space-cache.h @@ -110,4 +110,36 @@ int btrfs_return_cluster_to_free_space( struct btrfs_free_cluster *cluster); int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group, u64 *trimmed, u64 start, u64 end, u64 minlen); + +struct io_ctl { + void *cur, *orig; + struct page *page; + struct page **pages; + struct btrfs_root *root; + unsigned long size; + int index; + int num_pages; + unsigned check_crcs:1; +}; + +int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode, + struct btrfs_root *root); +void io_ctl_free(struct io_ctl *io_ctl); +void io_ctl_unmap_page(struct io_ctl *io_ctl); +void io_ctl_map_page(struct io_ctl *io_ctl, int clear); +void io_ctl_drop_pages(struct io_ctl *io_ctl); +int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode, + int uptodate); +void io_ctl_set_bytes(struct io_ctl *io_ctl, void *data, unsigned long len); +void 
io_ctl_get_bytes(struct io_ctl *io_ctl, void *data, unsigned long len); +void io_ctl_set_u64(struct io_ctl *io_ctl, u64 val); +u64 io_ctl_get_u64(struct io_ctl *io_ctl); +void io_ctl_set_u32(struct io_ctl *io_ctl, u32 val); +u32 io_ctl_get_u32(struct io_ctl *io_ctl); +void io_ctl_set_u16(struct io_ctl *io_ctl, u16 val); +u16 io_ctl_get_u16(struct io_ctl *io_ctl); +void io_ctl_set_u8(struct io_ctl *io_ctl, u8 val); +u8 io_ctl_get_u8(struct io_ctl *io_ctl); +void io_ctl_zero_remaining_pages(struct io_ctl *io_ctl); + #endif diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index cbeb2e3..b5d82da 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1809,7 +1809,8 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end) } ret = 0; out: - if (root != root->fs_info->tree_root) + if (root != root->fs_info->tree_root || + inode->i_ino == BTRFS_DROPTREE_INO_OBJECTID) btrfs_delalloc_release_metadata(inode, ordered_extent->len); if (trans) { if (nolock) -- 1.7.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
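As a rough illustration of how the exported io_ctl helpers combine, a hypothetical writer for a small droptree header could look like the sketch below (droptree_write_header_sketch is an invented name; error handling and the actual dirtying/write-out of the pages are omitted, and the real on-disk format is defined by the droptree implementation patch):

static int droptree_write_header_sketch(struct btrfs_root *tree_root,
                                        struct inode *inode, u32 num_roots)
{
        struct io_ctl io_ctl;
        int ret;

        ret = io_ctl_init(&io_ctl, inode, tree_root);
        if (ret)
                return ret;

        ret = io_ctl_prepare_pages(&io_ctl, inode, 0);  /* 0: pages for writing */
        if (ret)
                goto out;

        io_ctl_set_u64(&io_ctl, 1);             /* format version */
        io_ctl_set_u32(&io_ctl, num_roots);     /* number of roots in the state */

        io_ctl_zero_remaining_pages(&io_ctl);   /* pad the last page */
        io_ctl_drop_pages(&io_ctl);
out:
        io_ctl_free(&io_ctl);
        return ret;
}

Reading the state back would use io_ctl_prepare_pages(&io_ctl, inode, 1) and the matching io_ctl_get_u64/io_ctl_get_u32 calls in the same order.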
[PATCH 3/5] btrfs: droptree structures and initialization (Arne Jansen, 2012-Apr-12 15:54 UTC)
Add the fs-global state and initialization for snapshot deletion Signed-off-by: Arne Jansen <sensille@gmx.net> --- fs/btrfs/ctree.h | 35 +++++++++++++++++++++++++++++++++++ fs/btrfs/disk-io.c | 18 ++++++++++++++++++ 2 files changed, 53 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e187ab9..8eb0795 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1257,6 +1257,41 @@ struct btrfs_fs_info { /* next backup root to be overwritten */ int backup_root_index; + + /* + * global state for snapshot deletion via readahead. All fields are + * protected by droptree_lock. droptree_lock is a mutex and not a + * spinlock as allocations are done inside and we don''t want to use + * atomic allocations unless we really have to. + */ + struct mutex droptree_lock; + + /* + * currently running requests (droptree_nodes) for each level and + * the corresponding limits. It''s necessary to limit them to have + * an upper limit on the state that has to be written with each + * commit. All nodes exceeding the limit are enqueued to droptree_queue. + */ + long droptree_req[BTRFS_MAX_LEVEL + 1]; + long droptree_limit[BTRFS_MAX_LEVEL + 1]; + struct list_head droptree_queue[BTRFS_MAX_LEVEL + 1]; + + /* + * when droptree is paused, all currently running requests are moved + * to droptree_restart. All nodes in droptree_queue are moved to + * droptree_requeue + */ + struct list_head droptree_restart; + struct list_head droptree_requeue; + + /* + * synchronization for pause/restart. droptree_rc is the top-level + * reada_control, used to cancel all running requests + */ + int droptrees_running; + int droptree_pause_req; + wait_queue_head_t droptree_wait; + struct reada_control *droptree_rc; }; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index b801d29..7b3ddd7 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1911,6 +1911,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, int err = -EINVAL; int num_backups_tried = 0; int backup_index = 0; + int i; extent_root = fs_info->extent_root kzalloc(sizeof(struct btrfs_root), GFP_NOFS); @@ -1989,6 +1990,23 @@ struct btrfs_root *open_ctree(struct super_block *sb, INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT); spin_lock_init(&fs_info->reada_lock); + /* snapshot deletion state */ + mutex_init(&fs_info->droptree_lock); + fs_info->droptree_pause_req = 0; + fs_info->droptrees_running = 0; + for (i = 0; i < BTRFS_MAX_LEVEL; ++i) { + fs_info->droptree_limit[i] = 100; + fs_info->droptree_req[i] = 0; + INIT_LIST_HEAD(fs_info->droptree_queue + i); + } + /* FIXME calculate some sane values, maybe based on avail RAM */ + fs_info->droptree_limit[0] = 40000; + fs_info->droptree_limit[1] = 10000; + fs_info->droptree_limit[2] = 4000; + INIT_LIST_HEAD(&fs_info->droptree_restart); + INIT_LIST_HEAD(&fs_info->droptree_requeue); + init_waitqueue_head(&fs_info->droptree_wait); + fs_info->thread_pool_size = min_t(unsigned long, num_online_cpus() + 2, 8); -- 1.7.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
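To make the role of the new fields a bit more concrete, a hypothetical helper using them for per-level throttling could look like the following sketch (droptree_try_start is an invented name, struct droptree_node is only introduced by the next patch, and the real enqueue/requeue logic lives there):

static bool droptree_try_start(struct btrfs_fs_info *fs_info,
                               struct droptree_node *dn, int level)
{
        bool started = false;

        mutex_lock(&fs_info->droptree_lock);
        if (fs_info->droptree_req[level] < fs_info->droptree_limit[level]) {
                ++fs_info->droptree_req[level]; /* account the running request */
                started = true;
        } else {
                /* over the limit: park the node until a slot frees up */
                list_add_tail(&dn->list, &fs_info->droptree_queue[level]);
        }
        mutex_unlock(&fs_info->droptree_lock);

        return started;
}

When a request finishes, the counter is decremented again and parked nodes can be restarted; in the next patch this is handled by droptree_free_up and droptree_kick.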
This is an implementation of snapshot deletion using the readahead framework. Multiple snapshots can be deleted at once and the trees are not enumerated sequentially but in parallel in many branches. This way readahead can reorder the request to better utilize all disks. For a more detailed description see inline comments. Signed-off-by: Arne Jansen <sensille@gmx.net> --- fs/btrfs/Makefile | 2 +- fs/btrfs/droptree.c | 1916 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 1917 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index 0c4fa2b..620d7c8 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -8,7 +8,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ - reada.o backref.o ulist.o + reada.o backref.o ulist.o droptree.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/droptree.c b/fs/btrfs/droptree.c new file mode 100644 index 0000000..9bc9c23 --- /dev/null +++ b/fs/btrfs/droptree.c @@ -0,0 +1,1916 @@ +/* + * Copyright (C) 2011 STRATO. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ +#include "ctree.h" +#include "transaction.h" +#include "disk-io.h" +#include "locking.h" +#include "free-space-cache.h" +#include "print-tree.h" + +/* + * This implements snapshot deletions with the use of the readahead framework. + * The fs-tree is not tranversed in sequential (key-) order, but by descending + * into multiple paths at once. In addition, up to DROPTREE_MAX_ROOTS snapshots + * will be deleted in parallel. + * The basic principle is as follows: + * When a tree node is found, first its refcnt and flags are fetched (via + * readahead) from the extent allocation tree. Simplified, if the refcnt + * is > 1, other trees also reference this node and we also have to drop our + * ref to it and are done. If on the other hand the refcnt is 1, it''s our + * node and we have to free it (the same holds for data extents). So we fetch + * the actual node (via readahead) and add all nodes/extents it points to to + * the queue to again fetch refcnts for them. + * While the case where refcnt > 1 looks like the easy one, there''s one special + * case to take into account: If the node still uses non-shared refs and we + * are the owner of the node, we''re not allowed to just drop our ref, as it + * would leave the node with an unresolvable backref (it points to our tree). + * So we first have to convert the refs to shared refs, and not only for this + * node, but for its full subtree. We can stop descending if we encounter a + * node of which we are not owner. 
+ * One big difference to the old snapshot deletion code that sequentially walks + * the tree is that we can''t represent the current deletion state with a single + * key. As we delete in several paths simultaneously, we have to record all + * loose ends in the state. To not get an exponentially growing state, we don''t + * delete the refs top-down, but bottom-up. In addition, we restrict the + * currently processed nodes per-level globally. This way, we have a bounded + * size for the state and can preallocate the needed space before any work + * starts. During each transction commit, we write the state to a special + * inode (DROPTREE_INO) recorded in the root tree. + * For a commit, all outstanding readahead-requests get cancelled and moved + * to a restart queue. + * + * The central data structure here is the droptree_node (dn). It represents a + * file system extent, meaning a tree node, tree leaf or a data extent. The + * dn''s are organized in a tree corresponding to the disk structure. Each dn + * keeps a bitmap that records which of its children are finished. When the + * last bit gets set by a child, the freeing of the node is triggered. In + * addition, struct droptree_root represents a fs tree to delete. It is mainly + * used to keep all roots in a list. The reada_control for this tree is also + * recorded here. We don''t keep it in the dn''s, as it is being freed and re- + * created with each transaction commit. + */ + +#define DROPTREE_MAX_ROOTS 50 + +/* + * constants used in the on-disk state for deletion progress + */ +#define DROPTREE_STATE_GO_UP 0xffff +#define DROPTREE_STATE_GO_DOWN 0xfffe +#define DROPTREE_STATE_END 0xfffd + +/* + * used in droptree inode + */ +#define DROPTREE_VERSION 1 + +/* + * helper struct used by main loop to keep all roots currently being deleted + * in a list. + */ +struct droptree_root { + struct droptree_node *top_dn; + struct btrfs_root *root; + struct reada_control *rc; + struct list_head list; +}; + +/* + * this structure contains all information needed to carry a + * node/leaf/data extent through all phases. Each phases fills + * in the information it got, so don''t assume all information is + * available at any time. + */ +struct droptree_node { + /* + * pointer back to our root + */ + struct droptree_root *droproot; + + /* + * where our place in the parent is. parent_slot is the bit number + * in parent->map + */ + struct droptree_node *parent; + int parent_slot; + + /* + * tree of nodes. Both are protected by lock at bottom. + */ + struct list_head next; + struct list_head children; + + /* + * information about myself + */ + u64 start; + u64 len; + int level; + u64 generation; + u64 flags; + u64 owner; /* only for data extents */ + u64 offset; /* only for data extents */ + + /* + * bitmap to record the completion of each child node. Protected + * by lock at bottom. + */ + u32 *map; + int nritems; + + /* + * the readahead for prefetching the extent tree entry + */ + struct reada_control *sub_rc; + + /* + * to get out of end_io worker context into readahead worker context. + * This is always needed if we want to do disk-IO, as otherwise we + * might end up holding all end_io worker, thus preventing the IO from + * completion and deadlocking. end_transaction might do disk-IO. + */ + struct btrfs_work work; + + /* + * used to queue node either in restart queue or requeue queue. + * Protected by fs_info->droptree_lock. + */ + struct list_head list; + + /* + * convert == 1 means we need to convert the refs in this tree to + * shared refs. 
The point where we started the conversion is marked + * with conversion_point == 1. When we finished conversion, the normal + * dropping of refs restarts at this point. When we encounter a node + * that doesn''t need conversion, we mark it with convert_stop == 1. + * This also means no nodes below this point need conversion. + */ + unsigned int convert:1; + unsigned int conversion_point:1; + unsigned int convert_stop:1; + + /* + * lock for bitmap and lists + */ + spinlock_t lock; +}; + +static struct btrfs_key min_key = { 0 }; +static struct btrfs_key max_key = { + .objectid = (u64)-1, + .type = (u8)-1, + .offset = (u64)-1 +}; + +static void droptree_fetch_ref(struct btrfs_work *work); +static void droptree_free_ref(struct btrfs_work *work); +static void droptree_kick(struct btrfs_fs_info *fs_info); +static void droptree_free_up(struct btrfs_trans_handle *trans, + struct droptree_node *dn, int last_ref); +static void droptree_reada_conv(struct btrfs_root *root, + struct reada_control *rc, + u64 wanted_generation, + struct extent_buffer *eb, + u64 start, int err, + struct btrfs_key *top, void *ctx); +static int droptree_reada_dn(struct droptree_node *dn); +static struct droptree_node *droptree_alloc_node(struct droptree_root *dr); +static void droptree_reada_fstree(struct btrfs_root *root, + struct reada_control *rc, + u64 wanted_generation, + struct extent_buffer *eb, + u64 start, int err, + struct btrfs_key *top, void *ctx); + +/* + * Bit operations. These are mostly a copy of lib/bitmap.c, with 2 differences: + * a) it always operates on u32 instead of unsigned long. This makes it easier + * to save them to disk in a portable format + * b) the locking has to be provided externally + */ +#define DT_BITS_PER_BYTE 8 +#define DT_BITS_PER_U32 (sizeof(u32) * DT_BITS_PER_BYTE) +#define DT_BIT_MASK(nr) (1UL << ((nr) % DT_BITS_PER_U32)) +#define DT_BIT_WORD(nr) ((nr) / DT_BITS_PER_U32) +#define DT_BITS_TO_U32(nr) DIV_ROUND_UP(nr, DT_BITS_PER_BYTE * sizeof(u32)) +#define DT_BITMAP_LAST_WORD_MASK(nbits) \ +( \ + ((nbits) % DT_BITS_PER_U32) ? \ + (1UL << ((nbits) % DT_BITS_PER_U32))-1 : ~0UL \ +) + +static void droptree_set_bit(int nr, u32 *addr) +{ + u32 mask = DT_BIT_MASK(nr); + u32 *p = ((u32 *)addr) + DT_BIT_WORD(nr); + + *p |= mask; +} + +static int droptree_test_bit(int nr, const u32 *addr) +{ + return 1UL & (addr[DT_BIT_WORD(nr)] >> (nr & (DT_BITS_PER_U32 - 1))); +} + +static int droptree_bitmap_full(const u32 *addr, int bits) +{ + int k; + int lim = bits / DT_BITS_PER_U32; + + for (k = 0; k < lim; ++k) + if (~addr[k]) + return 0; + + if (bits % DT_BITS_PER_U32) + if (~addr[k] & DT_BITMAP_LAST_WORD_MASK(bits)) + return 0; + + return 1; +} + +/* + * This function is called from readahead. + * Prefetch the extent tree entry for this node so we can later read the + * refcnt without waiting for disk I/O. This readahead is implemented as + * a sub readahead from the fstree readahead. 
+ */ +static void droptree_reada_ref(struct btrfs_root *root, + struct reada_control *rc, + u64 wanted_generation, struct extent_buffer *eb, + u64 start, int err, struct btrfs_key *top, + void *ctx) +{ + int nritems; + u64 generation = 0; + int level = 0; + int i; + struct btrfs_fs_info *fs_info = root->fs_info; + struct droptree_node *dn = ctx; + int nfound = 0; + int ret; + + mutex_lock(&fs_info->droptree_lock); + if (err == -EAGAIN || fs_info->droptree_pause_req) { + /* + * if cancelled, put to restart queue + */ + list_add_tail(&dn->list, &fs_info->droptree_restart); + mutex_unlock(&fs_info->droptree_lock); + return; + } + mutex_unlock(&fs_info->droptree_lock); + + if (eb) { + level = btrfs_header_level(eb); + generation = btrfs_header_generation(eb); + if (wanted_generation != generation) + err = 1; + } else { + BUG_ON(!err); /* can''t happen */ + } + + /* + * either when level 0 is reached and the prefetch is finished, or + * when an error occurs, we finish this readahead and kick the + * reading of the actual refcnt to a worker + */ + if (err || level == 0) { +error: + dn->work.func = droptree_fetch_ref; + dn->sub_rc = rc; + + /* + * we don''t want our rc to go away right now, as this might + * signal the parent rc before we are done. In the next stage, + * we first add a new readahead to the parent and afterwards + * clear our sub-reada + */ + reada_control_elem_get(rc); + + /* + * we have to push the lookup_extent_info out to a worker, + * as we are currently running in the context of the end_io + * workers and lookup_extent_info might do a lookup, thus + * deadlocking + */ + btrfs_queue_worker(&fs_info->readahead_workers, &dn->work); + + return; + } + + /* + * level 0 not reached yet, so continue down the tree with reada + */ + nritems = btrfs_header_nritems(eb); + + for (i = 0; i < nritems; i++) { + u64 n_gen; + struct btrfs_key key; + struct btrfs_key next_key; + u64 bytenr; + + btrfs_node_key_to_cpu(eb, &key, i); + if (i + 1 < nritems) + btrfs_node_key_to_cpu(eb, &next_key, i + 1); + else + next_key = *top; + bytenr = btrfs_node_blockptr(eb, i); + n_gen = btrfs_node_ptr_generation(eb, i); + + if (btrfs_comp_cpu_keys(&key, &rc->key_end) < 0 && + btrfs_comp_cpu_keys(&next_key, &rc->key_start) > 0) { + ret = reada_add_block(rc, bytenr, &next_key, + level - 1, n_gen, ctx); + if (ret) + goto error; + ++nfound; + } + } + if (nfound > 1) { + /* + * just a safeguard, we searched for a single key, so there + * must not be more than one entry in the node + */ + btrfs_print_leaf(rc->root, eb); + BUG_ON(1); /* inconsistent fs */ + } + if (nfound == 0) { + /* + * somewhere the tree changed while we were running down. + * So start over again from the top + */ + ret = droptree_reada_dn(dn); + if (ret) + goto error; + } +} + +/* + * This is called from a worker, kicked off by droptree_reada_ref. We arrive + * here after either the extent tree is prefetched or an error occured. In + * any case, the refcnt is read synchronously now, hopefully without disk I/O. + * If we encounter any hard errors here, we have no chance but to BUG. 
+ */ +static void droptree_fetch_ref(struct btrfs_work *work) +{ + struct droptree_node *dn; + int ret; + u64 refs; + u64 flags; + struct btrfs_trans_handle *trans; + struct extent_buffer *eb = NULL; + struct btrfs_root *root; + struct btrfs_fs_info *fs_info; + struct reada_control *sub_rc; + int free_up = 0; + int is_locked = 0; + + dn = container_of(work, struct droptree_node, work); + + root = dn->droproot->root; + fs_info = root->fs_info; + sub_rc = dn->sub_rc; + + trans = btrfs_join_transaction(fs_info->extent_root); + BUG_ON(!trans); /* can''t back out */ + + ret = btrfs_lookup_extent_info(trans, root, dn->start, dn->len, + &refs, &flags); + BUG_ON(ret); /* can''t back out */ + dn->flags = flags; + + if (dn->convert && dn->level >= 0) { + eb = btrfs_find_create_tree_block(root, dn->start, dn->len); + BUG_ON(!eb); /* can''t back out */ + if (!btrfs_buffer_uptodate(eb, dn->generation)) { + struct reada_control *conv_rc; + +fetch_buf: + /* + * we might need to convert the ref. To check this, + * we need to know the header_owner of the block, so + * we actually need the block''s content. Just add + * a sub-reada for the content that points back here + */ + free_extent_buffer(eb); + btrfs_end_transaction(trans, fs_info->extent_root); + + conv_rc = btrfs_reada_alloc(dn->droproot->rc, + root, NULL, NULL, + droptree_reada_conv); + ret = reada_add_block(conv_rc, dn->start, NULL, + dn->level, dn->generation, dn); + BUG_ON(ret < 0); /* can''t back out */ + reada_start_machine(fs_info); + + return; + } + if (btrfs_header_owner(eb) != root->root_key.objectid) { + /* + * we''re not the owner of the block, so we can stop + * converting. No blocks below this will need conversion + */ + dn->convert_stop = 1; + free_up = 1; + } else { + /* conversion needed, descend */ + btrfs_tree_read_lock(eb); + btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); + is_locked = 1; + droptree_reada_fstree(root, dn->droproot->rc, + dn->generation, eb, dn->start, 0, + NULL, dn); + } + goto out; + } + + if (refs > 1) { + /* + * we did the lookup without proper locking. If the refcnt is 1, + * no locking is needed, as we hold the only reference to + * the extent. When the refcnt is >1, it can change at any time. + * To prevent this, we lock either the extent (for a tree + * block), or the leaf referencing the extent (for a data + * extent). Afterwards we repeat the lookup. + */ + if (dn->level == -1) + eb = btrfs_find_create_tree_block(root, + dn->parent->start, + dn->parent->len); + else + eb = btrfs_find_create_tree_block(root, dn->start, + dn->len); + BUG_ON(!eb); /* can''t back out */ + btrfs_tree_read_lock(eb); + btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); + is_locked = 1; + ret = btrfs_lookup_extent_info(trans, root, dn->start, dn->len, + &refs, &flags); + BUG_ON(ret); /* can''t back out */ + dn->flags = flags; + + if (refs == 1) { + /* + * The additional ref(s) has gone away in the meantime. + * Now the extent is ours, no need for locking anymore + */ + btrfs_tree_read_unlock_blocking(eb); + free_extent_buffer(eb); + eb = NULL; + } else if (!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) && + dn->level >= 0 && + !btrfs_buffer_uptodate(eb, dn->generation)) { + /* + * we might need to convert the ref. To check this, + * we need to know the header_owner of the block, so + * we actually need the block''s contents. 
Just add + * a sub-reada for the content that points back here + */ + btrfs_tree_read_unlock_blocking(eb); + + goto fetch_buf; + } + } + + /* + * See if we have to convert the ref to a shared ref before we drop + * our ref. Above we have made sure the buffer is uptodate. + */ + if (refs > 1 && dn->level >= 0 && + !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) && + btrfs_header_owner(eb) == root->root_key.objectid) { + dn->convert = 1; + /* when done with the conversion, switch to freeing again */ + dn->conversion_point = 1; + + droptree_reada_fstree(root, dn->droproot->rc, + dn->generation, eb, dn->start, 0, + NULL, dn); + } else if (refs == 1 && dn->level >= 0) { + /* + * refcnt is 1, descend into lower levels + */ + ret = reada_add_block(dn->droproot->rc, dn->start, NULL, + dn->level, dn->generation, dn); + WARN_ON(ret < 0); + reada_start_machine(fs_info); + } else { + /* + * either refcnt is >1, or we''ve reached the bottom of the + * tree. In any case, drop our reference + */ + free_up = 1; + } +out: + if (eb) { + if (is_locked) + btrfs_tree_read_unlock_blocking(eb); + free_extent_buffer(eb); + } + + if (free_up) { + /* + * mark node as done in parent. This ends the lifecyle of dn + */ + droptree_free_up(trans, dn, refs == 1); + } + + btrfs_end_transaction(trans, fs_info->extent_root); + /* + * end the sub reada. This might complete the parent. + */ + reada_control_elem_put(sub_rc); +} + +/* + * mark the slot in the parent as done. This might also complete the parent, + * so walk the tree up as long as nodes complete + * + * can''t be called from end_io worker context, as it needs a transaction + */ +static noinline void droptree_free_up(struct btrfs_trans_handle *trans, + struct droptree_node *dn, int last_ref) +{ + struct btrfs_root *root = dn->droproot->root; + struct btrfs_fs_info *fs_info = root->fs_info; + + while (dn) { + struct droptree_node *parent = dn->parent; + int slot = dn->parent_slot; + u64 parent_start = 0; + int ret; + + if (parent && parent->flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) + parent_start = parent->start; + + if (dn->level >= 0) { + mutex_lock(&fs_info->droptree_lock); + --fs_info->droptree_req[dn->level]; + mutex_unlock(&fs_info->droptree_lock); + } + + if (dn->convert) { + if (dn->level >= 0 && + !(dn->flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) && + !dn->convert_stop) { + struct extent_buffer *eb; + eb = read_tree_block(root, dn->start, dn->len, + dn->generation); + BUG_ON(!eb); /* can''t back out */ + btrfs_tree_lock(eb); + btrfs_set_lock_blocking(eb); + ret = btrfs_inc_ref(trans, root, eb, 1, 1); + BUG_ON(ret); /* can''t back out */ + ret = btrfs_dec_ref(trans, root, eb, 0, 1); + BUG_ON(ret); /* can''t back out */ + ret = btrfs_set_disk_extent_flags(trans, + root, eb->start, eb->len, + BTRFS_BLOCK_FLAG_FULL_BACKREF, + 0); + BUG_ON(ret); /* can''t back out */ + btrfs_tree_unlock(eb); + free_extent_buffer(eb); + dn->flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF; + } + dn->convert = 0; + if (dn->conversion_point) { + /* + * start over again for this node, clean it + * and enqueue it again + */ + dn->conversion_point = 0; + + kfree(dn->map); + dn->map = NULL; + dn->nritems = 0; + + /* + * just add to the list and let droptree_kick + * do the actual work of enqueueing + */ + mutex_lock(&fs_info->droptree_lock); + list_add_tail(&dn->list, + &fs_info->droptree_queue[dn->level]); + reada_control_elem_get(dn->droproot->rc); + mutex_unlock(&fs_info->droptree_lock); + + goto out; + } + } else if (last_ref && dn->level >= 0) { + struct extent_buffer *eb; + + /* + * 
btrfs_free_tree_block needs to read the block to + * act on the owner recorded in the header. We have + * read the block some time ago, so hopefully it is + * still in the cache + */ + eb = read_tree_block(root, dn->start, dn->len, + dn->generation); + BUG_ON(!eb); /* can''t back out */ + btrfs_free_tree_block(trans, root, eb, + parent_start, 1, 0); + free_extent_buffer(eb); + } else { + btrfs_free_extent(trans, root, dn->start, dn->len, + parent_start, + root->root_key.objectid, + dn->owner, dn->offset, 0); + } + + if (!parent) + break; + + /* + * this free is after the parent check, as we don''t want to + * free up the top level node. The main loop uses dn->map + * as an indication if the tree is done. + */ + spin_lock(&parent->lock); + list_del(&dn->next); + spin_unlock(&parent->lock); + kfree(dn->map); + kfree(dn); + + /* + * notify up + */ + spin_lock(&parent->lock); + droptree_set_bit(slot, parent->map); + if (!droptree_bitmap_full(parent->map, parent->nritems)) { + spin_unlock(&parent->lock); + break; + } + spin_unlock(&parent->lock); + + dn = parent; + last_ref = 1; + } + +out: + droptree_kick(fs_info); +} + +/* + * this callback is used when we need the actual eb to decide whether to + * convert the refs for this node or not. It just loops back to + * droptree_reada_fetch_ref + */ +static void droptree_reada_conv(struct btrfs_root *root, + struct reada_control *rc, + u64 wanted_generation, + struct extent_buffer *eb, + u64 start, int err, + struct btrfs_key *top, void *ctx) +{ + struct droptree_node *dn = ctx; + struct btrfs_fs_info *fs_info = root->fs_info; + + if (err == -EAGAIN) { + /* + * we''re still in the process of fetching the refs. + * As we want to start over cleanly after the commit, + * we also have to give up the sub_rc + */ + reada_control_elem_put(dn->sub_rc); + + mutex_lock(&fs_info->droptree_lock); + list_add_tail(&dn->list, &fs_info->droptree_restart); + mutex_unlock(&fs_info->droptree_lock); + return; + } + + if (err || eb == NULL) + BUG(); /* can''t back out */ + + /* not yet done with the conversion stage, go back to step 2 */ + btrfs_queue_worker(&fs_info->readahead_workers, &dn->work); + + droptree_kick(fs_info); +} + +/* + * After having fetched the refcnt for a node and decided we have to descend + * into it, we arrive here. Called from reada for the actual extent. + * The main idea is to find all pointers to lower nodes and add them to reada. + */ +static void droptree_reada_fstree(struct btrfs_root *root, + struct reada_control *rc, + u64 wanted_generation, + struct extent_buffer *eb, + u64 start, int err, + struct btrfs_key *top, void *ctx) +{ + int nritems; + u64 generation; + int level; + int i; + struct droptree_node *dn = ctx; + struct droptree_node *child; + struct btrfs_fs_info *fs_info = root->fs_info; + struct droptree_node **child_map = NULL; + u32 *finished_map = NULL; + int nrsent = 0; + int ret; + + if (err == -EAGAIN) { + mutex_lock(&fs_info->droptree_lock); + list_add_tail(&dn->list, &fs_info->droptree_restart); + mutex_unlock(&fs_info->droptree_lock); + return; + } + + if (err || eb == NULL) { + /* + * FIXME we can''t deal with I/O errors here. One possibility + * would to abandon the subtree and just leave it allocated, + * wasting the space. Another way would be to turn the fs + * readonly. 
+ */ + BUG(); /* can''t back out */ + } + + level = btrfs_header_level(eb); + nritems = btrfs_header_nritems(eb); + generation = btrfs_header_generation(eb); + + if (wanted_generation != generation) { + /* + * the fstree is supposed to be static, as it is inaccessible + * from user space. So if there''s a generation mismatch here, + * something has gone wrong. + */ + BUG(); /* can''t back out */ + } + + /* + * allocate a bitmap if we don''t already have one. In case we restart + * a snapshot deletion after a mount, the map already contains completed + * slots. If we have the map, we put it aside for the moment and replace + * it with a zero-filled map. During the loop, we repopulate it. If we + * wouldn''t do that, we might end up with a dn already being freed + * by completed children that got enqueued during the loop. This way + * we make sure the dn might only be freed during the last round. + */ + if (dn->map) { + struct droptree_node *it; + /* + * we are in a restore. build a map of all child nodes that + * are already present + */ + child_map = kzalloc(nritems * sizeof(struct droptree_node), + GFP_NOFS); + BUG_ON(!child_map); /* can''t back out */ + BUG_ON(nritems != dn->nritems); /* inconsistent fs */ + list_for_each_entry(it, &dn->children, next) { + BUG_ON(it->parent_slot < 0 || + it->parent_slot >= nritems); /* incons. fs */ + child_map[it->parent_slot] = it; + } + finished_map = dn->map; + dn->map = NULL; + } + dn->map = kzalloc(DT_BITS_TO_U32(nritems) * sizeof(u32), GFP_NOFS); + dn->nritems = nritems; + + /* + * fetch refs for all lower nodes + */ + for (i = 0; i < nritems; i++) { + u64 n_gen; + struct btrfs_key key; + u64 bytenr; + u64 num_bytes; + u64 owner = level - 1; + u64 offset = 0; + + /* + * in case of recovery, we could have already finished this + * slot + */ + if (finished_map && droptree_test_bit(i, finished_map)) + goto next_slot; + + if (level == 0) { + struct btrfs_file_extent_item *fi; + + btrfs_item_key_to_cpu(eb, &key, i); + if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY) + goto next_slot; + fi = btrfs_item_ptr(eb, i, + struct btrfs_file_extent_item); + if (btrfs_file_extent_type(eb, fi) =+ BTRFS_FILE_EXTENT_INLINE) + goto next_slot; + bytenr = btrfs_file_extent_disk_bytenr(eb, fi); + if (bytenr == 0) { +next_slot: + spin_lock(&dn->lock); + droptree_set_bit(i, dn->map); + if (droptree_bitmap_full(dn->map, nritems)) { + spin_unlock(&dn->lock); + goto free; + } + spin_unlock(&dn->lock); + continue; + } + num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi); + key.offset -= btrfs_file_extent_offset(eb, fi); + owner = key.objectid; + offset = key.offset; + n_gen = 0; + } else { + btrfs_node_key_to_cpu(eb, &key, i); + bytenr = btrfs_node_blockptr(eb, i); + num_bytes = btrfs_level_size(root, 0); + n_gen = btrfs_node_ptr_generation(eb, i); + } + + if (child_map && child_map[i]) { + child = child_map[i]; + child->generation = n_gen; + } else { + child = droptree_alloc_node(dn->droproot); + BUG_ON(!child); + child->parent = dn; + child->parent_slot = i; + child->level = level - 1; + child->start = bytenr; + child->len = num_bytes; + child->owner = owner; + child->offset = offset; + child->generation = n_gen; + child->convert = dn->convert; + } + ++nrsent; + + /* + * limit the number of outstanding requests for a given level. + * The limit is global to all outstanding snapshot deletions. + * Only requests for levels >= 0 are limited. The level of this + * request is level-1. 
+ */ + if (level > 0) { + mutex_lock(&fs_info->droptree_lock); + if ((fs_info->droptree_pause_req || + (fs_info->droptree_req[level - 1] >+ fs_info->droptree_limit[level - 1]))) { + /* + * limit reached or pause requested, enqueue + * request and get an rc->elem for it + */ + list_add_tail(&child->list, + &fs_info->droptree_queue[level - 1]); + reada_control_elem_get(rc); + mutex_unlock(&fs_info->droptree_lock); + continue; + } + ++fs_info->droptree_req[level - 1]; + mutex_unlock(&fs_info->droptree_lock); + } + if (list_empty(&child->next)) { + spin_lock(&dn->lock); + list_add(&child->next, &dn->children); + spin_unlock(&dn->lock); + } + /* + * this might immediately call back into completion, + * so dn can become invalid in the last round + */ + ret = droptree_reada_dn(child); + BUG_ON(ret); /* can''t back out */ + } + + if (nrsent == 0) { +free: + /* + * this leaf didn''t contain any EXTENT_DATA items. It can''t be + * a node, as nodes are never empty. This point might also be + * reached via the label <free> when we set the last bit our- + * selves. This is possible when all enqueued readas already + * finish while we loop. + * We need to drop our ref and notify our parents, but can''t + * do this in the context of the end_io workers, as it might + * cause disk-I/O causing a deadlock. So kick off to a worker. + */ + dn->work.func = droptree_free_ref; + + /* + * we don''t want our rc to go away right now, as this might + * signal the parent rc before we are done. + */ + reada_control_elem_get(rc); + btrfs_queue_worker(&fs_info->readahead_workers, &dn->work); + } + + kfree(child_map); + kfree(finished_map); + droptree_kick(fs_info); +} + +/* + * worker deferred from droptree_reada_fstree in case the extent didn''t contain + * anything to descend into. Just free this node and notify the parent + */ +static void droptree_free_ref(struct btrfs_work *work) +{ + struct droptree_node *dn; + struct btrfs_trans_handle *trans; + struct reada_control *fs_rc; + struct btrfs_root *root; + struct btrfs_fs_info *fs_info; + + dn = container_of(work, struct droptree_node, work); + fs_rc = dn->droproot->rc; + root = dn->droproot->root; + fs_info = root->fs_info; + + trans = btrfs_join_transaction(fs_info->extent_root); + BUG_ON(!trans); /* can''t back out */ + + droptree_free_up(trans, dn, 1); + + btrfs_end_transaction(trans, fs_info->extent_root); + + /* + * end the sub reada. This might complete the parent. + */ + reada_control_elem_put(fs_rc); +} + +/* + * add a node to readahead. For the top-level node, just add the block to the + * fs-rc, for all other nodes add a sub-reada. + */ +static int droptree_reada_dn(struct droptree_node *dn) +{ + struct btrfs_key ex_start; + struct btrfs_key ex_end; + int ret; + struct droptree_root *dr = dn->droproot; + + ex_start.objectid = dn->start; + ex_start.type = BTRFS_EXTENT_ITEM_KEY; + ex_start.offset = dn->len; + ex_end = ex_start; + ++ex_end.offset; + + if (!dn->parent) + ret = reada_add_block(dr->rc, dn->start, &max_key, + dn->level, dn->generation, dn); + else + ret = btrfs_reada_add(dr->rc, dr->root->fs_info->extent_root, + &ex_start, &ex_end, + droptree_reada_ref, dn, NULL); + + return ret; +} + +/* + * after a restart from a commit, all previously canceled requests need to be + * restarted. 
Also we moved the queued dns to the requeue queue, so move them + * back here + */ +static void droptree_restart(struct btrfs_fs_info *fs_info) +{ + int ret; + struct droptree_node *dn; + + /* + * keep the lock over the whole operation, otherwise the enqueued + * blocks could immediately be handled and the elem count drop to + * zero before we''re done enqueuing + */ + mutex_lock(&fs_info->droptree_lock); + if (fs_info->droptree_pause_req) { + mutex_unlock(&fs_info->droptree_lock); + return; + } + + while (!list_empty(&fs_info->droptree_restart)) { + dn = list_first_entry(&fs_info->droptree_restart, + struct droptree_node, list); + + list_del_init(&dn->list); + + ret = droptree_reada_dn(dn); + BUG_ON(ret); /* can''t back out */ + } + + while (!list_empty(&fs_info->droptree_requeue)) { + dn = list_first_entry(&fs_info->droptree_requeue, + struct droptree_node, list); + list_del_init(&dn->list); + list_add_tail(&dn->list, &fs_info->droptree_queue[dn->level]); + reada_control_elem_get(dn->droproot->rc); + } + + mutex_unlock(&fs_info->droptree_lock); +} + +/* + * for a commit, everything that''s queued in droptree_queue[level] is put + * aside into requeue queue. Also the elem on the parent is given up, allowing + * the count to drop to zero + */ +static void droptree_move_to_requeue(struct btrfs_fs_info *fs_info) +{ + int i; + struct droptree_node *dn; + struct reada_control *rc; + + mutex_lock(&fs_info->droptree_lock); + + for (i = 0; i < BTRFS_MAX_LEVEL; ++i) { + while (!list_empty(fs_info->droptree_queue + i)) { + dn = list_first_entry(fs_info->droptree_queue + i, + struct droptree_node, list); + + list_del_init(&dn->list); + rc = dn->droproot->rc; + list_add_tail(&dn->list, &fs_info->droptree_requeue); + reada_control_elem_put(rc); + } + } + mutex_unlock(&fs_info->droptree_lock); +} + +/* + * check if we have room in readahead at any level and send respective nodes + * to readahead + */ +static void droptree_kick(struct btrfs_fs_info *fs_info) +{ + int i; + int ret; + struct droptree_node *dn; + struct reada_control *rc; + + mutex_lock(&fs_info->droptree_lock); + + for (i = 0; i < BTRFS_MAX_LEVEL; ++i) { + while (!list_empty(fs_info->droptree_queue + i)) { + if (fs_info->droptree_pause_req) { + mutex_unlock(&fs_info->droptree_lock); + droptree_move_to_requeue(fs_info); + return; + } + + if (fs_info->droptree_req[i] >+ fs_info->droptree_limit[i]) + break; + + dn = list_first_entry(fs_info->droptree_queue + i, + struct droptree_node, list); + + list_del_init(&dn->list); + rc = dn->droproot->rc; + + ++fs_info->droptree_req[i]; + mutex_unlock(&fs_info->droptree_lock); + + spin_lock(&dn->parent->lock); + if (list_empty(&dn->next)) + list_add(&dn->next, &dn->parent->children); + spin_unlock(&dn->parent->lock); + + ret = droptree_reada_dn(dn); + BUG_ON(ret); /* can''t back out */ + + /* + * we got an elem on the rc when the dn got enqueued, + * drop it here so elem can go down to zero + */ + reada_control_elem_put(rc); + mutex_lock(&fs_info->droptree_lock); + } + } + mutex_unlock(&fs_info->droptree_lock); +} + +/* + * mark the running droptree as paused and cancel add requests. When this + * returns, droptree is completly paused. 
+ */ +int btrfs_droptree_pause(struct btrfs_fs_info *fs_info) +{ + struct reada_control *top_rc; + + mutex_lock(&fs_info->droptree_lock); + + ++fs_info->droptree_pause_req; + top_rc = fs_info->droptree_rc; + if (top_rc) + kref_get(&top_rc->refcnt); + mutex_unlock(&fs_info->droptree_lock); + + if (top_rc) { + btrfs_reada_abort(fs_info, top_rc); + btrfs_reada_detach(top_rc); /* free our ref */ + } + /* move all queued requests to requeue */ + droptree_move_to_requeue(fs_info); + + mutex_lock(&fs_info->droptree_lock); + while (fs_info->droptrees_running) { + mutex_unlock(&fs_info->droptree_lock); + wait_event(fs_info->droptree_wait, + fs_info->droptrees_running == 0); + mutex_lock(&fs_info->droptree_lock); + } + mutex_unlock(&fs_info->droptree_lock); + + return 0; +} + +void btrfs_droptree_continue(struct btrfs_fs_info *fs_info) +{ + mutex_lock(&fs_info->droptree_lock); + --fs_info->droptree_pause_req; + mutex_unlock(&fs_info->droptree_lock); + + wake_up(&fs_info->droptree_wait); +} + +/* + * find the special inode used to save droptree state. If it doesn''t exist, + * create it. Similar to the free_space_cache inodes this is generated in the + * root tree. + */ +static noinline struct inode *droptree_get_inode(struct btrfs_fs_info *fs_info) +{ + struct btrfs_key location; + struct inode *inode; + struct btrfs_trans_handle *trans; + struct btrfs_root *root = fs_info->tree_root; + struct btrfs_disk_key disk_key; + struct btrfs_inode_item *inode_item; + struct extent_buffer *leaf; + struct btrfs_path *path; + u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC; + int ret = 0; + + location.objectid = BTRFS_DROPTREE_INO_OBJECTID; + location.type = BTRFS_INODE_ITEM_KEY; + location.offset = 0; + + inode = btrfs_iget(fs_info->sb, &location, root, NULL); + if (inode && !IS_ERR(inode)) + return inode; + + path = btrfs_alloc_path(); + if (!path) + return ERR_PTR(-ENOMEM); + inode = NULL; + + /* + * inode does not exist, create it + */ + trans = btrfs_start_transaction(root, 1); + if (!trans) { + btrfs_free_path(path); + return ERR_PTR(-ENOMEM); + } + + ret = btrfs_insert_empty_inode(trans, root, path, + BTRFS_DROPTREE_INO_OBJECTID); + if (ret) + goto out; + + leaf = path->nodes[0]; + inode_item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_inode_item); + btrfs_item_key(leaf, &disk_key, path->slots[0]); + memset_extent_buffer(leaf, 0, (unsigned long)inode_item, + sizeof(*inode_item)); + btrfs_set_inode_generation(leaf, inode_item, trans->transid); + btrfs_set_inode_size(leaf, inode_item, 0); + btrfs_set_inode_nbytes(leaf, inode_item, 0); + btrfs_set_inode_uid(leaf, inode_item, 0); + btrfs_set_inode_gid(leaf, inode_item, 0); + btrfs_set_inode_mode(leaf, inode_item, S_IFREG | 0600); + btrfs_set_inode_flags(leaf, inode_item, flags); + btrfs_set_inode_nlink(leaf, inode_item, 1); + btrfs_set_inode_transid(leaf, inode_item, trans->transid); + btrfs_set_inode_block_group(leaf, inode_item, 0); + btrfs_mark_buffer_dirty(leaf); + btrfs_release_path(path); + + inode = btrfs_iget(fs_info->sb, &location, root, NULL); + +out: + btrfs_free_path(path); + btrfs_end_transaction(trans, root); + + if (IS_ERR(inode)) + return inode; + + return inode; +} + +/* + * basic allocation and initialization of a droptree node + */ +static struct droptree_node *droptree_alloc_node(struct droptree_root *dr) +{ + struct droptree_node *dn; + + dn = kzalloc(sizeof(*dn), GFP_NOFS); + if (!dn) + return NULL; + + dn->droproot = dr; + dn->parent = NULL; + dn->parent_slot = 0; + dn->map = NULL; + dn->nritems = 0; + 
INIT_LIST_HEAD(&dn->children); + INIT_LIST_HEAD(&dn->next); + INIT_LIST_HEAD(&dn->list); + spin_lock_init(&dn->lock); + + return dn; +} + +/* + * add a new top-level node to the list of running droptrees. Allocate the + * necessary droptree_root for it + */ +static struct droptree_root *droptree_add_droproot(struct list_head *list, + struct droptree_node *dn, + struct btrfs_root *root) +{ + struct droptree_root *dr; + + dr = kzalloc(sizeof(*dr), GFP_NOFS); + if (!dr) + return NULL; + + dn->droproot = dr; + dr->root = root; + dr->top_dn = dn; + list_add_tail(&dr->list, list); + + return dr; +} + +/* + * Free a complete droptree + * + * again, recursion would be the easy way, but we have to iterate + * through the tree. While freeing the nodes, also remove them from + * restart/requeue-queue. If they''re not empty after all trees have + * been freed, something is wrong. + */ +static noinline void droptree_free_tree(struct droptree_node *dn) +{ + while (dn) { + if (!list_empty(&dn->children)) { + dn = list_first_entry(&dn->children, + struct droptree_node, next); + } else { + struct droptree_node *parent = dn->parent; + list_del(&dn->next); + /* + * if dn is not enqueued, list is empty and del_init + * changes nothing + */ + list_del_init(&dn->list); + kfree(dn->map); + kfree(dn); + dn = parent; + } + } +} + +/* + * set up the reada_control for a droptree_root and enqueue it so it gets + * started by droptree_restart + */ +static int droptree_reada_root(struct reada_control *top_rc, + struct droptree_root *dr) +{ + struct reada_control *rc; + u64 start; + u64 generation; + int level; + struct extent_buffer *node; + struct btrfs_fs_info *fs_info = dr->root->fs_info; + + rc = btrfs_reada_alloc(top_rc, dr->root, &min_key, &max_key, + droptree_reada_fstree); + if (!rc) + return -ENOMEM; + + dr->rc = rc; + kref_get(&rc->refcnt); + + node = btrfs_root_node(dr->root); + start = node->start; + level = btrfs_header_level(node); + generation = btrfs_header_generation(node); + free_extent_buffer(node); + + dr->top_dn->start = start; + dr->top_dn->level = level; + dr->top_dn->len = btrfs_level_size(dr->root, level); + dr->top_dn->generation = generation; + + mutex_lock(&fs_info->droptree_lock); + ++fs_info->droptree_req[level]; + mutex_unlock(&fs_info->droptree_lock); + + /* + * add this root to the restart queue. It can''t be started immediately, + * as the caller is not yet synchronized with the transaction commit + */ + mutex_lock(&fs_info->droptree_lock); + list_add_tail(&dr->top_dn->list, &fs_info->droptree_restart); + mutex_unlock(&fs_info->droptree_lock); + + return 0; +} + +/* + * write out the state of a tree to the droptree inode. To avoid recursion, + * we do this iteratively using a dynamically allocated stack structure. 
+ */ +static int droptree_save_tree(struct btrfs_fs_info *fs_info, + struct io_ctl *io, + struct droptree_node *dn) +{ + struct { + struct list_head *head; + struct droptree_node *cur; + } *stack, *sp; + struct droptree_node *cur; + int down = 1; + + stack = kmalloc(sizeof(*stack) * BTRFS_MAX_LEVEL, GFP_NOFS); + BUG_ON(!stack); /* can''t back out */ + sp = stack; + sp->head = &dn->next; + sp->cur = dn; + cur = dn; + + while (1) { + if (down && cur->nritems && !cur->convert) { + int i; + int l; + + /* + * write out this node before descending down + */ + BUG_ON(cur->nritems && !cur->map); /* can''t happen */ + io_ctl_set_u16(io, cur->parent_slot); + io_ctl_set_u64(io, cur->start); + io_ctl_set_u64(io, cur->len); + io_ctl_set_u16(io, cur->nritems); + l = DT_BITS_TO_U32(cur->nritems); + + for (i = 0; i < l; ++i) + io_ctl_set_u32(io, cur->map[i]); + } + if (down && !list_empty(&cur->children)) { + /* + * walk down one step + */ + if (cur->level > 0) + io_ctl_set_u16(io, DROPTREE_STATE_GO_DOWN); + ++sp; + sp->head = &cur->children; + cur = list_first_entry(&cur->children, + struct droptree_node, next); + sp->cur = cur; + } else if (cur->next.next != sp->head) { + /* + * step to the side + */ + cur = list_first_entry(&cur->next, + struct droptree_node, next); + sp->cur = cur; + down = 1; + } else if (sp != stack) { + /* + * walk up + */ + if (cur->level >= 0) + io_ctl_set_u16(io, DROPTREE_STATE_GO_UP); + --sp; + cur = sp->cur; + down = 0; + } else { + /* + * done + */ + io_ctl_set_u16(io, DROPTREE_STATE_END); + break; + } + } + kfree(stack); + + return 0; +} + +/* + * write out the full droptree state to disk + */ +static void droptree_save_state(struct btrfs_fs_info *fs_info, + struct inode *inode, + struct btrfs_trans_handle *trans, + struct list_head *droplist) +{ + struct io_ctl io_ctl; + struct droptree_root *dr; + struct btrfs_root *root = fs_info->tree_root; + struct extent_state *cached_state = NULL; + int ret; + + io_ctl_init(&io_ctl, inode, root); + io_ctl.check_crcs = 0; + io_ctl_prepare_pages(&io_ctl, inode, 0); + lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, + i_size_read(inode) - 1, + 0, &cached_state, GFP_NOFS); + + io_ctl_set_u32(&io_ctl, DROPTREE_VERSION); /* version */ + io_ctl_set_u64(&io_ctl, fs_info->generation); /* generation */ + + list_for_each_entry(dr, droplist, list) { + io_ctl_set_u64(&io_ctl, dr->root->root_key.objectid); + io_ctl_set_u64(&io_ctl, dr->root->root_key.offset); + io_ctl_set_u8(&io_ctl, dr->top_dn->level); + ret = droptree_save_tree(fs_info, &io_ctl, dr->top_dn); + BUG_ON(ret); /* can''t back out */ + } + io_ctl_set_u64(&io_ctl, 0); /* terminator */ + + ret = btrfs_dirty_pages(root, inode, io_ctl.pages, + io_ctl.num_pages, + 0, i_size_read(inode), &cached_state); + BUG_ON(ret); /* can''t back out */ + io_ctl_drop_pages(&io_ctl); + unlock_extent_cached(&BTRFS_I(inode)->io_tree, 0, + i_size_read(inode) - 1, &cached_state, + GFP_NOFS); + io_ctl_free(&io_ctl); + + ret = filemap_write_and_wait(inode->i_mapping); + BUG_ON(ret); /* can''t back out */ + + ret = btrfs_update_inode(trans, root, inode); + BUG_ON(ret); /* can''t back out */ +} + +/* + * read the saved state from the droptree inode and prepare everything so + * it gets started by droptree_restart + */ +static int droptree_read_state(struct btrfs_fs_info *fs_info, + struct inode *inode, + struct reada_control *top_rc, + struct list_head *droplist) +{ + struct io_ctl io_ctl; + u32 version; + u64 generation; + struct droptree_node **stack; + int ret = 0; + + stack = kmalloc(sizeof(*stack) * 
BTRFS_MAX_LEVEL, GFP_NOFS); + if (!stack) + return -ENOMEM; + + io_ctl_init(&io_ctl, inode, fs_info->tree_root); + io_ctl.check_crcs = 0; + io_ctl_prepare_pages(&io_ctl, inode, 1); + + version = io_ctl_get_u32(&io_ctl); + if (version != DROPTREE_VERSION) { + printk(KERN_ERR "btrfs: snapshot deletion state has been saved " + "with a different version, ignored\n"); + ret = -EINVAL; + goto out; + } + /* FIXME generation is currently not needed */ + generation = io_ctl_get_u64(&io_ctl); + + while (1) { + struct btrfs_key key; + int ret; + struct btrfs_root *del_root; + struct droptree_root *dr; + int level; + int max_level; + struct droptree_node *root_dn; + + key.objectid = io_ctl_get_u64(&io_ctl); + if (key.objectid == 0) + break; + + key.type = BTRFS_ROOT_ITEM_KEY; + key.offset = io_ctl_get_u64(&io_ctl); + max_level = level = io_ctl_get_u8(&io_ctl); + + BUG_ON(level < 0 || level >= BTRFS_MAX_LEVEL); /* incons. fs */ + del_root = btrfs_read_fs_root_no_radix(fs_info->tree_root, + &key); + if (IS_ERR(del_root)) { + ret = PTR_ERR(del_root); + BUG(); /* inconsistent fs */ + } + + root_dn = droptree_alloc_node(NULL); + /* + * FIXME in this phase is should still be possible to undo + * everything and return a failure. Same goes for the allocation + * failures below + */ + BUG_ON(!root_dn); /* can''t back out */ + dr = droptree_add_droproot(droplist, root_dn, del_root); + BUG_ON(!dr); /* can''t back out */ + + stack[level] = root_dn; + + while (1) { + u64 start; + u64 len; + u64 nritems; + u32 *map; + int n; + int i; + int parent_slot; + struct droptree_node *dn; + + parent_slot = io_ctl_get_u16(&io_ctl); + if (parent_slot == DROPTREE_STATE_GO_UP) { + ++level; + BUG_ON(level > max_level); /* incons. fs */ + continue; + } + if (parent_slot == DROPTREE_STATE_GO_DOWN) { + --level; + BUG_ON(level < 0); /* incons. fs */ + continue; + } + if (parent_slot == DROPTREE_STATE_END) + break; + start = io_ctl_get_u64(&io_ctl); + if (start == 0) + break; + + len = io_ctl_get_u64(&io_ctl); + nritems = io_ctl_get_u16(&io_ctl); + n = DT_BITS_TO_U32(nritems); + BUG_ON(n > 999999); /* incons. fs */ + BUG_ON(n == 0); /* incons. 
fs */ + + map = kmalloc(n * sizeof(u32), GFP_NOFS); + BUG_ON(!map); /* can''t back out */ + + for (i = 0; i < n; ++i) + map[i] = io_ctl_get_u32(&io_ctl); + + if (level == max_level) { + /* only for root node */ + dn = stack[level]; + } else { + dn = droptree_alloc_node(dr); + BUG_ON(!dn); /* can''t back out */ + dn->parent = stack[level + 1]; + dn->parent_slot = parent_slot; + list_add_tail(&dn->next, + &stack[level+1]->children); + stack[level] = dn; + } + dn->level = level; + dn->start = start; + dn->len = len; + dn->map = map; + dn->nritems = nritems; + dn->generation = 0; + } + ret = droptree_reada_root(top_rc, dr); + BUG_ON(ret); /* can''t back out */ + } +out: + io_ctl_drop_pages(&io_ctl); + io_ctl_free(&io_ctl); + kfree(stack); + + return ret; +} + +/* + * called from transaction.c with a list of roots to delete + */ +void droptree_drop_list(struct btrfs_fs_info *fs_info, struct list_head *list) +{ + struct btrfs_root *root = fs_info->tree_root; + struct inode *inode = NULL; + struct btrfs_path *path = NULL; + int ret; + struct btrfs_trans_handle *trans; + u64 alloc_hint = 0; + u64 prealloc; + loff_t oldsize; + long max_nodes; + int i; + struct list_head droplist; + struct droptree_root *dr; + struct droptree_root *dtmp; + int running_roots = 0; + struct reada_control *top_rc = NULL; + + if (btrfs_fs_closing(fs_info)) + return; + + inode = droptree_get_inode(fs_info); + if (IS_ERR(inode)) + goto out; + + path = btrfs_alloc_path(); + if (!path) + goto out; + + /* + * create a dummy reada_control to use as a parent for all trees + */ + top_rc = btrfs_reada_alloc(NULL, root, NULL, NULL, NULL); + if (!top_rc) + goto out; + reada_control_elem_get(top_rc); + INIT_LIST_HEAD(&droplist); + + if (i_size_read(inode) > 0) { + /* read */ + ret = droptree_read_state(fs_info, inode, top_rc, &droplist); + if (ret) + goto out; + list_for_each_entry(dr, &droplist, list) + ++running_roots; + } + mutex_lock(&fs_info->droptree_lock); + BUG_ON(fs_info->droptree_rc); /* can''t happen */ + fs_info->droptree_rc = top_rc; + mutex_unlock(&fs_info->droptree_lock); + + while (1) { + mutex_lock(&fs_info->droptree_lock); + while (fs_info->droptree_pause_req) { + mutex_unlock(&fs_info->droptree_lock); + wait_event(fs_info->droptree_wait, + fs_info->droptree_pause_req == 0 || + btrfs_fs_closing(fs_info)); + if (btrfs_fs_closing(fs_info)) + goto end; + mutex_lock(&fs_info->droptree_lock); + } + ++fs_info->droptrees_running; + mutex_unlock(&fs_info->droptree_lock); + + /* + * 3 for truncation, including inode update + * for each root, we need to delete the root_item afterwards + */ + trans = btrfs_start_transaction(root, 3 + DROPTREE_MAX_ROOTS); + if (!trans) { + btrfs_free_path(path); + iput(inode); + return; + } + + max_nodes = 0; + for (i = 0; i < BTRFS_MAX_LEVEL; ++i) + max_nodes += fs_info->droptree_limit[i]; + + /* + * global header (version(4), generation(8)) + terminator (8) + */ + prealloc = 20; + + /* + * per root overhead: objectid(8), offset(8), level(1) + + * end marker (2) + */ + prealloc += 19 * DROPTREE_MAX_ROOTS; + + /* + * per node: parent slot(2), start(8), len(8), nritems(2) + map + * we add also room for one UP/DOWN per node + */ + prealloc += (22 + + DT_BITS_TO_U32(BTRFS_NODEPTRS_PER_BLOCK(root)) * 4) * + max_nodes; + prealloc = ALIGN(prealloc, PAGE_CACHE_SIZE); + + /* + * preallocate the space, so writing it later on can''t fail + * + * FIXME allocate block reserve instead, to reserve space + * for the truncation? 
*/ + ret = btrfs_delalloc_reserve_space(inode, prealloc); + if (ret) + goto out; + + /* + * from here on, every error is fatal and must prevent the + * current transaction from comitting, as that would leave an + * inconsistent state on disk + */ + oldsize = i_size_read(inode); + if (oldsize > 0) { + BTRFS_I(inode)->generation = 0; + btrfs_i_size_write(inode, 0); + truncate_pagecache(inode, oldsize, 0); + + ret = btrfs_truncate_inode_items(trans, root, inode, 0, + BTRFS_EXTENT_DATA_KEY); + BUG_ON(ret); /* can''t back out */ + + ret = btrfs_update_inode(trans, root, inode); + BUG_ON(ret); /* can''t back out */ + } + /* + * add more roots until we reach the limit + */ + while (running_roots < DROPTREE_MAX_ROOTS && + !list_empty(list)) { + struct btrfs_root *del_root; + struct droptree_node *dn; + + del_root = list_first_entry(list, struct btrfs_root, + root_list); + list_del(&del_root->root_list); + + ret = btrfs_del_orphan_item(trans, root, + del_root->root_key.objectid); + BUG_ON(ret); /* can''t back out */ + dn = droptree_alloc_node(NULL); + BUG_ON(!dn); /* can''t back out */ + dr = droptree_add_droproot(&droplist, dn, del_root); + BUG_ON(!dr); /* can''t back out */ + ret = droptree_reada_root(top_rc, dr); + BUG_ON(ret); /* can''t back out */ + + ++running_roots; + } + + /* + * kick off the already queued jobs from the last pause, + * and all freshly added roots + */ + droptree_restart(fs_info); + droptree_kick(fs_info); + + /* + * wait for all readaheads to finish. a pause will also cause + * the wait to end + */ + list_for_each_entry(dr, &droplist, list) + btrfs_reada_wait(dr->rc); + + /* + * signal droptree_pause that it can continue. We still have + * the trans handle, so the current transaction won''t commit + * until we''ve written the state to disk + */ + mutex_lock(&fs_info->droptree_lock); + --fs_info->droptrees_running; + mutex_unlock(&fs_info->droptree_lock); + wake_up(&fs_info->droptree_wait); + + /* + * collect all finished droptrees + */ + list_for_each_entry_safe(dr, dtmp, &droplist, list) { + struct droptree_node *dn; + int full; + dn = dr->top_dn; + spin_lock(&dn->lock); + full = dn->map && + droptree_bitmap_full(dn->map, dn->nritems); + spin_unlock(&dn->lock); + if (full) { + struct btrfs_root *del_root = dr->root; + + list_del(&dr->list); + ret = btrfs_del_root(trans, fs_info->tree_root, + &del_root->root_key); + BUG_ON(ret); /* can''t back out */ + if (del_root->in_radix) { + btrfs_free_fs_root(fs_info, del_root); + } else { + free_extent_buffer(del_root->node); + free_extent_buffer(del_root-> + commit_root); + kfree(del_root); + } + kfree(dr->top_dn->map); + kfree(dr->top_dn); + kfree(dr); + --running_roots; + } + } + + if (list_empty(&droplist)) { + /* + * nothing in progress. 
Just leave the droptree inode + * at length zero and drop out of the loop + */ + btrfs_delalloc_release_space(inode, prealloc); + btrfs_end_transaction(trans, root); + break; + } + + /* we reserved the space for this above */ + ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, + prealloc, prealloc, + prealloc, &alloc_hint); + BUG_ON(ret); /* can''t back out */ + + droptree_save_state(fs_info, inode, trans, &droplist); + + btrfs_end_transaction(trans, root); + + if (btrfs_fs_closing(fs_info)) { + printk("fs is closing, abort loop\n"); + break; + } + + /* restart loop: create new reada_controls for the roots */ + list_for_each_entry(dr, &droplist, list) { + dr->rc = btrfs_reada_alloc(top_rc, dr->root, + &min_key, &max_key, + droptree_reada_fstree); + /* + * FIXME we could handle the allocation failure in + * principle, as we''re currently in a consistent state + */ + BUG_ON(!dr->rc); /* can''t back out */ + kref_get(&dr->rc->refcnt); + } + } +end: + + /* + * on unmount, we come here although we''re in the middle of a deletion. + * This means there are still allocated dropnodes we have to free. We + * free them by going down all the droptree_roots. + */ + while (!list_empty(&droplist)) { + dr = list_first_entry(&droplist, struct droptree_root, list); + list_del(&dr->list); + droptree_free_tree(dr->top_dn); + if (dr->root->in_radix) { + btrfs_free_fs_root(fs_info, dr->root); + } else { + free_extent_buffer(dr->root->node); + free_extent_buffer(dr->root->commit_root); + kfree(dr->root); + } + kfree(dr); + } + /* + * also delete everything from requeue + */ + while (!list_empty(&fs_info->droptree_requeue)) { + struct droptree_node *dn; + + dn = list_first_entry(&fs_info->droptree_requeue, + struct droptree_node, list); + list_del(&dn->list); + kfree(dn->map); + kfree(dn); + } + /* + * restart queue must be empty by now + */ + BUG_ON(!list_empty(&fs_info->droptree_restart)); /* can''t happen */ +out: + if (path) + btrfs_free_path(path); + if (inode) + iput(inode); + if (top_rc) { + mutex_lock(&fs_info->droptree_lock); + fs_info->droptree_rc = NULL; + mutex_unlock(&fs_info->droptree_lock); + reada_control_elem_put(top_rc); + } +} -- 1.7.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
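
To make the bitmap bookkeeping in the droptree implementation above easier to follow, here is a minimal user-space sketch of the same idea: each node carries a u32 bitmap with one bit per child, a finished child sets its bit in the parent, and a full bitmap propagates the completion one level up. This mirrors droptree_set_bit(), droptree_bitmap_full() and the walk-up loop in droptree_free_up(); all names and types below are simplified stand-ins, not part of the patch.

/*
 * User-space model of the per-node completion bitmaps used by droptree.
 * Not kernel code; identifiers are illustrative only.
 */
#include <stdio.h>
#include <stdint.h>

#define BITS_PER_U32 32

struct node {
	struct node *parent;
	int parent_slot;	/* which bit this node owns in parent->map */
	uint32_t *map;		/* one bit per child, 1 = child fully freed */
	int nritems;		/* number of children / valid bits in map */
};

static void set_bit32(int nr, uint32_t *map)
{
	map[nr / BITS_PER_U32] |= 1U << (nr % BITS_PER_U32);
}

static int bitmap_full32(const uint32_t *map, int bits)
{
	int i;

	for (i = 0; i < bits / BITS_PER_U32; i++)
		if (~map[i])
			return 0;
	if (bits % BITS_PER_U32 &&
	    (~map[i] & ((1U << (bits % BITS_PER_U32)) - 1)))
		return 0;
	return 1;
}

/* a child is done: mark its slot and walk up while parents complete */
static void child_done(struct node *dn)
{
	while (dn->parent) {
		struct node *parent = dn->parent;

		set_bit32(dn->parent_slot, parent->map);
		if (!bitmap_full32(parent->map, parent->nritems))
			break;		/* parent still has open children */
		printf("node with %d children complete, free it\n",
		       parent->nritems);
		dn = parent;		/* notify the next level up */
	}
}

int main(void)
{
	uint32_t root_map[1] = { 0 };
	struct node root = { .parent = NULL, .map = root_map, .nritems = 2 };
	struct node kid[2];
	int i;

	for (i = 0; i < 2; i++) {
		kid[i] = (struct node){ .parent = &root, .parent_slot = i };
		child_done(&kid[i]);	/* second call completes the root */
	}
	return 0;
}

The point of doing it this way in the kernel code is that completion can be detected purely locally: a child only ever touches its parent's bitmap under the parent's spinlock, and the tree above stays allocated until the last child reports in.
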
Update btrfs_clean_old_snapshots to make use of droptree. Snapshots with old backrefs and snapshots which deletion is always in progress are deleted with the old code, all other snapshots deletions use droptree. Some droptree-related debug code is also added to reada.c. Signed-off-by: Arne Jansen <sensille@gmx.net> --- fs/btrfs/disk-io.c | 1 + fs/btrfs/reada.c | 13 +++++++++++++ fs/btrfs/transaction.c | 35 +++++++++++++++++++++++++++++++++-- 3 files changed, 47 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7b3ddd7..b175cfa 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3041,6 +3041,7 @@ int close_ctree(struct btrfs_root *root) btrfs_pause_balance(root->fs_info); btrfs_scrub_cancel(root); + btrfs_droptree_pause(fs_info); /* wait for any defraggers to finish */ wait_event(fs_info->transaction_wait, diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c index 0d88163..69a9409 100644 --- a/fs/btrfs/reada.c +++ b/fs/btrfs/reada.c @@ -1112,6 +1112,19 @@ int btrfs_reada_wait(struct reada_control *rc) dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0); printk(KERN_DEBUG "reada_wait on %p: %d elems\n", rc, atomic_read(&rc->elems)); + mutex_lock(&fs_info->droptree_lock); + + for (i = 0; i < BTRFS_MAX_LEVEL; ++i) { + if (fs_info->droptree_req[i] == 0) + continue; + printk(KERN_DEBUG "droptree req on level %d: %ld out " + "of %ld, queue is %sempty\n", + i, fs_info->droptree_req[i], + fs_info->droptree_limit[i], + list_empty(&fs_info->droptree_queue[i]) ? + "" : "not "); + } + mutex_unlock(&fs_info->droptree_lock); } dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 04b77e3..2d72a7e 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1182,6 +1182,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, spin_unlock(&cur_trans->commit_lock); wake_up(&root->fs_info->transaction_blocked_wait); + ret = btrfs_droptree_pause(root->fs_info); + BUG_ON(ret); + spin_lock(&root->fs_info->trans_lock); if (cur_trans->list.prev != &root->fs_info->trans_list) { prev_trans = list_entry(cur_trans->list.prev, @@ -1363,6 +1366,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, trace_btrfs_transaction_commit(root); btrfs_scrub_continue(root); + btrfs_droptree_continue(root->fs_info); if (current->journal_info == trans) current->journal_info = NULL; @@ -1381,12 +1385,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, int btrfs_clean_old_snapshots(struct btrfs_root *root) { LIST_HEAD(list); + LIST_HEAD(new); struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_root_item *root_item = &root->root_item; spin_lock(&fs_info->trans_lock); list_splice_init(&fs_info->dead_roots, &list); spin_unlock(&fs_info->trans_lock); + /* + * in a first pass, pick out all snapshot deletions that have been + * interrupted from a previous mount on an older kernel that didn''t + * support the droptree version of snapshot deletion. We continue + * it with the old code. 
Also deletions of roots from very old + * filesystems with old-style backrefs will be handled by the old + * code + */ while (!list_empty(&list)) { root = list_entry(list.next, struct btrfs_root, root_list); list_del(&root->root_list); @@ -1394,10 +1408,27 @@ int btrfs_clean_old_snapshots(struct btrfs_root *root) btrfs_kill_all_delayed_nodes(root); if (btrfs_header_backref_rev(root->node) < - BTRFS_MIXED_BACKREF_REV) + BTRFS_MIXED_BACKREF_REV) { btrfs_drop_snapshot(root, NULL, 0, 0); - else + } else if (btrfs_disk_key_objectid(&root_item->drop_progress)) { btrfs_drop_snapshot(root, NULL, 1, 0); + } else { + /* put on list for processing by droptree */ + list_add_tail(&root->root_list, &new); + } } + + droptree_drop_list(fs_info, &new); + while (!list_empty(&new)) { + /* + * if there are any roots left on the list after droptree_drop_list, + * delete them with the old code. This can happen when the + * fs hasn't got enough space left for the droptree inode. + */ + root = list_entry(new.next, struct btrfs_root, root_list); + list_del(&root->root_list); + btrfs_drop_snapshot(root, NULL, 1, 0); + } + return 0; } -- 1.7.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
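
The commit integration in the patch above is easiest to see as a two-counter handshake: the commit path raises droptree_pause_req and waits until droptrees_running drops to zero, while the droptree main loop refuses to start a new round while a pause is requested and signals the wait queue when its round is finished. A rough user-space sketch of that handshake follows, with pthreads standing in for the kernel's mutex and wait queue; all identifiers are illustrative stand-ins for the fs_info fields, not the kernel API.

/*
 * User-space model of the droptree pause/continue handshake around a
 * transaction commit. Not kernel code; names are stand-ins only.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wait_cv = PTHREAD_COND_INITIALIZER;
static int pause_req;		/* models fs_info->droptree_pause_req */
static int droptrees_running;	/* models fs_info->droptrees_running */

/* commit side: request a pause and wait until droptree has quiesced */
static void droptree_pause(void)
{
	pthread_mutex_lock(&lock);
	pause_req++;
	while (droptrees_running)
		pthread_cond_wait(&wait_cv, &lock);
	pthread_mutex_unlock(&lock);
}

static void droptree_continue(void)
{
	pthread_mutex_lock(&lock);
	pause_req--;
	pthread_mutex_unlock(&lock);
	pthread_cond_broadcast(&wait_cv);
}

/* droptree side: one round of deletions between two commits */
static void *droptree_worker(void *arg)
{
	(void)arg;

	pthread_mutex_lock(&lock);
	while (pause_req)		/* don't start while a commit runs */
		pthread_cond_wait(&wait_cv, &lock);
	droptrees_running++;
	pthread_mutex_unlock(&lock);

	/* ... issue readaheads, wait for them, save state to the inode ... */

	pthread_mutex_lock(&lock);
	droptrees_running--;
	pthread_mutex_unlock(&lock);
	pthread_cond_broadcast(&wait_cv);	/* let the commit proceed */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, droptree_worker, NULL);
	droptree_pause();	/* as called from the commit path */
	printf("droptree quiesced, commit may run\n");
	droptree_continue();
	pthread_join(t, NULL);
	return 0;
}

In the real code the commit additionally aborts the top-level reada_control and moves all queued droptree nodes to the requeue list before waiting, so that the wait cannot stall on work that will never be started during the commit.
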
Hi, Arne, (2012/04/13 0:54), Arne Jansen wrote:> This is an implementation of snapshot deletion using the readahead > framework. Multiple snapshots can be deleted at once and the trees > are not enumerated sequentially but in parallel in many branches. > This way readahead can reorder the request to better utilize all > disks. For a more detailed description see inline comments. > > Signed-off-by: Arne Jansen<sensille@gmx.net> > --- > fs/btrfs/Makefile | 2 +- > fs/btrfs/droptree.c | 1916 +++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 1917 insertions(+), 1 deletions(-) > > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > index 0c4fa2b..620d7c8 100644 > --- a/fs/btrfs/Makefile > +++ b/fs/btrfs/Makefile > @@ -8,7 +8,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ > extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ > export.o tree-log.o free-space-cache.o zlib.o lzo.o \ > compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ > - reada.o backref.o ulist.o > + reada.o backref.o ulist.o droptree.o > > btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > diff --git a/fs/btrfs/droptree.c b/fs/btrfs/droptree.c > new file mode 100644 > index 0000000..9bc9c23 > --- /dev/null > +++ b/fs/btrfs/droptree.c > @@ -0,0 +1,1916 @@ > +/* > + * Copyright (C) 2011 STRATO. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public > + * License v2 as published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + * > + * You should have received a copy of the GNU General Public > + * License along with this program; if not, write to the > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, > + * Boston, MA 021110-1307, USA. > + */ > +#include "ctree.h" > +#include "transaction.h" > +#include "disk-io.h" > +#include "locking.h" > +#include "free-space-cache.h" > +#include "print-tree.h" > + > +/* > + * This implements snapshot deletions with the use of the readahead framework. > + * The fs-tree is not tranversed in sequential (key-) order, but by descending > + * into multiple paths at once. In addition, up to DROPTREE_MAX_ROOTS snapshots > + * will be deleted in parallel. > + * The basic principle is as follows: > + * When a tree node is found, first its refcnt and flags are fetched (via > + * readahead) from the extent allocation tree. Simplified, if the refcnt > + * is> 1, other trees also reference this node and we also have to drop our > + * ref to it and are done. If on the other hand the refcnt is 1, it''s our > + * node and we have to free it (the same holds for data extents). So we fetch > + * the actual node (via readahead) and add all nodes/extents it points to to > + * the queue to again fetch refcnts for them. > + * While the case where refcnt> 1 looks like the easy one, there''s one special > + * case to take into account: If the node still uses non-shared refs and we > + * are the owner of the node, we''re not allowed to just drop our ref, as it > + * would leave the node with an unresolvable backref (it points to our tree). 
> + * So we first have to convert the refs to shared refs, and not only for this > + * node, but for its full subtree. We can stop descending if we encounter a > + * node of which we are not owner. > + * One big difference to the old snapshot deletion code that sequentially walks > + * the tree is that we can''t represent the current deletion state with a single > + * key. As we delete in several paths simultaneously, we have to record all > + * loose ends in the state. To not get an exponentially growing state, we don''t > + * delete the refs top-down, but bottom-up. In addition, we restrict the > + * currently processed nodes per-level globally. This way, we have a bounded > + * size for the state and can preallocate the needed space before any work > + * starts. During each transction commit, we write the state to a special > + * inode (DROPTREE_INO) recorded in the root tree. > + * For a commit, all outstanding readahead-requests get cancelled and moved > + * to a restart queue. > + * > + * The central data structure here is the droptree_node (dn). It represents a > + * file system extent, meaning a tree node, tree leaf or a data extent. The > + * dn''s are organized in a tree corresponding to the disk structure. Each dn > + * keeps a bitmap that records which of its children are finished. When the > + * last bit gets set by a child, the freeing of the node is triggered. In > + * addition, struct droptree_root represents a fs tree to delete. It is mainly > + * used to keep all roots in a list. The reada_control for this tree is also > + * recorded here. We don''t keep it in the dn''s, as it is being freed and re- > + * created with each transaction commit. > + */ > + > +#define DROPTREE_MAX_ROOTS 50 > + > +/* > + * constants used in the on-disk state for deletion progress > + */ > +#define DROPTREE_STATE_GO_UP 0xffff > +#define DROPTREE_STATE_GO_DOWN 0xfffe > +#define DROPTREE_STATE_END 0xfffd > + > +/* > + * used in droptree inode > + */ > +#define DROPTREE_VERSION 1 > + > +/* > + * helper struct used by main loop to keep all roots currently being deleted > + * in a list. > + */ > +struct droptree_root { > + struct droptree_node *top_dn; > + struct btrfs_root *root; > + struct reada_control *rc; > + struct list_head list; > +}; > + > +/* > + * this structure contains all information needed to carry a > + * node/leaf/data extent through all phases. Each phases fills > + * in the information it got, so don''t assume all information is > + * available at any time. > + */ > +struct droptree_node { > + /* > + * pointer back to our root > + */ > + struct droptree_root *droproot; > + > + /* > + * where our place in the parent is. parent_slot is the bit number > + * in parent->map > + */ > + struct droptree_node *parent; > + int parent_slot; > + > + /* > + * tree of nodes. Both are protected by lock at bottom. > + */ > + struct list_head next; > + struct list_head children; > + > + /* > + * information about myself > + */ > + u64 start; > + u64 len; > + int level; > + u64 generation; > + u64 flags; > + u64 owner; /* only for data extents */ > + u64 offset; /* only for data extents */ > + > + /* > + * bitmap to record the completion of each child node. Protected > + * by lock at bottom. > + */ > + u32 *map; > + int nritems; > + > + /* > + * the readahead for prefetching the extent tree entry > + */ > + struct reada_control *sub_rc; > + > + /* > + * to get out of end_io worker context into readahead worker context. 
> + * This is always needed if we want to do disk-IO, as otherwise we > + * might end up holding all end_io worker, thus preventing the IO from > + * completion and deadlocking. end_transaction might do disk-IO. > + */ > + struct btrfs_work work; > + > + /* > + * used to queue node either in restart queue or requeue queue. > + * Protected by fs_info->droptree_lock. > + */ > + struct list_head list; > + > + /* > + * convert == 1 means we need to convert the refs in this tree to > + * shared refs. The point where we started the conversion is marked > + * with conversion_point == 1. When we finished conversion, the normal > + * dropping of refs restarts at this point. When we encounter a node > + * that doesn''t need conversion, we mark it with convert_stop == 1. > + * This also means no nodes below this point need conversion. > + */ > + unsigned int convert:1; > + unsigned int conversion_point:1; > + unsigned int convert_stop:1; > + > + /* > + * lock for bitmap and lists > + */ > + spinlock_t lock; > +}; > + > +static struct btrfs_key min_key = { 0 }; > +static struct btrfs_key max_key = { > + .objectid = (u64)-1, > + .type = (u8)-1, > + .offset = (u64)-1 > +}; > + > +static void droptree_fetch_ref(struct btrfs_work *work); > +static void droptree_free_ref(struct btrfs_work *work); > +static void droptree_kick(struct btrfs_fs_info *fs_info); > +static void droptree_free_up(struct btrfs_trans_handle *trans, > + struct droptree_node *dn, int last_ref); > +static void droptree_reada_conv(struct btrfs_root *root, > + struct reada_control *rc, > + u64 wanted_generation, > + struct extent_buffer *eb, > + u64 start, int err, > + struct btrfs_key *top, void *ctx); > +static int droptree_reada_dn(struct droptree_node *dn); > +static struct droptree_node *droptree_alloc_node(struct droptree_root *dr); > +static void droptree_reada_fstree(struct btrfs_root *root, > + struct reada_control *rc, > + u64 wanted_generation, > + struct extent_buffer *eb, > + u64 start, int err, > + struct btrfs_key *top, void *ctx); > + > +/* > + * Bit operations. These are mostly a copy of lib/bitmap.c, with 2 differences: > + * a) it always operates on u32 instead of unsigned long. This makes it easier > + * to save them to disk in a portable format > + * b) the locking has to be provided externally > + */ > +#define DT_BITS_PER_BYTE 8 > +#define DT_BITS_PER_U32 (sizeof(u32) * DT_BITS_PER_BYTE) > +#define DT_BIT_MASK(nr) (1UL<< ((nr) % DT_BITS_PER_U32)) > +#define DT_BIT_WORD(nr) ((nr) / DT_BITS_PER_U32) > +#define DT_BITS_TO_U32(nr) DIV_ROUND_UP(nr, DT_BITS_PER_BYTE * sizeof(u32)) > +#define DT_BITMAP_LAST_WORD_MASK(nbits) \ > +( \ > + ((nbits) % DT_BITS_PER_U32) ? \ > + (1UL<< ((nbits) % DT_BITS_PER_U32))-1 : ~0UL \ > +) > + > +static void droptree_set_bit(int nr, u32 *addr) > +{ > + u32 mask = DT_BIT_MASK(nr); > + u32 *p = ((u32 *)addr) + DT_BIT_WORD(nr); > + > + *p |= mask; > +} > + > +static int droptree_test_bit(int nr, const u32 *addr) > +{ > + return 1UL& (addr[DT_BIT_WORD(nr)]>> (nr& (DT_BITS_PER_U32 - 1))); > +} > + > +static int droptree_bitmap_full(const u32 *addr, int bits) > +{ > + int k; > + int lim = bits / DT_BITS_PER_U32; > + > + for (k = 0; k< lim; ++k) > + if (~addr[k]) > + return 0; > + > + if (bits % DT_BITS_PER_U32) > + if (~addr[k]& DT_BITMAP_LAST_WORD_MASK(bits)) > + return 0; > + > + return 1; > +} > + > +/* > + * This function is called from readahead. > + * Prefetch the extent tree entry for this node so we can later read the > + * refcnt without waiting for disk I/O. 
This readahead is implemented as > + * a sub readahead from the fstree readahead. > + */ > +static void droptree_reada_ref(struct btrfs_root *root, > + struct reada_control *rc, > + u64 wanted_generation, struct extent_buffer *eb, > + u64 start, int err, struct btrfs_key *top, > + void *ctx) > +{ > + int nritems; > + u64 generation = 0; > + int level = 0; > + int i; > + struct btrfs_fs_info *fs_info = root->fs_info; > + struct droptree_node *dn = ctx; > + int nfound = 0; > + int ret; > + > + mutex_lock(&fs_info->droptree_lock); > + if (err == -EAGAIN || fs_info->droptree_pause_req) { > + /* > + * if cancelled, put to restart queue > + */ > + list_add_tail(&dn->list,&fs_info->droptree_restart); > + mutex_unlock(&fs_info->droptree_lock); > + return; > + } > + mutex_unlock(&fs_info->droptree_lock); > + > + if (eb) { > + level = btrfs_header_level(eb); > + generation = btrfs_header_generation(eb); > + if (wanted_generation != generation) > + err = 1; > + } else { > + BUG_ON(!err); /* can''t happen */ > + } > + > + /* > + * either when level 0 is reached and the prefetch is finished, or > + * when an error occurs, we finish this readahead and kick the > + * reading of the actual refcnt to a worker > + */ > + if (err || level == 0) { > +error: > + dn->work.func = droptree_fetch_ref; > + dn->sub_rc = rc; > + > + /* > + * we don''t want our rc to go away right now, as this might > + * signal the parent rc before we are done. In the next stage, > + * we first add a new readahead to the parent and afterwards > + * clear our sub-reada > + */ > + reada_control_elem_get(rc); > + > + /* > + * we have to push the lookup_extent_info out to a worker, > + * as we are currently running in the context of the end_io > + * workers and lookup_extent_info might do a lookup, thus > + * deadlocking > + */ > + btrfs_queue_worker(&fs_info->readahead_workers,&dn->work); > + > + return; > + } > + > + /* > + * level 0 not reached yet, so continue down the tree with reada > + */ > + nritems = btrfs_header_nritems(eb); > + > + for (i = 0; i< nritems; i++) { > + u64 n_gen; > + struct btrfs_key key; > + struct btrfs_key next_key; > + u64 bytenr; > + > + btrfs_node_key_to_cpu(eb,&key, i); > + if (i + 1< nritems) > + btrfs_node_key_to_cpu(eb,&next_key, i + 1); > + else > + next_key = *top; > + bytenr = btrfs_node_blockptr(eb, i); > + n_gen = btrfs_node_ptr_generation(eb, i); > + > + if (btrfs_comp_cpu_keys(&key,&rc->key_end)< 0&& > + btrfs_comp_cpu_keys(&next_key,&rc->key_start)> 0) { > + ret = reada_add_block(rc, bytenr,&next_key, > + level - 1, n_gen, ctx); > + if (ret) > + goto error; > + ++nfound; > + } > + } > + if (nfound> 1) { > + /* > + * just a safeguard, we searched for a single key, so there > + * must not be more than one entry in the node > + */ > + btrfs_print_leaf(rc->root, eb); > + BUG_ON(1); /* inconsistent fs */ > + } > + if (nfound == 0) { > + /* > + * somewhere the tree changed while we were running down. > + * So start over again from the top > + */ > + ret = droptree_reada_dn(dn); > + if (ret) > + goto error; > + } > +} > + > +/* > + * This is called from a worker, kicked off by droptree_reada_ref. We arrive > + * here after either the extent tree is prefetched or an error occured. In > + * any case, the refcnt is read synchronously now, hopefully without disk I/O. > + * If we encounter any hard errors here, we have no chance but to BUG. 
> + */ > +static void droptree_fetch_ref(struct btrfs_work *work) > +{ > + struct droptree_node *dn; > + int ret; > + u64 refs; > + u64 flags; > + struct btrfs_trans_handle *trans; > + struct extent_buffer *eb = NULL; > + struct btrfs_root *root; > + struct btrfs_fs_info *fs_info; > + struct reada_control *sub_rc; > + int free_up = 0; > + int is_locked = 0; > + > + dn = container_of(work, struct droptree_node, work); > + > + root = dn->droproot->root; > + fs_info = root->fs_info; > + sub_rc = dn->sub_rc; > + > + trans = btrfs_join_transaction(fs_info->extent_root); > + BUG_ON(!trans); /* can''t back out */ > + > + ret = btrfs_lookup_extent_info(trans, root, dn->start, dn->len, > + &refs,&flags); > + BUG_ON(ret); /* can''t back out */ > + dn->flags = flags; > + > + if (dn->convert&& dn->level>= 0) { > + eb = btrfs_find_create_tree_block(root, dn->start, dn->len); > + BUG_ON(!eb); /* can''t back out */ > + if (!btrfs_buffer_uptodate(eb, dn->generation)) { > + struct reada_control *conv_rc; > + > +fetch_buf: > + /* > + * we might need to convert the ref. To check this, > + * we need to know the header_owner of the block, so > + * we actually need the block''s content. Just add > + * a sub-reada for the content that points back here > + */ > + free_extent_buffer(eb); > + btrfs_end_transaction(trans, fs_info->extent_root); > + > + conv_rc = btrfs_reada_alloc(dn->droproot->rc, > + root, NULL, NULL, > + droptree_reada_conv); > + ret = reada_add_block(conv_rc, dn->start, NULL, > + dn->level, dn->generation, dn); > + BUG_ON(ret< 0); /* can''t back out */ > + reada_start_machine(fs_info); > + > + return; > + } > + if (btrfs_header_owner(eb) != root->root_key.objectid) { > + /* > + * we''re not the owner of the block, so we can stop > + * converting. No blocks below this will need conversion > + */ > + dn->convert_stop = 1; > + free_up = 1; > + } else { > + /* conversion needed, descend */ > + btrfs_tree_read_lock(eb); > + btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); > + is_locked = 1; > + droptree_reada_fstree(root, dn->droproot->rc, > + dn->generation, eb, dn->start, 0, > + NULL, dn); > + } > + goto out; > + } > + > + if (refs> 1) { > + /* > + * we did the lookup without proper locking. If the refcnt is 1, > + * no locking is needed, as we hold the only reference to > + * the extent. When the refcnt is>1, it can change at any time. > + * To prevent this, we lock either the extent (for a tree > + * block), or the leaf referencing the extent (for a data > + * extent). Afterwards we repeat the lookup. > + */ > + if (dn->level == -1) > + eb = btrfs_find_create_tree_block(root, > + dn->parent->start, > + dn->parent->len); > + else > + eb = btrfs_find_create_tree_block(root, dn->start, > + dn->len); > + BUG_ON(!eb); /* can''t back out */ > + btrfs_tree_read_lock(eb); > + btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); > + is_locked = 1; > + ret = btrfs_lookup_extent_info(trans, root, dn->start, dn->len, > + &refs,&flags); > + BUG_ON(ret); /* can''t back out */ > + dn->flags = flags; > + > + if (refs == 1) { > + /* > + * The additional ref(s) has gone away in the meantime. > + * Now the extent is ours, no need for locking anymore > + */ > + btrfs_tree_read_unlock_blocking(eb); > + free_extent_buffer(eb); > + eb = NULL; > + } else if (!(flags& BTRFS_BLOCK_FLAG_FULL_BACKREF)&& > + dn->level>= 0&& > + !btrfs_buffer_uptodate(eb, dn->generation)) { > + /* > + * we might need to convert the ref. 
To check this, > + * we need to know the header_owner of the block, so > + * we actually need the block''s contents. Just add > + * a sub-reada for the content that points back here > + */ > + btrfs_tree_read_unlock_blocking(eb); > + > + goto fetch_buf; > + } > + } > + > + /* > + * See if we have to convert the ref to a shared ref before we drop > + * our ref. Above we have made sure the buffer is uptodate. > + */ > + if (refs> 1&& dn->level>= 0&& > + !(flags& BTRFS_BLOCK_FLAG_FULL_BACKREF)&& > + btrfs_header_owner(eb) == root->root_key.objectid) { > + dn->convert = 1; > + /* when done with the conversion, switch to freeing again */ > + dn->conversion_point = 1; > + > + droptree_reada_fstree(root, dn->droproot->rc, > + dn->generation, eb, dn->start, 0, > + NULL, dn); > + } else if (refs == 1&& dn->level>= 0) { > + /* > + * refcnt is 1, descend into lower levels > + */ > + ret = reada_add_block(dn->droproot->rc, dn->start, NULL, > + dn->level, dn->generation, dn); > + WARN_ON(ret< 0); > + reada_start_machine(fs_info); > + } else { > + /* > + * either refcnt is>1, or we''ve reached the bottom of the > + * tree. In any case, drop our reference > + */ > + free_up = 1; > + } > +out: > + if (eb) { > + if (is_locked) > + btrfs_tree_read_unlock_blocking(eb); > + free_extent_buffer(eb); > + } > + > + if (free_up) { > + /* > + * mark node as done in parent. This ends the lifecyle of dn > + */ > + droptree_free_up(trans, dn, refs == 1); > + } > + > + btrfs_end_transaction(trans, fs_info->extent_root); > + /* > + * end the sub reada. This might complete the parent. > + */ > + reada_control_elem_put(sub_rc); > +} > + > +/* > + * mark the slot in the parent as done. This might also complete the parent, > + * so walk the tree up as long as nodes complete > + * > + * can''t be called from end_io worker context, as it needs a transaction > + */ > +static noinline void droptree_free_up(struct btrfs_trans_handle *trans, > + struct droptree_node *dn, int last_ref) > +{ > + struct btrfs_root *root = dn->droproot->root; > + struct btrfs_fs_info *fs_info = root->fs_info; > + > + while (dn) { > + struct droptree_node *parent = dn->parent; > + int slot = dn->parent_slot; > + u64 parent_start = 0; > + int ret; > + > + if (parent&& parent->flags& BTRFS_BLOCK_FLAG_FULL_BACKREF) > + parent_start = parent->start; > + > + if (dn->level>= 0) { > + mutex_lock(&fs_info->droptree_lock); > + --fs_info->droptree_req[dn->level]; > + mutex_unlock(&fs_info->droptree_lock); > + } > + > + if (dn->convert) { > + if (dn->level>= 0&& > + !(dn->flags& BTRFS_BLOCK_FLAG_FULL_BACKREF)&& > + !dn->convert_stop) { > + struct extent_buffer *eb; > + eb = read_tree_block(root, dn->start, dn->len, > + dn->generation); > + BUG_ON(!eb); /* can''t back out */ > + btrfs_tree_lock(eb); > + btrfs_set_lock_blocking(eb); > + ret = btrfs_inc_ref(trans, root, eb, 1, 1); > + BUG_ON(ret); /* can''t back out */ > + ret = btrfs_dec_ref(trans, root, eb, 0, 1); > + BUG_ON(ret); /* can''t back out */ > + ret = btrfs_set_disk_extent_flags(trans, > + root, eb->start, eb->len, > + BTRFS_BLOCK_FLAG_FULL_BACKREF, > + 0); > + BUG_ON(ret); /* can''t back out */ > + btrfs_tree_unlock(eb); > + free_extent_buffer(eb); > + dn->flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF; > + } > + dn->convert = 0; > + if (dn->conversion_point) { > + /* > + * start over again for this node, clean it > + * and enqueue it again > + */ > + dn->conversion_point = 0; > + > + kfree(dn->map); > + dn->map = NULL; > + dn->nritems = 0; > + > + /* > + * just add to the list and let droptree_kick > + * do 
the actual work of enqueueing > + */ > + mutex_lock(&fs_info->droptree_lock); > + list_add_tail(&dn->list, > + &fs_info->droptree_queue[dn->level]); > + reada_control_elem_get(dn->droproot->rc); > + mutex_unlock(&fs_info->droptree_lock); > + > + goto out; > + } > + } else if (last_ref&& dn->level>= 0) { > + struct extent_buffer *eb; > + > + /* > + * btrfs_free_tree_block needs to read the block to > + * act on the owner recorded in the header. We have > + * read the block some time ago, so hopefully it is > + * still in the cache > + */ > + eb = read_tree_block(root, dn->start, dn->len, > + dn->generation); > + BUG_ON(!eb); /* can''t back out */ > + btrfs_free_tree_block(trans, root, eb, > + parent_start, 1, 0); > + free_extent_buffer(eb); > + } else { > + btrfs_free_extent(trans, root, dn->start, dn->len, > + parent_start, > + root->root_key.objectid, > + dn->owner, dn->offset, 0); > + } > + > + if (!parent) > + break; > + > + /* > + * this free is after the parent check, as we don''t want to > + * free up the top level node. The main loop uses dn->map > + * as an indication if the tree is done. > + */ > + spin_lock(&parent->lock); > + list_del(&dn->next); > + spin_unlock(&parent->lock); > + kfree(dn->map); > + kfree(dn); > + > + /* > + * notify up > + */ > + spin_lock(&parent->lock); > + droptree_set_bit(slot, parent->map); > + if (!droptree_bitmap_full(parent->map, parent->nritems)) { > + spin_unlock(&parent->lock); > + break; > + } > + spin_unlock(&parent->lock); > + > + dn = parent; > + last_ref = 1; > + } > + > +out: > + droptree_kick(fs_info); > +} > + > +/* > + * this callback is used when we need the actual eb to decide whether to > + * convert the refs for this node or not. It just loops back to > + * droptree_reada_fetch_ref > + */ > +static void droptree_reada_conv(struct btrfs_root *root, > + struct reada_control *rc, > + u64 wanted_generation, > + struct extent_buffer *eb, > + u64 start, int err, > + struct btrfs_key *top, void *ctx) > +{ > + struct droptree_node *dn = ctx; > + struct btrfs_fs_info *fs_info = root->fs_info; > + > + if (err == -EAGAIN) { > + /* > + * we''re still in the process of fetching the refs. > + * As we want to start over cleanly after the commit, > + * we also have to give up the sub_rc > + */ > + reada_control_elem_put(dn->sub_rc); > + > + mutex_lock(&fs_info->droptree_lock); > + list_add_tail(&dn->list,&fs_info->droptree_restart); > + mutex_unlock(&fs_info->droptree_lock); > + return; > + } > + > + if (err || eb == NULL) > + BUG(); /* can''t back out */ > + > + /* not yet done with the conversion stage, go back to step 2 */ > + btrfs_queue_worker(&fs_info->readahead_workers,&dn->work); > + > + droptree_kick(fs_info); > +} > + > +/* > + * After having fetched the refcnt for a node and decided we have to descend > + * into it, we arrive here. Called from reada for the actual extent. > + * The main idea is to find all pointers to lower nodes and add them to reada. 
> + */ > +static void droptree_reada_fstree(struct btrfs_root *root, > + struct reada_control *rc, > + u64 wanted_generation, > + struct extent_buffer *eb, > + u64 start, int err, > + struct btrfs_key *top, void *ctx) > +{ > + int nritems; > + u64 generation; > + int level; > + int i; > + struct droptree_node *dn = ctx; > + struct droptree_node *child; > + struct btrfs_fs_info *fs_info = root->fs_info; > + struct droptree_node **child_map = NULL; > + u32 *finished_map = NULL; > + int nrsent = 0; > + int ret; > + > + if (err == -EAGAIN) { > + mutex_lock(&fs_info->droptree_lock); > + list_add_tail(&dn->list,&fs_info->droptree_restart); > + mutex_unlock(&fs_info->droptree_lock); > + return; > + } > + > + if (err || eb == NULL) { > + /* > + * FIXME we can''t deal with I/O errors here. One possibility > + * would to abandon the subtree and just leave it allocated, > + * wasting the space. Another way would be to turn the fs > + * readonly. > + */ > + BUG(); /* can''t back out */ > + } > + > + level = btrfs_header_level(eb); > + nritems = btrfs_header_nritems(eb); > + generation = btrfs_header_generation(eb); > + > + if (wanted_generation != generation) { > + /* > + * the fstree is supposed to be static, as it is inaccessible > + * from user space. So if there''s a generation mismatch here, > + * something has gone wrong. > + */ > + BUG(); /* can''t back out */ > + } > + > + /* > + * allocate a bitmap if we don''t already have one. In case we restart > + * a snapshot deletion after a mount, the map already contains completed > + * slots. If we have the map, we put it aside for the moment and replace > + * it with a zero-filled map. During the loop, we repopulate it. If we > + * wouldn''t do that, we might end up with a dn already being freed > + * by completed children that got enqueued during the loop. This way > + * we make sure the dn might only be freed during the last round. > + */ > + if (dn->map) { > + struct droptree_node *it; > + /* > + * we are in a restore. build a map of all child nodes that > + * are already present > + */ > + child_map = kzalloc(nritems * sizeof(struct droptree_node), > + GFP_NOFS); > + BUG_ON(!child_map); /* can''t back out */ > + BUG_ON(nritems != dn->nritems); /* inconsistent fs */ > + list_for_each_entry(it,&dn->children, next) { > + BUG_ON(it->parent_slot< 0 || > + it->parent_slot>= nritems); /* incons. 
fs */ > + child_map[it->parent_slot] = it; > + } > + finished_map = dn->map; > + dn->map = NULL; > + } > + dn->map = kzalloc(DT_BITS_TO_U32(nritems) * sizeof(u32), GFP_NOFS); > + dn->nritems = nritems; > + > + /* > + * fetch refs for all lower nodes > + */ > + for (i = 0; i< nritems; i++) { > + u64 n_gen; > + struct btrfs_key key; > + u64 bytenr; > + u64 num_bytes; > + u64 owner = level - 1; > + u64 offset = 0; > + > + /* > + * in case of recovery, we could have already finished this > + * slot > + */ > + if (finished_map&& droptree_test_bit(i, finished_map)) > + goto next_slot; > + > + if (level == 0) { > + struct btrfs_file_extent_item *fi; > + > + btrfs_item_key_to_cpu(eb,&key, i); > + if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY) > + goto next_slot; > + fi = btrfs_item_ptr(eb, i, > + struct btrfs_file_extent_item); > + if (btrfs_file_extent_type(eb, fi) => + BTRFS_FILE_EXTENT_INLINE) > + goto next_slot; > + bytenr = btrfs_file_extent_disk_bytenr(eb, fi); > + if (bytenr == 0) { > +next_slot: > + spin_lock(&dn->lock); > + droptree_set_bit(i, dn->map); > + if (droptree_bitmap_full(dn->map, nritems)) { > + spin_unlock(&dn->lock); > + goto free; > + } > + spin_unlock(&dn->lock); > + continue; > + } > + num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi); > + key.offset -= btrfs_file_extent_offset(eb, fi); > + owner = key.objectid; > + offset = key.offset; > + n_gen = 0; > + } else { > + btrfs_node_key_to_cpu(eb,&key, i); > + bytenr = btrfs_node_blockptr(eb, i); > + num_bytes = btrfs_level_size(root, 0); > + n_gen = btrfs_node_ptr_generation(eb, i); > + } > + > + if (child_map&& child_map[i]) { > + child = child_map[i]; > + child->generation = n_gen; > + } else { > + child = droptree_alloc_node(dn->droproot); > + BUG_ON(!child); > + child->parent = dn; > + child->parent_slot = i; > + child->level = level - 1; > + child->start = bytenr; > + child->len = num_bytes; > + child->owner = owner; > + child->offset = offset; > + child->generation = n_gen; > + child->convert = dn->convert; > + } > + ++nrsent; > + > + /* > + * limit the number of outstanding requests for a given level. > + * The limit is global to all outstanding snapshot deletions. > + * Only requests for levels>= 0 are limited. The level of this > + * request is level-1. > + */ > + if (level> 0) { > + mutex_lock(&fs_info->droptree_lock); > + if ((fs_info->droptree_pause_req || > + (fs_info->droptree_req[level - 1]>> + fs_info->droptree_limit[level - 1]))) { > + /* > + * limit reached or pause requested, enqueue > + * request and get an rc->elem for it > + */ > + list_add_tail(&child->list, > + &fs_info->droptree_queue[level - 1]); > + reada_control_elem_get(rc); > + mutex_unlock(&fs_info->droptree_lock); > + continue; > + } > + ++fs_info->droptree_req[level - 1]; > + mutex_unlock(&fs_info->droptree_lock); > + } > + if (list_empty(&child->next)) { > + spin_lock(&dn->lock); > + list_add(&child->next,&dn->children); > + spin_unlock(&dn->lock); > + } > + /* > + * this might immediately call back into completion, > + * so dn can become invalid in the last round > + */ > + ret = droptree_reada_dn(child); > + BUG_ON(ret); /* can''t back out */ > + } > + > + if (nrsent == 0) { > +free: > + /* > + * this leaf didn''t contain any EXTENT_DATA items. It can''t be > + * a node, as nodes are never empty. This point might also be > + * reached via the label<free> when we set the last bit our- > + * selves. This is possible when all enqueued readas already > + * finish while we loop. 
> + * We need to drop our ref and notify our parents, but can''t > + * do this in the context of the end_io workers, as it might > + * cause disk-I/O causing a deadlock. So kick off to a worker. > + */ > + dn->work.func = droptree_free_ref; > + > + /* > + * we don''t want our rc to go away right now, as this might > + * signal the parent rc before we are done. > + */ > + reada_control_elem_get(rc); > + btrfs_queue_worker(&fs_info->readahead_workers,&dn->work); > + } > + > + kfree(child_map); > + kfree(finished_map); > + droptree_kick(fs_info); > +} > + > +/* > + * worker deferred from droptree_reada_fstree in case the extent didn''t contain > + * anything to descend into. Just free this node and notify the parent > + */ > +static void droptree_free_ref(struct btrfs_work *work) > +{ > + struct droptree_node *dn; > + struct btrfs_trans_handle *trans; > + struct reada_control *fs_rc; > + struct btrfs_root *root; > + struct btrfs_fs_info *fs_info; > + > + dn = container_of(work, struct droptree_node, work); > + fs_rc = dn->droproot->rc; > + root = dn->droproot->root; > + fs_info = root->fs_info; > + > + trans = btrfs_join_transaction(fs_info->extent_root); > + BUG_ON(!trans); /* can''t back out */ > + > + droptree_free_up(trans, dn, 1); > + > + btrfs_end_transaction(trans, fs_info->extent_root); > + > + /* > + * end the sub reada. This might complete the parent. > + */ > + reada_control_elem_put(fs_rc); > +} > + > +/* > + * add a node to readahead. For the top-level node, just add the block to the > + * fs-rc, for all other nodes add a sub-reada. > + */ > +static int droptree_reada_dn(struct droptree_node *dn) > +{ > + struct btrfs_key ex_start; > + struct btrfs_key ex_end; > + int ret; > + struct droptree_root *dr = dn->droproot; > + > + ex_start.objectid = dn->start; > + ex_start.type = BTRFS_EXTENT_ITEM_KEY; > + ex_start.offset = dn->len; > + ex_end = ex_start; > + ++ex_end.offset; > + > + if (!dn->parent) > + ret = reada_add_block(dr->rc, dn->start,&max_key, > + dn->level, dn->generation, dn); > + else > + ret = btrfs_reada_add(dr->rc, dr->root->fs_info->extent_root, > + &ex_start,&ex_end, > + droptree_reada_ref, dn, NULL); > + > + return ret; > +} > + > +/* > + * after a restart from a commit, all previously canceled requests need to be > + * restarted. 
Also we moved the queued dns to the requeue queue, so move them > + * back here > + */ > +static void droptree_restart(struct btrfs_fs_info *fs_info) > +{ > + int ret; > + struct droptree_node *dn; > + > + /* > + * keep the lock over the whole operation, otherwise the enqueued > + * blocks could immediately be handled and the elem count drop to > + * zero before we''re done enqueuing > + */ > + mutex_lock(&fs_info->droptree_lock); > + if (fs_info->droptree_pause_req) { > + mutex_unlock(&fs_info->droptree_lock); > + return; > + } > + > + while (!list_empty(&fs_info->droptree_restart)) { > + dn = list_first_entry(&fs_info->droptree_restart, > + struct droptree_node, list); > + > + list_del_init(&dn->list); > + > + ret = droptree_reada_dn(dn); > + BUG_ON(ret); /* can''t back out */ > + } > + > + while (!list_empty(&fs_info->droptree_requeue)) { > + dn = list_first_entry(&fs_info->droptree_requeue, > + struct droptree_node, list); > + list_del_init(&dn->list); > + list_add_tail(&dn->list,&fs_info->droptree_queue[dn->level]); > + reada_control_elem_get(dn->droproot->rc); > + } > + > + mutex_unlock(&fs_info->droptree_lock); > +} > + > +/* > + * for a commit, everything that''s queued in droptree_queue[level] is put > + * aside into requeue queue. Also the elem on the parent is given up, allowing > + * the count to drop to zero > + */ > +static void droptree_move_to_requeue(struct btrfs_fs_info *fs_info) > +{ > + int i; > + struct droptree_node *dn; > + struct reada_control *rc; > + > + mutex_lock(&fs_info->droptree_lock); > + > + for (i = 0; i< BTRFS_MAX_LEVEL; ++i) { > + while (!list_empty(fs_info->droptree_queue + i)) { > + dn = list_first_entry(fs_info->droptree_queue + i, > + struct droptree_node, list); > + > + list_del_init(&dn->list); > + rc = dn->droproot->rc; > + list_add_tail(&dn->list,&fs_info->droptree_requeue); > + reada_control_elem_put(rc); > + } > + } > + mutex_unlock(&fs_info->droptree_lock); > +} > + > +/* > + * check if we have room in readahead at any level and send respective nodes > + * to readahead > + */ > +static void droptree_kick(struct btrfs_fs_info *fs_info) > +{ > + int i; > + int ret; > + struct droptree_node *dn; > + struct reada_control *rc; > + > + mutex_lock(&fs_info->droptree_lock); > + > + for (i = 0; i< BTRFS_MAX_LEVEL; ++i) { > + while (!list_empty(fs_info->droptree_queue + i)) { > + if (fs_info->droptree_pause_req) { > + mutex_unlock(&fs_info->droptree_lock); > + droptree_move_to_requeue(fs_info); > + return; > + } > + > + if (fs_info->droptree_req[i]>> + fs_info->droptree_limit[i]) > + break; > + > + dn = list_first_entry(fs_info->droptree_queue + i, > + struct droptree_node, list); > + > + list_del_init(&dn->list); > + rc = dn->droproot->rc; > + > + ++fs_info->droptree_req[i]; > + mutex_unlock(&fs_info->droptree_lock); > + > + spin_lock(&dn->parent->lock); > + if (list_empty(&dn->next)) > + list_add(&dn->next,&dn->parent->children); > + spin_unlock(&dn->parent->lock); > + > + ret = droptree_reada_dn(dn); > + BUG_ON(ret); /* can''t back out */ > + > + /* > + * we got an elem on the rc when the dn got enqueued, > + * drop it here so elem can go down to zero > + */ > + reada_control_elem_put(rc); > + mutex_lock(&fs_info->droptree_lock); > + } > + } > + mutex_unlock(&fs_info->droptree_lock); > +} > + > +/* > + * mark the running droptree as paused and cancel add requests. When this > + * returns, droptree is completly paused. 
> + */ > +int btrfs_droptree_pause(struct btrfs_fs_info *fs_info) > +{ > + struct reada_control *top_rc; > + > + mutex_lock(&fs_info->droptree_lock); > + > + ++fs_info->droptree_pause_req; > + top_rc = fs_info->droptree_rc; > + if (top_rc) > + kref_get(&top_rc->refcnt); > + mutex_unlock(&fs_info->droptree_lock); > + > + if (top_rc) { > + btrfs_reada_abort(fs_info, top_rc); > + btrfs_reada_detach(top_rc); /* free our ref */ > + } > + /* move all queued requests to requeue */ > + droptree_move_to_requeue(fs_info); > + > + mutex_lock(&fs_info->droptree_lock); > + while (fs_info->droptrees_running) { > + mutex_unlock(&fs_info->droptree_lock); > + wait_event(fs_info->droptree_wait, > + fs_info->droptrees_running == 0); > + mutex_lock(&fs_info->droptree_lock); > + } > + mutex_unlock(&fs_info->droptree_lock); > + > + return 0; > +} > + > +void btrfs_droptree_continue(struct btrfs_fs_info *fs_info) > +{ > + mutex_lock(&fs_info->droptree_lock); > + --fs_info->droptree_pause_req; > + mutex_unlock(&fs_info->droptree_lock); > + > + wake_up(&fs_info->droptree_wait); > +} > + > +/* > + * find the special inode used to save droptree state. If it doesn''t exist, > + * create it. Similar to the free_space_cache inodes this is generated in the > + * root tree. > + */ > +static noinline struct inode *droptree_get_inode(struct btrfs_fs_info *fs_info) > +{ > + struct btrfs_key location; > + struct inode *inode; > + struct btrfs_trans_handle *trans; > + struct btrfs_root *root = fs_info->tree_root; > + struct btrfs_disk_key disk_key; > + struct btrfs_inode_item *inode_item; > + struct extent_buffer *leaf; > + struct btrfs_path *path; > + u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC; > + int ret = 0; > + > + location.objectid = BTRFS_DROPTREE_INO_OBJECTID; > + location.type = BTRFS_INODE_ITEM_KEY; > + location.offset = 0; > + > + inode = btrfs_iget(fs_info->sb,&location, root, NULL); > + if (inode&& !IS_ERR(inode)) > + return inode; > + > + path = btrfs_alloc_path(); > + if (!path) > + return ERR_PTR(-ENOMEM); > + inode = NULL; > + > + /* > + * inode does not exist, create it > + */ > + trans = btrfs_start_transaction(root, 1); > + if (!trans) { > + btrfs_free_path(path); > + return ERR_PTR(-ENOMEM); > + } > + > + ret = btrfs_insert_empty_inode(trans, root, path, > + BTRFS_DROPTREE_INO_OBJECTID);> + if (ret) > + goto out;If goto out here, NULL is returned to the caller. 
So, I think that it should check it by using IS_ERR_OR_NULL() at the caller.> + > + leaf = path->nodes[0]; > + inode_item = btrfs_item_ptr(leaf, path->slots[0], > + struct btrfs_inode_item); > + btrfs_item_key(leaf,&disk_key, path->slots[0]); > + memset_extent_buffer(leaf, 0, (unsigned long)inode_item, > + sizeof(*inode_item)); > + btrfs_set_inode_generation(leaf, inode_item, trans->transid); > + btrfs_set_inode_size(leaf, inode_item, 0); > + btrfs_set_inode_nbytes(leaf, inode_item, 0); > + btrfs_set_inode_uid(leaf, inode_item, 0); > + btrfs_set_inode_gid(leaf, inode_item, 0); > + btrfs_set_inode_mode(leaf, inode_item, S_IFREG | 0600); > + btrfs_set_inode_flags(leaf, inode_item, flags); > + btrfs_set_inode_nlink(leaf, inode_item, 1); > + btrfs_set_inode_transid(leaf, inode_item, trans->transid); > + btrfs_set_inode_block_group(leaf, inode_item, 0); > + btrfs_mark_buffer_dirty(leaf); > + btrfs_release_path(path); > + > + inode = btrfs_iget(fs_info->sb,&location, root, NULL); > + > +out: > + btrfs_free_path(path); > + btrfs_end_transaction(trans, root); > +> + if (IS_ERR(inode)) > + return inode;I think this is unnecessary.> + > + return inode; > +} > + > +/* > + * basic allocation and initialization of a droptree node > + */ > +static struct droptree_node *droptree_alloc_node(struct droptree_root *dr) > +{ > + struct droptree_node *dn; > + > + dn = kzalloc(sizeof(*dn), GFP_NOFS); > + if (!dn) > + return NULL; > + > + dn->droproot = dr; > + dn->parent = NULL; > + dn->parent_slot = 0; > + dn->map = NULL; > + dn->nritems = 0; > + INIT_LIST_HEAD(&dn->children); > + INIT_LIST_HEAD(&dn->next); > + INIT_LIST_HEAD(&dn->list); > + spin_lock_init(&dn->lock); > + > + return dn; > +} > + > +/* > + * add a new top-level node to the list of running droptrees. Allocate the > + * necessary droptree_root for it > + */ > +static struct droptree_root *droptree_add_droproot(struct list_head *list, > + struct droptree_node *dn, > + struct btrfs_root *root) > +{ > + struct droptree_root *dr; > + > + dr = kzalloc(sizeof(*dr), GFP_NOFS); > + if (!dr) > + return NULL; > + > + dn->droproot = dr; > + dr->root = root; > + dr->top_dn = dn; > + list_add_tail(&dr->list, list); > + > + return dr; > +} > + > +/* > + * Free a complete droptree > + * > + * again, recursion would be the easy way, but we have to iterate > + * through the tree. While freeing the nodes, also remove them from > + * restart/requeue-queue. If they''re not empty after all trees have > + * been freed, something is wrong. 
> + */ > +static noinline void droptree_free_tree(struct droptree_node *dn) > +{ > + while (dn) { > + if (!list_empty(&dn->children)) { > + dn = list_first_entry(&dn->children, > + struct droptree_node, next); > + } else { > + struct droptree_node *parent = dn->parent; > + list_del(&dn->next); > + /* > + * if dn is not enqueued, list is empty and del_init > + * changes nothing > + */ > + list_del_init(&dn->list); > + kfree(dn->map); > + kfree(dn); > + dn = parent; > + } > + } > +} > + > +/* > + * set up the reada_control for a droptree_root and enqueue it so it gets > + * started by droptree_restart > + */ > +static int droptree_reada_root(struct reada_control *top_rc, > + struct droptree_root *dr) > +{ > + struct reada_control *rc; > + u64 start; > + u64 generation; > + int level; > + struct extent_buffer *node; > + struct btrfs_fs_info *fs_info = dr->root->fs_info; > + > + rc = btrfs_reada_alloc(top_rc, dr->root,&min_key,&max_key, > + droptree_reada_fstree); > + if (!rc) > + return -ENOMEM; > + > + dr->rc = rc; > + kref_get(&rc->refcnt); > + > + node = btrfs_root_node(dr->root); > + start = node->start; > + level = btrfs_header_level(node); > + generation = btrfs_header_generation(node); > + free_extent_buffer(node); > + > + dr->top_dn->start = start; > + dr->top_dn->level = level; > + dr->top_dn->len = btrfs_level_size(dr->root, level); > + dr->top_dn->generation = generation; > + > + mutex_lock(&fs_info->droptree_lock); > + ++fs_info->droptree_req[level]; > + mutex_unlock(&fs_info->droptree_lock); > + > + /* > + * add this root to the restart queue. It can''t be started immediately, > + * as the caller is not yet synchronized with the transaction commit > + */ > + mutex_lock(&fs_info->droptree_lock); > + list_add_tail(&dr->top_dn->list,&fs_info->droptree_restart); > + mutex_unlock(&fs_info->droptree_lock); > + > + return 0; > +} > + > +/* > + * write out the state of a tree to the droptree inode. To avoid recursion, > + * we do this iteratively using a dynamically allocated stack structure. 
> + */ > +static int droptree_save_tree(struct btrfs_fs_info *fs_info, > + struct io_ctl *io, > + struct droptree_node *dn) > +{ > + struct { > + struct list_head *head; > + struct droptree_node *cur; > + } *stack, *sp; > + struct droptree_node *cur; > + int down = 1; > + > + stack = kmalloc(sizeof(*stack) * BTRFS_MAX_LEVEL, GFP_NOFS); > + BUG_ON(!stack); /* can''t back out */ > + sp = stack; > + sp->head =&dn->next; > + sp->cur = dn; > + cur = dn; > + > + while (1) { > + if (down&& cur->nritems&& !cur->convert) { > + int i; > + int l; > + > + /* > + * write out this node before descending down > + */ > + BUG_ON(cur->nritems&& !cur->map); /* can''t happen */ > + io_ctl_set_u16(io, cur->parent_slot); > + io_ctl_set_u64(io, cur->start); > + io_ctl_set_u64(io, cur->len); > + io_ctl_set_u16(io, cur->nritems); > + l = DT_BITS_TO_U32(cur->nritems); > + > + for (i = 0; i< l; ++i) > + io_ctl_set_u32(io, cur->map[i]); > + } > + if (down&& !list_empty(&cur->children)) { > + /* > + * walk down one step > + */ > + if (cur->level> 0) > + io_ctl_set_u16(io, DROPTREE_STATE_GO_DOWN); > + ++sp; > + sp->head =&cur->children; > + cur = list_first_entry(&cur->children, > + struct droptree_node, next); > + sp->cur = cur; > + } else if (cur->next.next != sp->head) { > + /* > + * step to the side > + */ > + cur = list_first_entry(&cur->next, > + struct droptree_node, next); > + sp->cur = cur; > + down = 1; > + } else if (sp != stack) { > + /* > + * walk up > + */ > + if (cur->level>= 0) > + io_ctl_set_u16(io, DROPTREE_STATE_GO_UP); > + --sp; > + cur = sp->cur; > + down = 0; > + } else { > + /* > + * done > + */ > + io_ctl_set_u16(io, DROPTREE_STATE_END); > + break; > + } > + } > + kfree(stack); > + > + return 0; > +} > + > +/* > + * write out the full droptree state to disk > + */ > +static void droptree_save_state(struct btrfs_fs_info *fs_info, > + struct inode *inode, > + struct btrfs_trans_handle *trans, > + struct list_head *droplist) > +{ > + struct io_ctl io_ctl; > + struct droptree_root *dr; > + struct btrfs_root *root = fs_info->tree_root; > + struct extent_state *cached_state = NULL; > + int ret; > + > + io_ctl_init(&io_ctl, inode, root); > + io_ctl.check_crcs = 0; > + io_ctl_prepare_pages(&io_ctl, inode, 0); > + lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, > + i_size_read(inode) - 1, > + 0,&cached_state, GFP_NOFS); > + > + io_ctl_set_u32(&io_ctl, DROPTREE_VERSION); /* version */ > + io_ctl_set_u64(&io_ctl, fs_info->generation); /* generation */ > + > + list_for_each_entry(dr, droplist, list) { > + io_ctl_set_u64(&io_ctl, dr->root->root_key.objectid); > + io_ctl_set_u64(&io_ctl, dr->root->root_key.offset); > + io_ctl_set_u8(&io_ctl, dr->top_dn->level); > + ret = droptree_save_tree(fs_info,&io_ctl, dr->top_dn); > + BUG_ON(ret); /* can''t back out */ > + } > + io_ctl_set_u64(&io_ctl, 0); /* terminator */ > + > + ret = btrfs_dirty_pages(root, inode, io_ctl.pages, > + io_ctl.num_pages, > + 0, i_size_read(inode),&cached_state); > + BUG_ON(ret); /* can''t back out */ > + io_ctl_drop_pages(&io_ctl); > + unlock_extent_cached(&BTRFS_I(inode)->io_tree, 0, > + i_size_read(inode) - 1,&cached_state, > + GFP_NOFS); > + io_ctl_free(&io_ctl); > + > + ret = filemap_write_and_wait(inode->i_mapping); > + BUG_ON(ret); /* can''t back out */ > + > + ret = btrfs_update_inode(trans, root, inode); > + BUG_ON(ret); /* can''t back out */ > +} > + > +/* > + * read the saved state from the droptree inode and prepare everything so > + * it gets started by droptree_restart > + */ > +static int droptree_read_state(struct 
btrfs_fs_info *fs_info, > + struct inode *inode, > + struct reada_control *top_rc, > + struct list_head *droplist) > +{ > + struct io_ctl io_ctl; > + u32 version; > + u64 generation; > + struct droptree_node **stack; > + int ret = 0; > + > + stack = kmalloc(sizeof(*stack) * BTRFS_MAX_LEVEL, GFP_NOFS); > + if (!stack) > + return -ENOMEM; > + > + io_ctl_init(&io_ctl, inode, fs_info->tree_root); > + io_ctl.check_crcs = 0; > + io_ctl_prepare_pages(&io_ctl, inode, 1); > + > + version = io_ctl_get_u32(&io_ctl); > + if (version != DROPTREE_VERSION) { > + printk(KERN_ERR "btrfs: snapshot deletion state has been saved " > + "with a different version, ignored\n"); > + ret = -EINVAL; > + goto out; > + } > + /* FIXME generation is currently not needed */ > + generation = io_ctl_get_u64(&io_ctl); > + > + while (1) { > + struct btrfs_key key; > + int ret; > + struct btrfs_root *del_root; > + struct droptree_root *dr; > + int level; > + int max_level; > + struct droptree_node *root_dn; > + > + key.objectid = io_ctl_get_u64(&io_ctl); > + if (key.objectid == 0) > + break; > + > + key.type = BTRFS_ROOT_ITEM_KEY; > + key.offset = io_ctl_get_u64(&io_ctl); > + max_level = level = io_ctl_get_u8(&io_ctl); > + > + BUG_ON(level< 0 || level>= BTRFS_MAX_LEVEL); /* incons. fs */ > + del_root = btrfs_read_fs_root_no_radix(fs_info->tree_root, > + &key); > + if (IS_ERR(del_root)) { > + ret = PTR_ERR(del_root); > + BUG(); /* inconsistent fs */ > + } > + > + root_dn = droptree_alloc_node(NULL); > + /* > + * FIXME in this phase is should still be possible to undo > + * everything and return a failure. Same goes for the allocation > + * failures below > + */ > + BUG_ON(!root_dn); /* can''t back out */ > + dr = droptree_add_droproot(droplist, root_dn, del_root); > + BUG_ON(!dr); /* can''t back out */ > + > + stack[level] = root_dn; > + > + while (1) { > + u64 start; > + u64 len; > + u64 nritems; > + u32 *map; > + int n; > + int i; > + int parent_slot; > + struct droptree_node *dn; > + > + parent_slot = io_ctl_get_u16(&io_ctl); > + if (parent_slot == DROPTREE_STATE_GO_UP) { > + ++level; > + BUG_ON(level> max_level); /* incons. fs */ > + continue; > + } > + if (parent_slot == DROPTREE_STATE_GO_DOWN) { > + --level; > + BUG_ON(level< 0); /* incons. fs */ > + continue; > + } > + if (parent_slot == DROPTREE_STATE_END) > + break; > + start = io_ctl_get_u64(&io_ctl); > + if (start == 0) > + break; > + > + len = io_ctl_get_u64(&io_ctl); > + nritems = io_ctl_get_u16(&io_ctl); > + n = DT_BITS_TO_U32(nritems); > + BUG_ON(n> 999999); /* incons. fs */ > + BUG_ON(n == 0); /* incons. 
fs */ > + > + map = kmalloc(n * sizeof(u32), GFP_NOFS); > + BUG_ON(!map); /* can''t back out */ > + > + for (i = 0; i< n; ++i) > + map[i] = io_ctl_get_u32(&io_ctl); > + > + if (level == max_level) { > + /* only for root node */ > + dn = stack[level]; > + } else { > + dn = droptree_alloc_node(dr); > + BUG_ON(!dn); /* can''t back out */ > + dn->parent = stack[level + 1]; > + dn->parent_slot = parent_slot; > + list_add_tail(&dn->next, > + &stack[level+1]->children); > + stack[level] = dn; > + } > + dn->level = level; > + dn->start = start; > + dn->len = len; > + dn->map = map; > + dn->nritems = nritems; > + dn->generation = 0; > + } > + ret = droptree_reada_root(top_rc, dr); > + BUG_ON(ret); /* can''t back out */ > + } > +out: > + io_ctl_drop_pages(&io_ctl); > + io_ctl_free(&io_ctl); > + kfree(stack); > + > + return ret; > +} > + > +/* > + * called from transaction.c with a list of roots to delete > + */ > +void droptree_drop_list(struct btrfs_fs_info *fs_info, struct list_head *list) > +{ > + struct btrfs_root *root = fs_info->tree_root; > + struct inode *inode = NULL; > + struct btrfs_path *path = NULL; > + int ret; > + struct btrfs_trans_handle *trans; > + u64 alloc_hint = 0; > + u64 prealloc; > + loff_t oldsize; > + long max_nodes; > + int i; > + struct list_head droplist; > + struct droptree_root *dr; > + struct droptree_root *dtmp; > + int running_roots = 0; > + struct reada_control *top_rc = NULL; > + > + if (btrfs_fs_closing(fs_info)) > + return; > + > + inode = droptree_get_inode(fs_info);> + if (IS_ERR(inode))if (IS_ERR_OR_NULL(inode)) Thanks, Tsutomu> + goto out; > + > + path = btrfs_alloc_path(); > + if (!path) > + goto out; > + > + /* > + * create a dummy reada_control to use as a parent for all trees > + */ > + top_rc = btrfs_reada_alloc(NULL, root, NULL, NULL, NULL); > + if (!top_rc) > + goto out; > + reada_control_elem_get(top_rc); > + INIT_LIST_HEAD(&droplist); > + > + if (i_size_read(inode)> 0) { > + /* read */ > + ret = droptree_read_state(fs_info, inode, top_rc,&droplist); > + if (ret) > + goto out; > + list_for_each_entry(dr,&droplist, list) > + ++running_roots; > + } > + mutex_lock(&fs_info->droptree_lock); > + BUG_ON(fs_info->droptree_rc); /* can''t happen */ > + fs_info->droptree_rc = top_rc; > + mutex_unlock(&fs_info->droptree_lock); > + > + while (1) { > + mutex_lock(&fs_info->droptree_lock); > + while (fs_info->droptree_pause_req) { > + mutex_unlock(&fs_info->droptree_lock); > + wait_event(fs_info->droptree_wait, > + fs_info->droptree_pause_req == 0 || > + btrfs_fs_closing(fs_info)); > + if (btrfs_fs_closing(fs_info)) > + goto end; > + mutex_lock(&fs_info->droptree_lock); > + } > + ++fs_info->droptrees_running; > + mutex_unlock(&fs_info->droptree_lock); > + > + /* > + * 3 for truncation, including inode update > + * for each root, we need to delete the root_item afterwards > + */ > + trans = btrfs_start_transaction(root, 3 + DROPTREE_MAX_ROOTS); > + if (!trans) { > + btrfs_free_path(path); > + iput(inode); > + return; > + } > + > + max_nodes = 0; > + for (i = 0; i< BTRFS_MAX_LEVEL; ++i) > + max_nodes += fs_info->droptree_limit[i]; > + > + /* > + * global header (version(4), generation(8)) + terminator (8) > + */ > + prealloc = 20; > + > + /* > + * per root overhead: objectid(8), offset(8), level(1) + > + * end marker (2) > + */ > + prealloc += 19 * DROPTREE_MAX_ROOTS; > + > + /* > + * per node: parent slot(2), start(8), len(8), nritems(2) + map > + * we add also room for one UP/DOWN per node > + */ > + prealloc += (22 + > + 
DT_BITS_TO_U32(BTRFS_NODEPTRS_PER_BLOCK(root)) * 4) * > + max_nodes; > + prealloc = ALIGN(prealloc, PAGE_CACHE_SIZE); > + > + /* > + * preallocate the space, so writing it later on can''t fail > + * > + * FIXME allocate block reserve instead, to reserve space > + * for the truncation? */ > + ret = btrfs_delalloc_reserve_space(inode, prealloc); > + if (ret) > + goto out; > + > + /* > + * from here on, every error is fatal and must prevent the > + * current transaction from comitting, as that would leave an > + * inconsistent state on disk > + */ > + oldsize = i_size_read(inode); > + if (oldsize> 0) { > + BTRFS_I(inode)->generation = 0; > + btrfs_i_size_write(inode, 0); > + truncate_pagecache(inode, oldsize, 0); > + > + ret = btrfs_truncate_inode_items(trans, root, inode, 0, > + BTRFS_EXTENT_DATA_KEY); > + BUG_ON(ret); /* can''t back out */ > + > + ret = btrfs_update_inode(trans, root, inode); > + BUG_ON(ret); /* can''t back out */ > + } > + /* > + * add more roots until we reach the limit > + */ > + while (running_roots< DROPTREE_MAX_ROOTS&& > + !list_empty(list)) { > + struct btrfs_root *del_root; > + struct droptree_node *dn; > + > + del_root = list_first_entry(list, struct btrfs_root, > + root_list); > + list_del(&del_root->root_list); > + > + ret = btrfs_del_orphan_item(trans, root, > + del_root->root_key.objectid); > + BUG_ON(ret); /* can''t back out */ > + dn = droptree_alloc_node(NULL); > + BUG_ON(!dn); /* can''t back out */ > + dr = droptree_add_droproot(&droplist, dn, del_root); > + BUG_ON(!dr); /* can''t back out */ > + ret = droptree_reada_root(top_rc, dr); > + BUG_ON(ret); /* can''t back out */ > + > + ++running_roots; > + } > + > + /* > + * kick off the already queued jobs from the last pause, > + * and all freshly added roots > + */ > + droptree_restart(fs_info); > + droptree_kick(fs_info); > + > + /* > + * wait for all readaheads to finish. a pause will also cause > + * the wait to end > + */ > + list_for_each_entry(dr,&droplist, list) > + btrfs_reada_wait(dr->rc); > + > + /* > + * signal droptree_pause that it can continue. We still have > + * the trans handle, so the current transaction won''t commit > + * until we''ve written the state to disk > + */ > + mutex_lock(&fs_info->droptree_lock); > + --fs_info->droptrees_running; > + mutex_unlock(&fs_info->droptree_lock); > + wake_up(&fs_info->droptree_wait); > + > + /* > + * collect all finished droptrees > + */ > + list_for_each_entry_safe(dr, dtmp,&droplist, list) { > + struct droptree_node *dn; > + int full; > + dn = dr->top_dn; > + spin_lock(&dn->lock); > + full = dn->map&& > + droptree_bitmap_full(dn->map, dn->nritems); > + spin_unlock(&dn->lock); > + if (full) { > + struct btrfs_root *del_root = dr->root; > + > + list_del(&dr->list); > + ret = btrfs_del_root(trans, fs_info->tree_root, > + &del_root->root_key); > + BUG_ON(ret); /* can''t back out */ > + if (del_root->in_radix) { > + btrfs_free_fs_root(fs_info, del_root); > + } else { > + free_extent_buffer(del_root->node); > + free_extent_buffer(del_root-> > + commit_root); > + kfree(del_root); > + } > + kfree(dr->top_dn->map); > + kfree(dr->top_dn); > + kfree(dr); > + --running_roots; > + } > + } > + > + if (list_empty(&droplist)) { > + /* > + * nothing in progress. 
Just leave the droptree inode > + * at length zero and drop out of the loop > + */ > + btrfs_delalloc_release_space(inode, prealloc); > + btrfs_end_transaction(trans, root); > + break; > + } > + > + /* we reserved the space for this above */ > + ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, > + prealloc, prealloc, > + prealloc,&alloc_hint); > + BUG_ON(ret); /* can''t back out */ > + > + droptree_save_state(fs_info, inode, trans,&droplist); > + > + btrfs_end_transaction(trans, root); > + > + if (btrfs_fs_closing(fs_info)) { > + printk("fs is closing, abort loop\n"); > + break; > + } > + > + /* restart loop: create new reada_controls for the roots */ > + list_for_each_entry(dr,&droplist, list) { > + dr->rc = btrfs_reada_alloc(top_rc, dr->root, > + &min_key,&max_key, > + droptree_reada_fstree); > + /* > + * FIXME we could handle the allocation failure in > + * principle, as we''re currently in a consistent state > + */ > + BUG_ON(!dr->rc); /* can''t back out */ > + kref_get(&dr->rc->refcnt); > + } > + } > +end: > + > + /* > + * on unmount, we come here although we''re in the middle of a deletion. > + * This means there are still allocated dropnodes we have to free. We > + * free them by going down all the droptree_roots. > + */ > + while (!list_empty(&droplist)) { > + dr = list_first_entry(&droplist, struct droptree_root, list); > + list_del(&dr->list); > + droptree_free_tree(dr->top_dn); > + if (dr->root->in_radix) { > + btrfs_free_fs_root(fs_info, dr->root); > + } else { > + free_extent_buffer(dr->root->node); > + free_extent_buffer(dr->root->commit_root); > + kfree(dr->root); > + } > + kfree(dr); > + } > + /* > + * also delete everything from requeue > + */ > + while (!list_empty(&fs_info->droptree_requeue)) { > + struct droptree_node *dn; > + > + dn = list_first_entry(&fs_info->droptree_requeue, > + struct droptree_node, list); > + list_del(&dn->list); > + kfree(dn->map); > + kfree(dn); > + } > + /* > + * restart queue must be empty by now > + */ > + BUG_ON(!list_empty(&fs_info->droptree_restart)); /* can''t happen */ > +out: > + if (path) > + btrfs_free_path(path); > + if (inode) > + iput(inode); > + if (top_rc) { > + mutex_lock(&fs_info->droptree_lock); > + fs_info->droptree_rc = NULL; > + mutex_unlock(&fs_info->droptree_lock); > + reada_control_elem_put(top_rc); > + } > +}-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
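A note on the bitmap helpers: the hunks quoted above lean heavily on per-node completion bitmaps (dn->map) through droptree_set_bit(), droptree_test_bit(), droptree_bitmap_full() and DT_BITS_TO_U32(), which are defined elsewhere in the series and not quoted in this mail. The sketch below is an illustrative assumption of what they amount to, not the submitted implementation; only the names are taken from the patch.

/*
 * Minimal sketch (assumption, not the posted code) of the droptree
 * bitmap helpers: one bit per child slot of a node, packed into u32s.
 */
#include <stdint.h>

typedef uint32_t u32;

/* number of u32 words needed to hold one bit per child slot */
#define DT_BITS_TO_U32(nr)	(((nr) + 31) / 32)

/* mark the child in slot nr as finished */
static inline void droptree_set_bit(int nr, u32 *map)
{
	map[nr / 32] |= 1U << (nr % 32);
}

/* check whether the child in slot nr has already finished */
static inline int droptree_test_bit(int nr, const u32 *map)
{
	return (map[nr / 32] >> (nr % 32)) & 1;
}

/* true once every one of the nritems slots has been marked */
static inline int droptree_bitmap_full(const u32 *map, int nritems)
{
	int i;

	for (i = 0; i < nritems; ++i)
		if (!droptree_test_bit(i, map))
			return 0;
	return 1;
}

Tracking a bitmap per node rather than a plain counter is what lets droptree_save_state() record exactly which children are already freed, so that droptree_read_state() can resume an interrupted deletion after a mount without redoing finished subtrees.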
On 04/12/2012 11:54 PM, Arne Jansen wrote:> This patchset reimplements snapshot deletion with the help of the readahead > framework. For this callbacks are added to the framework. The main idea is > to traverse many snapshots at once at read many branches at once. This way > readahead get many requests at once (currently about 50000), giving it the > chance to order disk accesses properly. On a single disk, the effect is > currently spoiled by sync operations that still take place, mainly checksum > deletion. The most benefit can be gained with multiple devices, as all devices > can be fully utilized. It scales quite well with the number of devices. > For more details see the commit messages of the individual patches and the > source code comments. > > How it is tested: > I created a test volume using David Sterba''s stress-subvol-git-aging.sh. It > checks out randoms version of the kernel git tree, creating a snapshot from it > from time to time and checks out other versions there, and so on. In the end > the fs had 80 subvols with various degrees of sharing between them. The > following tests were conducted on it: > - delete a subvol using droptree and check the fs with btrfsck afterwards > for consistency > - delete all subvols and verify with btrfs-debug-tree that the extent > allocation tree is clean > - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get > a good pressure on locking > - add various degrees of memory pressure to the previous test to get pages > to expire early from page cache > - enable all relevant kernel debugging options during all tests > > The performance gain on a single drive was about 20%, on 8 drives about 600%. > It depends vastly on the maximum parallelity of the readahead, that is > currently hardcoded to about 50000. This number is subject to 2 factors, the > available RAM and the size of the saved state for a commit. As the full state > has to be saved on commit, a large parallelity leads to a large state. > > Based on this I''ll see if I can add delayed checksum deletions and running > the delayed refs via readahead, to gain a maximum ordering of I/O ops. >Hi Arne, Can you please show us some user cases in this, or can we get some extra benefits from it? :) thanks, liubo> This patchset is also available at > > git://git.kernel.org/pub/scm/linux/kernel/git/arne/linux-btrfs.git droptree > > Arne Jansen (5): > btrfs: extend readahead interface > btrfs: add droptree inode > btrfs: droptree structures and initialization > btrfs: droptree implementation > btrfs: use droptree for snapshot deletion > > fs/btrfs/Makefile | 2 +- > fs/btrfs/btrfs_inode.h | 4 + > fs/btrfs/ctree.h | 78 ++- > fs/btrfs/disk-io.c | 19 + > fs/btrfs/droptree.c | 1916 +++++++++++++++++++++++++++++++++++++++++++ > fs/btrfs/free-space-cache.c | 131 +++- > fs/btrfs/free-space-cache.h | 32 + > fs/btrfs/inode.c | 3 +- > fs/btrfs/reada.c | 494 +++++++++--- > fs/btrfs/scrub.c | 29 +- > fs/btrfs/transaction.c | 35 +- > 11 files changed, 2592 insertions(+), 151 deletions(-) > create mode 100644 fs/btrfs/droptree.c >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2012 at 9:40 PM, Liu Bo <liubo2009@cn.fujitsu.com> wrote:> On 04/12/2012 11:54 PM, Arne Jansen wrote: >> This patchset reimplements snapshot deletion with the help of the readahead >> framework. For this callbacks are added to the framework. The main idea is >> to traverse many snapshots at once at read many branches at once. This way >> readahead get many requests at once (currently about 50000), giving it the >> chance to order disk accesses properly. On a single disk, the effect is >> currently spoiled by sync operations that still take place, mainly checksum >> deletion. The most benefit can be gained with multiple devices, as all devices >> can be fully utilized. It scales quite well with the number of devices. >> For more details see the commit messages of the individual patches and the >> source code comments. >> >> How it is tested: >> I created a test volume using David Sterba''s stress-subvol-git-aging.sh. It >> checks out randoms version of the kernel git tree, creating a snapshot from it >> from time to time and checks out other versions there, and so on. In the end >> the fs had 80 subvols with various degrees of sharing between them. The >> following tests were conducted on it: >> Â - delete a subvol using droptree and check the fs with btrfsck afterwards >> Â Â for consistency >> Â - delete all subvols and verify with btrfs-debug-tree that the extent >> Â Â allocation tree is clean >> Â - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get >> Â Â a good pressure on locking >> Â - add various degrees of memory pressure to the previous test to get pages >> Â Â to expire early from page cache >> Â - enable all relevant kernel debugging options during all tests >> >> The performance gain on a single drive was about 20%, on 8 drives about 600%. >> It depends vastly on the maximum parallelity of the readahead, that is >> currently hardcoded to about 50000. This number is subject to 2 factors, the >> available RAM and the size of the saved state for a commit. As the full state >> has to be saved on commit, a large parallelity leads to a large state. >> >> Based on this I''ll see if I can add delayed checksum deletions and running >> the delayed refs via readahead, to gain a maximum ordering of I/O ops. >> > > Hi Arne, > > Can you please show us some user cases in this, or can we get some extra benefits from it? :) > > thanks, > liuboExpiring old backups routinely takes days to complete due to how long it takes snapshot deletion to finish. This makes maximizing the number of retained backups, or even simply ensuring that we have enough space for the current backup, quite difficult. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 13.04.2012 04:53, Tsutomu Itoh wrote:> Hi, Arne, > > (2012/04/13 0:54), Arne Jansen wrote: >> This is an implementation of snapshot deletion using the readahead >> framework. Multiple snapshots can be deleted at once and the trees >> are not enumerated sequentially but in parallel in many branches. >> This way readahead can reorder the request to better utilize all >> disks. For a more detailed description see inline comments. >> >> Signed-off-by: Arne Jansen<sensille@gmx.net> >> ---[snip]>> +/* >> + * read the saved state from the droptree inode and prepare everything so >> + * it gets started by droptree_restart >> + */ >> +static int droptree_read_state(struct btrfs_fs_info *fs_info, >> + struct inode *inode, >> + struct reada_control *top_rc, >> + struct list_head *droplist) >> +{ >> + struct io_ctl io_ctl; >> + u32 version; >> + u64 generation; >> + struct droptree_node **stack; >> + int ret = 0; >> + >> + stack = kmalloc(sizeof(*stack) * BTRFS_MAX_LEVEL, GFP_NOFS); >> + if (!stack) >> + return -ENOMEM; >> + >> + io_ctl_init(&io_ctl, inode, fs_info->tree_root); >> + io_ctl.check_crcs = 0; >> + io_ctl_prepare_pages(&io_ctl, inode, 1); >> + >> + version = io_ctl_get_u32(&io_ctl); >> + if (version != DROPTREE_VERSION) { >> + printk(KERN_ERR "btrfs: snapshot deletion state has been saved " >> + "with a different version, ignored\n"); >> + ret = -EINVAL; >> + goto out; >> + } >> + /* FIXME generation is currently not needed */ >> + generation = io_ctl_get_u64(&io_ctl); >> + >> + while (1) { >> + struct btrfs_key key; >> + int ret; >> + struct btrfs_root *del_root; >> + struct droptree_root *dr; >> + int level; >> + int max_level; >> + struct droptree_node *root_dn; >> + >> + key.objectid = io_ctl_get_u64(&io_ctl); >> + if (key.objectid == 0) >> + break; >> + >> + key.type = BTRFS_ROOT_ITEM_KEY; >> + key.offset = io_ctl_get_u64(&io_ctl); >> + max_level = level = io_ctl_get_u8(&io_ctl); >> + >> + BUG_ON(level< 0 || level>= BTRFS_MAX_LEVEL); /* incons. fs */ >> + del_root = btrfs_read_fs_root_no_radix(fs_info->tree_root, >> + &key); >> + if (IS_ERR(del_root)) { >> + ret = PTR_ERR(del_root); >> + BUG(); /* inconsistent fs */ >> + } >> + >> + root_dn = droptree_alloc_node(NULL); >> + /* >> + * FIXME in this phase is should still be possible to undo >> + * everything and return a failure. Same goes for the allocation >> + * failures below >> + */ >> + BUG_ON(!root_dn); /* can''t back out */ >> + dr = droptree_add_droproot(droplist, root_dn, del_root); >> + BUG_ON(!dr); /* can''t back out */ >> + >> + stack[level] = root_dn; >> + >> + while (1) { >> + u64 start; >> + u64 len; >> + u64 nritems; >> + u32 *map; >> + int n; >> + int i; >> + int parent_slot; >> + struct droptree_node *dn; >> + >> + parent_slot = io_ctl_get_u16(&io_ctl); >> + if (parent_slot == DROPTREE_STATE_GO_UP) { >> + ++level; >> + BUG_ON(level> max_level); /* incons. fs */ >> + continue; >> + } >> + if (parent_slot == DROPTREE_STATE_GO_DOWN) { >> + --level; >> + BUG_ON(level< 0); /* incons. fs */ >> + continue; >> + } >> + if (parent_slot == DROPTREE_STATE_END) >> + break; >> + start = io_ctl_get_u64(&io_ctl); >> + if (start == 0) >> + break; >> + >> + len = io_ctl_get_u64(&io_ctl); >> + nritems = io_ctl_get_u16(&io_ctl); >> + n = DT_BITS_TO_U32(nritems); >> + BUG_ON(n> 999999); /* incons. fs */ >> + BUG_ON(n == 0); /* incons. 
fs */
>> +
>> + map = kmalloc(n * sizeof(u32), GFP_NOFS);
>> + BUG_ON(!map); /* can't back out */
>> +
>> + for (i = 0; i < n; ++i)
>> + map[i] = io_ctl_get_u32(&io_ctl);
>> +
>> + if (level == max_level) {
>> + /* only for root node */
>> + dn = stack[level];
>> + } else {
>> + dn = droptree_alloc_node(dr);
>> + BUG_ON(!dn); /* can't back out */
>> + dn->parent = stack[level + 1];
>> + dn->parent_slot = parent_slot;
>> + list_add_tail(&dn->next,
>> + &stack[level+1]->children);
>> + stack[level] = dn;
>> + }
>> + dn->level = level;
>> + dn->start = start;
>> + dn->len = len;
>> + dn->map = map;
>> + dn->nritems = nritems;
>> + dn->generation = 0;
>> + }
>> + ret = droptree_reada_root(top_rc, dr);
>> + BUG_ON(ret); /* can't back out */
>> + }
>> +out:
>> + io_ctl_drop_pages(&io_ctl);
>> + io_ctl_free(&io_ctl);
>> + kfree(stack);
>> +
>> + return ret;
>> +}
>> +
>> +/*
>> + * called from transaction.c with a list of roots to delete
>> + */
>> +void droptree_drop_list(struct btrfs_fs_info *fs_info, struct list_head *list)
>> +{
>> + struct btrfs_root *root = fs_info->tree_root;
>> + struct inode *inode = NULL;
>> + struct btrfs_path *path = NULL;
>> + int ret;
>> + struct btrfs_trans_handle *trans;
>> + u64 alloc_hint = 0;
>> + u64 prealloc;
>> + loff_t oldsize;
>> + long max_nodes;
>> + int i;
>> + struct list_head droplist;
>> + struct droptree_root *dr;
>> + struct droptree_root *dtmp;
>> + int running_roots = 0;
>> + struct reada_control *top_rc = NULL;
>> +
>> + if (btrfs_fs_closing(fs_info))
>> + return;
>> +
>> + inode = droptree_get_inode(fs_info);
>
>> + if (IS_ERR(inode))
>
> if (IS_ERR_OR_NULL(inode))

got it, thanks for catching it.

-Arne

>
> Thanks,
> Tsutomu
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
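To spell out the point Tsutomu raised: droptree_get_inode() as posted can hand back a valid inode, an ERR_PTR from btrfs_iget(), or NULL (when btrfs_insert_empty_inode() fails and the function jumps to out with inode still NULL), so the caller has to use IS_ERR_OR_NULL() rather than IS_ERR(). A sketch of the call-site pattern using the kernel's <linux/err.h> helpers; the wrapper name and the choice of -ENOMEM for the NULL case are made up for illustration and are not part of the patch:

#include <linux/err.h>

/*
 * Illustration only: treat both ERR_PTR and NULL returns from the
 * lookup as failure, as suggested in the review above.
 */
static int droptree_open_state_inode(struct btrfs_fs_info *fs_info,
				     struct inode **inode_ret)
{
	struct inode *inode;

	inode = droptree_get_inode(fs_info);
	if (IS_ERR_OR_NULL(inode))
		return inode ? PTR_ERR(inode) : -ENOMEM;

	*inode_ret = inode;
	return 0;
}

The alternative fix would be to make droptree_get_inode() itself return ERR_PTR(ret) instead of falling through to NULL, which would let callers keep a plain IS_ERR() check.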
On 13.04.2012 05:40, Liu Bo wrote:> On 04/12/2012 11:54 PM, Arne Jansen wrote: >> This patchset reimplements snapshot deletion with the help of the readahead >> framework. For this callbacks are added to the framework. The main idea is >> to traverse many snapshots at once at read many branches at once. This way >> readahead get many requests at once (currently about 50000), giving it the >> chance to order disk accesses properly. On a single disk, the effect is >> currently spoiled by sync operations that still take place, mainly checksum >> deletion. The most benefit can be gained with multiple devices, as all devices >> can be fully utilized. It scales quite well with the number of devices. >> For more details see the commit messages of the individual patches and the >> source code comments. >> >> How it is tested: >> I created a test volume using David Sterba''s stress-subvol-git-aging.sh. It >> checks out randoms version of the kernel git tree, creating a snapshot from it >> from time to time and checks out other versions there, and so on. In the end >> the fs had 80 subvols with various degrees of sharing between them. The >> following tests were conducted on it: >> - delete a subvol using droptree and check the fs with btrfsck afterwards >> for consistency >> - delete all subvols and verify with btrfs-debug-tree that the extent >> allocation tree is clean >> - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get >> a good pressure on locking >> - add various degrees of memory pressure to the previous test to get pages >> to expire early from page cache >> - enable all relevant kernel debugging options during all tests >> >> The performance gain on a single drive was about 20%, on 8 drives about 600%. >> It depends vastly on the maximum parallelity of the readahead, that is >> currently hardcoded to about 50000. This number is subject to 2 factors, the >> available RAM and the size of the saved state for a commit. As the full state >> has to be saved on commit, a large parallelity leads to a large state. >> >> Based on this I''ll see if I can add delayed checksum deletions and running >> the delayed refs via readahead, to gain a maximum ordering of I/O ops. >> > > Hi Arne, > > Can you please show us some user cases in this, or can we get some extra benefits from it? :)The case I''m most concerned with is having large filesystems (like 20x3T) with thousands of users on it in thousands of subvolumes and making hourly snapshots. Creating these snapshots is relatively easy, getting rid of them is not. But there are already reports where deleting a single snapshot can take several days. 
So we really need a huge speedup here, and this is only the beginning :) -Arne> > thanks, > liubo > >> This patchset is also available at >> >> git://git.kernel.org/pub/scm/linux/kernel/git/arne/linux-btrfs.git droptree >> >> Arne Jansen (5): >> btrfs: extend readahead interface >> btrfs: add droptree inode >> btrfs: droptree structures and initialization >> btrfs: droptree implementation >> btrfs: use droptree for snapshot deletion >> >> fs/btrfs/Makefile | 2 +- >> fs/btrfs/btrfs_inode.h | 4 + >> fs/btrfs/ctree.h | 78 ++- >> fs/btrfs/disk-io.c | 19 + >> fs/btrfs/droptree.c | 1916 +++++++++++++++++++++++++++++++++++++++++++ >> fs/btrfs/free-space-cache.c | 131 +++- >> fs/btrfs/free-space-cache.h | 32 + >> fs/btrfs/inode.c | 3 +- >> fs/btrfs/reada.c | 494 +++++++++--- >> fs/btrfs/scrub.c | 29 +- >> fs/btrfs/transaction.c | 35 +- >> 11 files changed, 2592 insertions(+), 151 deletions(-) >> create mode 100644 fs/btrfs/droptree.c >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/13/2012 02:53 PM, Arne Jansen wrote:> On 13.04.2012 05:40, Liu Bo wrote: >> > On 04/12/2012 11:54 PM, Arne Jansen wrote: >>> >> This patchset reimplements snapshot deletion with the help of the readahead >>> >> framework. For this callbacks are added to the framework. The main idea is >>> >> to traverse many snapshots at once at read many branches at once. This way >>> >> readahead get many requests at once (currently about 50000), giving it the >>> >> chance to order disk accesses properly. On a single disk, the effect is >>> >> currently spoiled by sync operations that still take place, mainly checksum >>> >> deletion. The most benefit can be gained with multiple devices, as all devices >>> >> can be fully utilized. It scales quite well with the number of devices. >>> >> For more details see the commit messages of the individual patches and the >>> >> source code comments. >>> >> >>> >> How it is tested: >>> >> I created a test volume using David Sterba''s stress-subvol-git-aging.sh. It >>> >> checks out randoms version of the kernel git tree, creating a snapshot from it >>> >> from time to time and checks out other versions there, and so on. In the end >>> >> the fs had 80 subvols with various degrees of sharing between them. The >>> >> following tests were conducted on it: >>> >> - delete a subvol using droptree and check the fs with btrfsck afterwards >>> >> for consistency >>> >> - delete all subvols and verify with btrfs-debug-tree that the extent >>> >> allocation tree is clean >>> >> - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get >>> >> a good pressure on locking >>> >> - add various degrees of memory pressure to the previous test to get pages >>> >> to expire early from page cache >>> >> - enable all relevant kernel debugging options during all tests >>> >> >>> >> The performance gain on a single drive was about 20%, on 8 drives about 600%. >>> >> It depends vastly on the maximum parallelity of the readahead, that is >>> >> currently hardcoded to about 50000. This number is subject to 2 factors, the >>> >> available RAM and the size of the saved state for a commit. As the full state >>> >> has to be saved on commit, a large parallelity leads to a large state. >>> >> >>> >> Based on this I''ll see if I can add delayed checksum deletions and running >>> >> the delayed refs via readahead, to gain a maximum ordering of I/O ops. >>> >> >> > >> > Hi Arne, >> > >> > Can you please show us some user cases in this, or can we get some extra benefits from it? :) > > The case I''m most concerned with is having large filesystems (like 20x3T) > with thousands of users on it in thousands of subvolumes and making > hourly snapshots. Creating these snapshots is relatively easy, getting rid > of them is not. > But there are already reports where deleting a single snapshot can take > several days. So we really need a huge speedup here, and this is only > the beginning :)I see. I''ve just tested it on 3.4-rc2, I cannot get it through due to the following, could you take a look? Apr 8 14:58:08 kvm kernel: ------------[ cut here ]------------ Apr 8 14:58:08 kvm kernel: kernel BUG at fs/btrfs/droptree.c:418! 
Apr 8 14:58:08 kvm kernel: invalid opcode: 0000 [#1] SMP Apr 8 14:58:08 kvm kernel: CPU 1 Apr 8 14:58:08 kvm kernel: Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables iptable_filter ipt_REJECT ip_tables bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon pcspkr i2c_i801 iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_ge neric ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs] Apr 8 14:58:08 kvm kernel: Apr 8 14:58:08 kvm kernel: Pid: 532, comm: btrfs-readahead Tainted: G W O 3.4.0-rc1+ #10 LENOVO QiTianM7150/To be filled by O.E.M. Apr 8 14:58:08 kvm kernel: RIP: 0010:[<ffffffffa082f800>] [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] Apr 8 14:58:08 kvm kernel: RSP: 0018:ffff88003418bda0 EFLAGS: 00010286 Apr 8 14:58:08 kvm kernel: RAX: 00000000ffffffff RBX: ffff88007ab74348 RCX: 0000000105585190 Apr 8 14:58:08 kvm kernel: RDX: 000000000000003a RSI: ffffffff81ade6a0 RDI: 0000000000000286 Apr 8 14:58:08 kvm kernel: RBP: ffff88003418be10 R08: 000000000000003f R09: 0000000000000003 Apr 8 14:58:08 kvm kernel: R10: 0000000000000002 R11: 0000000000008340 R12: ffff880004194718 Apr 8 14:58:08 kvm kernel: R13: ffff88004004e000 R14: ffff880034b9b000 R15: ffff88000c64a820 Apr 8 14:58:08 kvm kernel: FS: 0000000000000000(0000) GS:ffff88007da80000(0000) knlGS:0000000000000000 Apr 8 14:58:08 kvm kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Apr 8 14:58:08 kvm kernel: CR2: 0000003842d454a4 CR3: 000000003d0a0000 CR4: 00000000000407e0 Apr 8 14:58:08 kvm kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Apr 8 14:58:08 kvm kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Apr 8 14:58:08 kvm kernel: Process btrfs-readahead (pid: 532, threadinfo ffff88003418a000, task ffff880076d3a040) Apr 8 14:58:08 kvm kernel: Stack: Apr 8 14:58:08 kvm kernel: ffff880078f06d40 ffff88007690ecd8 ffff88000c605758 ffff880078f06d80 Apr 8 14:58:08 kvm kernel: ffff880036916740 ffff88007ab742c0 0000000000000002 000000000000064f Apr 8 14:58:08 kvm kernel: ffff88003418be10 ffff88007690ecc0 ffff88007690ed10 ffff88007690ecd8 Apr 8 14:58:08 kvm kernel: Call Trace: Apr 8 14:58:08 kvm kernel: [<ffffffffa08032df>] worker_loop+0x14f/0x5a0 [btrfs] Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? btrfs_queue_worker+0x300/0x300 [btrfs] Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? btrfs_queue_worker+0x300/0x300 [btrfs] Apr 8 14:58:08 kvm kernel: [<ffffffff8106f1ae>] kthread+0x9e/0xb0 Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea4>] kernel_thread_helper+0x4/0x10 Apr 8 14:58:08 kvm kernel: [<ffffffff8106f110>] ? kthread_freezable_should_stop+0x70/0x70 Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea0>] ? 
gs_change+0x13/0x13 Apr 8 14:58:08 kvm kernel: Code: fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe be fe 01 00 00 48 c7 c7 58 7b 83 a0 e8 e9 dd 81 e0 44 8b 55 a8 e9 77 ff ff ff <0f> 0b eb fe 0f 0b eb fe 90 90 90 90 90 90 90 90 55 48 89 e5 41 Apr 8 14:58:08 kvm kernel: RIP [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] thanks, -- liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 13.04.2012 09:10, Liu Bo wrote:> On 04/13/2012 02:53 PM, Arne Jansen wrote: > >> On 13.04.2012 05:40, Liu Bo wrote: >>>> On 04/12/2012 11:54 PM, Arne Jansen wrote: >>>>>> This patchset reimplements snapshot deletion with the help of the readahead >>>>>> framework. For this callbacks are added to the framework. The main idea is >>>>>> to traverse many snapshots at once at read many branches at once. This way >>>>>> readahead get many requests at once (currently about 50000), giving it the >>>>>> chance to order disk accesses properly. On a single disk, the effect is >>>>>> currently spoiled by sync operations that still take place, mainly checksum >>>>>> deletion. The most benefit can be gained with multiple devices, as all devices >>>>>> can be fully utilized. It scales quite well with the number of devices. >>>>>> For more details see the commit messages of the individual patches and the >>>>>> source code comments. >>>>>> >>>>>> How it is tested: >>>>>> I created a test volume using David Sterba''s stress-subvol-git-aging.sh. It >>>>>> checks out randoms version of the kernel git tree, creating a snapshot from it >>>>>> from time to time and checks out other versions there, and so on. In the end >>>>>> the fs had 80 subvols with various degrees of sharing between them. The >>>>>> following tests were conducted on it: >>>>>> - delete a subvol using droptree and check the fs with btrfsck afterwards >>>>>> for consistency >>>>>> - delete all subvols and verify with btrfs-debug-tree that the extent >>>>>> allocation tree is clean >>>>>> - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get >>>>>> a good pressure on locking >>>>>> - add various degrees of memory pressure to the previous test to get pages >>>>>> to expire early from page cache >>>>>> - enable all relevant kernel debugging options during all tests >>>>>> >>>>>> The performance gain on a single drive was about 20%, on 8 drives about 600%. >>>>>> It depends vastly on the maximum parallelity of the readahead, that is >>>>>> currently hardcoded to about 50000. This number is subject to 2 factors, the >>>>>> available RAM and the size of the saved state for a commit. As the full state >>>>>> has to be saved on commit, a large parallelity leads to a large state. >>>>>> >>>>>> Based on this I''ll see if I can add delayed checksum deletions and running >>>>>> the delayed refs via readahead, to gain a maximum ordering of I/O ops. >>>>>> >>>> >>>> Hi Arne, >>>> >>>> Can you please show us some user cases in this, or can we get some extra benefits from it? :) >> >> The case I''m most concerned with is having large filesystems (like 20x3T) >> with thousands of users on it in thousands of subvolumes and making >> hourly snapshots. Creating these snapshots is relatively easy, getting rid >> of them is not. >> But there are already reports where deleting a single snapshot can take >> several days. So we really need a huge speedup here, and this is only >> the beginning :) > > > I see. > > I''ve just tested it on 3.4-rc2, I cannot get it through due to the following, could you take a look? > > Apr 8 14:58:08 kvm kernel: ------------[ cut here ]------------ > Apr 8 14:58:08 kvm kernel: kernel BUG at fs/btrfs/droptree.c:418!might be out of memory. How much does this vm (?) have? Can you try to reduce the constants in disk-io.c:2003-2005? 
Thanks, Arne> Apr 8 14:58:08 kvm kernel: invalid opcode: 0000 [#1] SMP > Apr 8 14:58:08 kvm kernel: CPU 1 > Apr 8 14:58:08 kvm kernel: Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables iptable_filter ipt_REJECT ip_tables bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon pcspkr i2c_i801 iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]> Apr 8 14:58:08 kvm kernel: > Apr 8 14:58:08 kvm kernel: Pid: 532, comm: btrfs-readahead Tainted: G W O 3.4.0-rc1+ #10 LENOVO QiTianM7150/To be filled by O.E.M. > Apr 8 14:58:08 kvm kernel: RIP: 0010:[<ffffffffa082f800>] [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] > Apr 8 14:58:08 kvm kernel: RSP: 0018:ffff88003418bda0 EFLAGS: 00010286 > Apr 8 14:58:08 kvm kernel: RAX: 00000000ffffffff RBX: ffff88007ab74348 RCX: 0000000105585190 > Apr 8 14:58:08 kvm kernel: RDX: 000000000000003a RSI: ffffffff81ade6a0 RDI: 0000000000000286 > Apr 8 14:58:08 kvm kernel: RBP: ffff88003418be10 R08: 000000000000003f R09: 0000000000000003 > Apr 8 14:58:08 kvm kernel: R10: 0000000000000002 R11: 0000000000008340 R12: ffff880004194718 > Apr 8 14:58:08 kvm kernel: R13: ffff88004004e000 R14: ffff880034b9b000 R15: ffff88000c64a820 > Apr 8 14:58:08 kvm kernel: FS: 0000000000000000(0000) GS:ffff88007da80000(0000) knlGS:0000000000000000 > Apr 8 14:58:08 kvm kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > Apr 8 14:58:08 kvm kernel: CR2: 0000003842d454a4 CR3: 000000003d0a0000 CR4: 00000000000407e0 > Apr 8 14:58:08 kvm kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > Apr 8 14:58:08 kvm kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Apr 8 14:58:08 kvm kernel: Process btrfs-readahead (pid: 532, threadinfo ffff88003418a000, task ffff880076d3a040) > Apr 8 14:58:08 kvm kernel: Stack: > Apr 8 14:58:08 kvm kernel: ffff880078f06d40 ffff88007690ecd8 ffff88000c605758 ffff880078f06d80 > Apr 8 14:58:08 kvm kernel: ffff880036916740 ffff88007ab742c0 0000000000000002 000000000000064f > Apr 8 14:58:08 kvm kernel: ffff88003418be10 ffff88007690ecc0 ffff88007690ed10 ffff88007690ecd8 > Apr 8 14:58:08 kvm kernel: Call Trace: > Apr 8 14:58:08 kvm kernel: [<ffffffffa08032df>] worker_loop+0x14f/0x5a0 [btrfs] > Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? btrfs_queue_worker+0x300/0x300 [btrfs] > Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? btrfs_queue_worker+0x300/0x300 [btrfs] > Apr 8 14:58:08 kvm kernel: [<ffffffff8106f1ae>] kthread+0x9e/0xb0 > Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea4>] kernel_thread_helper+0x4/0x10 > Apr 8 14:58:08 kvm kernel: [<ffffffff8106f110>] ? kthread_freezable_should_stop+0x70/0x70 > Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea0>] ? 
gs_change+0x13/0x13 > Apr 8 14:58:08 kvm kernel: Code: fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe be fe 01 00 00 48 c7 c7 58 7b 83 a0 e8 e9 dd 81 e0 44 8b 55 a8 e9 77 ff ff ff <0f> 0b eb fe 0f 0b eb fe 90 90 90 90 90 90 90 90 55 48 89 e5 41 > Apr 8 14:58:08 kvm kernel: RIP [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] > > > thanks,-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/13/2012 03:10 PM, Liu Bo wrote:> On 04/13/2012 02:53 PM, Arne Jansen wrote: > >> On 13.04.2012 05:40, Liu Bo wrote: >>>> On 04/12/2012 11:54 PM, Arne Jansen wrote: >>>>>> This patchset reimplements snapshot deletion with the help of the readahead >>>>>> framework. For this callbacks are added to the framework. The main idea is >>>>>> to traverse many snapshots at once at read many branches at once. This way >>>>>> readahead get many requests at once (currently about 50000), giving it the >>>>>> chance to order disk accesses properly. On a single disk, the effect is >>>>>> currently spoiled by sync operations that still take place, mainly checksum >>>>>> deletion. The most benefit can be gained with multiple devices, as all devices >>>>>> can be fully utilized. It scales quite well with the number of devices. >>>>>> For more details see the commit messages of the individual patches and the >>>>>> source code comments. >>>>>> >>>>>> How it is tested: >>>>>> I created a test volume using David Sterba''s stress-subvol-git-aging.sh. It >>>>>> checks out randoms version of the kernel git tree, creating a snapshot from it >>>>>> from time to time and checks out other versions there, and so on. In the end >>>>>> the fs had 80 subvols with various degrees of sharing between them. The >>>>>> following tests were conducted on it: >>>>>> - delete a subvol using droptree and check the fs with btrfsck afterwards >>>>>> for consistency >>>>>> - delete all subvols and verify with btrfs-debug-tree that the extent >>>>>> allocation tree is clean >>>>>> - delete 70 subvols, and in parallel empty the other 10 with rm -rf to get >>>>>> a good pressure on locking >>>>>> - add various degrees of memory pressure to the previous test to get pages >>>>>> to expire early from page cache >>>>>> - enable all relevant kernel debugging options during all tests >>>>>> >>>>>> The performance gain on a single drive was about 20%, on 8 drives about 600%. >>>>>> It depends vastly on the maximum parallelity of the readahead, that is >>>>>> currently hardcoded to about 50000. This number is subject to 2 factors, the >>>>>> available RAM and the size of the saved state for a commit. As the full state >>>>>> has to be saved on commit, a large parallelity leads to a large state. >>>>>> >>>>>> Based on this I''ll see if I can add delayed checksum deletions and running >>>>>> the delayed refs via readahead, to gain a maximum ordering of I/O ops. >>>>>> >>>> Hi Arne, >>>> >>>> Can you please show us some user cases in this, or can we get some extra benefits from it? :) >> The case I''m most concerned with is having large filesystems (like 20x3T) >> with thousands of users on it in thousands of subvolumes and making >> hourly snapshots. Creating these snapshots is relatively easy, getting rid >> of them is not. >> But there are already reports where deleting a single snapshot can take >> several days. So we really need a huge speedup here, and this is only >> the beginning :) > > > I see. > > I''ve just tested it on 3.4-rc2, I cannot get it through due to the following, could you take a look? > > Apr 8 14:58:08 kvm kernel: ------------[ cut here ]------------ > Apr 8 14:58:08 kvm kernel: kernel BUG at fs/btrfs/droptree.c:418! 
> Apr 8 14:58:08 kvm kernel: invalid opcode: 0000 [#1] SMP > Apr 8 14:58:08 kvm kernel: CPU 1 > Apr 8 14:58:08 kvm kernel: Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables iptable_filter ipt_REJECT ip_tables bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon pcspkr i2c_i801 iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]> Apr 8 14:58:08 kvm kernel: > Apr 8 14:58:08 kvm kernel: Pid: 532, comm: btrfs-readahead Tainted: G W O 3.4.0-rc1+ #10 LENOVO QiTianM7150/To be filled by O.E.M. > Apr 8 14:58:08 kvm kernel: RIP: 0010:[<ffffffffa082f800>] [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] > Apr 8 14:58:08 kvm kernel: RSP: 0018:ffff88003418bda0 EFLAGS: 00010286 > Apr 8 14:58:08 kvm kernel: RAX: 00000000ffffffff RBX: ffff88007ab74348 RCX: 0000000105585190 > Apr 8 14:58:08 kvm kernel: RDX: 000000000000003a RSI: ffffffff81ade6a0 RDI: 0000000000000286 > Apr 8 14:58:08 kvm kernel: RBP: ffff88003418be10 R08: 000000000000003f R09: 0000000000000003 > Apr 8 14:58:08 kvm kernel: R10: 0000000000000002 R11: 0000000000008340 R12: ffff880004194718 > Apr 8 14:58:08 kvm kernel: R13: ffff88004004e000 R14: ffff880034b9b000 R15: ffff88000c64a820 > Apr 8 14:58:08 kvm kernel: FS: 0000000000000000(0000) GS:ffff88007da80000(0000) knlGS:0000000000000000 > Apr 8 14:58:08 kvm kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > Apr 8 14:58:08 kvm kernel: CR2: 0000003842d454a4 CR3: 000000003d0a0000 CR4: 00000000000407e0 > Apr 8 14:58:08 kvm kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > Apr 8 14:58:08 kvm kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Apr 8 14:58:08 kvm kernel: Process btrfs-readahead (pid: 532, threadinfo ffff88003418a000, task ffff880076d3a040) > Apr 8 14:58:08 kvm kernel: Stack: > Apr 8 14:58:08 kvm kernel: ffff880078f06d40 ffff88007690ecd8 ffff88000c605758 ffff880078f06d80 > Apr 8 14:58:08 kvm kernel: ffff880036916740 ffff88007ab742c0 0000000000000002 000000000000064f > Apr 8 14:58:08 kvm kernel: ffff88003418be10 ffff88007690ecc0 ffff88007690ed10 ffff88007690ecd8 > Apr 8 14:58:08 kvm kernel: Call Trace: > Apr 8 14:58:08 kvm kernel: [<ffffffffa08032df>] worker_loop+0x14f/0x5a0 [btrfs] > Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? btrfs_queue_worker+0x300/0x300 [btrfs] > Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? btrfs_queue_worker+0x300/0x300 [btrfs] > Apr 8 14:58:08 kvm kernel: [<ffffffff8106f1ae>] kthread+0x9e/0xb0 > Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea4>] kernel_thread_helper+0x4/0x10 > Apr 8 14:58:08 kvm kernel: [<ffffffff8106f110>] ? kthread_freezable_should_stop+0x70/0x70 > Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea0>] ? 
gs_change+0x13/0x13 > Apr 8 14:58:08 kvm kernel: Code: fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe be fe 01 00 00 48 c7 c7 58 7b 83 a0 e8 e9 dd 81 e0 44 8b 55 a8 e9 77 ff ff ff <0f> 0b eb fe 0f 0b eb fe 90 90 90 90 90 90 90 90 55 48 89 e5 41 > Apr 8 14:58:08 kvm kernel: RIP [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] > > > thanks,The script: umount /mnt/btrfs mkfs.btrfs /dev/sdb7 mount /dev/sdb7 /mnt/btrfs echo "fio" fio fio.jobs echo "remount 1" umount /mnt/btrfs; mount /dev/sdb7 /mnt/btrfs; for i in `seq 1 1 2000`; do btrfs sub snap /mnt/btrfs /mnt/btrfs/s$i > /dev/null 2>&1; done echo "remount 2" umount /mnt/btrfs; mount /dev/sdb7 /mnt/btrfs; for i in `seq 1 1 2000`; do btrfs sub delete /mnt/btrfs/s$i > /dev/null 2>&1; done echo "umount" time umount /mnt/btrfs fio.jobs: [global] group_reporting bs=4k rw=randrw sync=0 ioengine=sync directory=/mnt/btrfs/ [READ] filename=foobar size=200M thanks, -- liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/13/2012 03:19 PM, Arne Jansen wrote:
>
> might be out of memory. How much does this vm (?) have?
> Can you try to reduce the constants in disk-io.c:2003-2005?
>
> Thanks,
> Arne
>

Seems not related to an OOM:

# free -m
             total       used       free     shared    buffers     cached
Mem:          1973        478       1494          0         49        315
-/+ buffers/cache:        113       1859
Swap:            0          0          0

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 13.04.2012 09:20, Liu Bo wrote:> On 04/13/2012 03:10 PM, Liu Bo wrote: > >> On 04/13/2012 02:53 PM, Arne Jansen wrote: >> >>> On 13.04.2012 05:40, Liu Bo wrote:>> >> >> I see. >> >> I''ve just tested it on 3.4-rc2, I cannot get it through due to the following, could you take a look? >> >> Apr 8 14:58:08 kvm kernel: ------------[ cut here ]------------ >> Apr 8 14:58:08 kvm kernel: kernel BUG at fs/btrfs/droptree.c:418! >> Apr 8 14:58:08 kvm kernel: invalid opcode: 0000 [#1] SMP >> Apr 8 14:58:08 kvm kernel: CPU 1 >> Apr 8 14:58:08 kvm kernel: Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables iptable_filter ipt_REJECT ip_tables bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon pcspkr i2c_i801 iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]>> Apr 8 14:58:08 kvm kernel: >> Apr 8 14:58:08 kvm kernel: Pid: 532, comm: btrfs-readahead Tainted: G W O 3.4.0-rc1+ #10 LENOVO QiTianM7150/To be filled by O.E.M. >> Apr 8 14:58:08 kvm kernel: RIP: 0010:[<ffffffffa082f800>] [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] >> Apr 8 14:58:08 kvm kernel: RSP: 0018:ffff88003418bda0 EFLAGS: 00010286 >> Apr 8 14:58:08 kvm kernel: RAX: 00000000ffffffff RBX: ffff88007ab74348 RCX: 0000000105585190 >> Apr 8 14:58:08 kvm kernel: RDX: 000000000000003a RSI: ffffffff81ade6a0 RDI: 0000000000000286 >> Apr 8 14:58:08 kvm kernel: RBP: ffff88003418be10 R08: 000000000000003f R09: 0000000000000003 >> Apr 8 14:58:08 kvm kernel: R10: 0000000000000002 R11: 0000000000008340 R12: ffff880004194718 >> Apr 8 14:58:08 kvm kernel: R13: ffff88004004e000 R14: ffff880034b9b000 R15: ffff88000c64a820 >> Apr 8 14:58:08 kvm kernel: FS: 0000000000000000(0000) GS:ffff88007da80000(0000) knlGS:0000000000000000 >> Apr 8 14:58:08 kvm kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> Apr 8 14:58:08 kvm kernel: CR2: 0000003842d454a4 CR3: 000000003d0a0000 CR4: 00000000000407e0 >> Apr 8 14:58:08 kvm kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> Apr 8 14:58:08 kvm kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> Apr 8 14:58:08 kvm kernel: Process btrfs-readahead (pid: 532, threadinfo ffff88003418a000, task ffff880076d3a040) >> Apr 8 14:58:08 kvm kernel: Stack: >> Apr 8 14:58:08 kvm kernel: ffff880078f06d40 ffff88007690ecd8 ffff88000c605758 ffff880078f06d80 >> Apr 8 14:58:08 kvm kernel: ffff880036916740 ffff88007ab742c0 0000000000000002 000000000000064f >> Apr 8 14:58:08 kvm kernel: ffff88003418be10 ffff88007690ecc0 ffff88007690ed10 ffff88007690ecd8 >> Apr 8 14:58:08 kvm kernel: Call Trace: >> Apr 8 14:58:08 kvm kernel: [<ffffffffa08032df>] worker_loop+0x14f/0x5a0 [btrfs] >> Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? btrfs_queue_worker+0x300/0x300 [btrfs] >> Apr 8 14:58:08 kvm kernel: [<ffffffffa0803190>] ? 
btrfs_queue_worker+0x300/0x300 [btrfs] >> Apr 8 14:58:08 kvm kernel: [<ffffffff8106f1ae>] kthread+0x9e/0xb0 >> Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea4>] kernel_thread_helper+0x4/0x10 >> Apr 8 14:58:08 kvm kernel: [<ffffffff8106f110>] ? kthread_freezable_should_stop+0x70/0x70 >> Apr 8 14:58:08 kvm kernel: [<ffffffff814fbea0>] ? gs_change+0x13/0x13 >> Apr 8 14:58:08 kvm kernel: Code: fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe be fe 01 00 00 48 c7 c7 58 7b 83 a0 e8 e9 dd 81 e0 44 8b 55 a8 e9 77 ff ff ff <0f> 0b eb fe 0f 0b eb fe 90 90 90 90 90 90 90 90 55 48 89 e5 41 >> Apr 8 14:58:08 kvm kernel: RIP [<ffffffffa082f800>] droptree_fetch_ref+0x4b0/0x4c0 [btrfs] >> >> >> thanks, > > > The script: > > umount /mnt/btrfs > mkfs.btrfs /dev/sdb7 > mount /dev/sdb7 /mnt/btrfs > > echo "fio" > fio fio.jobs > > echo "remount 1" > umount /mnt/btrfs; mount /dev/sdb7 /mnt/btrfs; > > for i in `seq 1 1 2000`; > do > btrfs sub snap /mnt/btrfs /mnt/btrfs/s$i > /dev/null 2>&1; > done > > echo "remount 2" > umount /mnt/btrfs; mount /dev/sdb7 /mnt/btrfs; > > for i in `seq 1 1 2000`; > do > btrfs sub delete /mnt/btrfs/s$i > /dev/null 2>&1; > done > > echo "umount" > time umount /mnt/btrfs > > fio.jobs: > > [global] > group_reporting > bs=4k > rw=randrw > sync=0 > ioengine=sync > directory=/mnt/btrfs/ > > [READ] > filename=foobar > size=200M > > thanks,unfortunatly it doesn''t crash for me. Have you tried with smaller numbers? Nearly all cases where it can run into this BUG are OOM related, besides a btrfs_map_block failure. Is this reproducible? Maybe you can add one or two strategic printks to see where it comes from... Thanks, Arne -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
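For what it's worth, the "strategic printk" suggested here could be as simple as the sketch below. The body of droptree.c is not quoted anywhere in this thread, so the surrounding condition and the variable names are assumptions for illustration only, not the actual code around droptree.c:418.

	/* hypothetical instrumentation just before the BUG_ON at droptree.c:418;
	 * 'ret' and 'logical' are assumed names, not taken from the real code */
	if (ret < 0) {
		printk(KERN_ERR "droptree: fetch_ref failed, ret=%d, logical=%llu\n",
		       ret, (unsigned long long)logical);
		BUG();
	}

A message like this in the log right before the oops would show which error path feeds the failing assertion.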
On 13.04.2012 09:43, Liu Bo wrote:
> On 04/13/2012 03:19 PM, Arne Jansen wrote:
>
>> might be out of memory. How much does this vm (?) have?
>> Can you try to reduce the constants in disk-io.c:2003-2005?
>>
>> Thanks,
>> Arne
>>
>
> Seems not related to an OOM:
> # free -m
>              total       used       free     shared    buffers     cached
> Mem:          1973        478       1494          0         49        315
> -/+ buffers/cache:        113       1859
> Swap:            0          0          0
>

so it seems I've lost a patch along the way that was supposed to fix
that: "btrfs: fix race in reada". I pushed my tree again with this
patch on top.

thanks,
Arne

> thanks,
> liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/17/2012 03:35 PM, Arne Jansen wrote:
> On 13.04.2012 09:43, Liu Bo wrote:
>> On 04/13/2012 03:19 PM, Arne Jansen wrote:
>>
>>> might be out of memory. How much does this vm (?) have?
>>> Can you try to reduce the constants in disk-io.c:2003-2005?
>>>
>>> Thanks,
>>> Arne
>>>
>>
>> Seems not related to an OOM:
>> # free -m
>>              total       used       free     shared    buffers     cached
>> Mem:          1973        478       1494          0         49        315
>> -/+ buffers/cache:        113       1859
>> Swap:            0          0          0
>>
>
> so it seems I've lost a patch along the way that was supposed to fix
> that: "btrfs: fix race in reada". I pushed my tree again with this
> patch on top.
>
> thanks,
> Arne
>

OK, I'll apply that patch and take some time to test it again.

thanks,
liubo

>> thanks,
>> liubo
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/17/2012 03:35 PM, Arne Jansen wrote:> On 13.04.2012 09:43, Liu Bo wrote: >> > On 04/13/2012 03:19 PM, Arne Jansen wrote: >> > >>> >> >>> >> might be out of memory. How much does this vm (?) have? >>> >> Can you try to reduce the constants in disk-io.c:2003-2005? >>> >> >>> >> Thanks, >>> >> Arne >>> >> >> > >> > >> > Seems not related to an OOM: >> > # free -m >> > total used free shared buffers cached >> > Mem: 1973 478 1494 0 49 315 >> > -/+ buffers/cache: 113 1859 >> > Swap: 0 0 0 >> > >> > > > so it seems I''ve lost a patch along the way that was supposed to fix > that: "btrfs: fix race in reada". I pushed my tree again with this > patch on top. >Hi Arne, Sorry for the long delay. I''ve tested the droptree patch (1/5->5/5) on the latest upstream 3.4-rc4 along with the missed patch "btrfs: fix race in reada". And I''ve got several different bugs and hangs, one of them is the following: Btrfs loaded device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 4 /dev/sdb6 btrfs: disk space caching is enabled Btrfs detected SSD devices, enabling SSD mode device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 8 /dev/sdb6 btrfs: disk space caching is enabled Btrfs detected SSD devices, enabling SSD mode device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 2011 /dev/sdb6 btrfs: disk space caching is enabled Btrfs detected SSD devices, enabling SSD mode btrfs: block rsv returned -28 ------------[ cut here ]------------ WARNING: at fs/btrfs/extent-tree.c:6220 btrfs_alloc_free_block+0x353/0x370 [btrfs]() Hardware name: QiTianM7150 Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_ helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs] Pid: 4605, comm: btrfs-transacti Tainted: G O 3.4.0-rc4+ #11 Call Trace: [<ffffffff8104d6df>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8104d73a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0726b43>] btrfs_alloc_free_block+0x353/0x370 [btrfs] [<ffffffffa0755c71>] ? read_extent_buffer+0xd1/0x130 [btrfs] [<ffffffffa072e04a>] ? btree_read_extent_buffer_pages.clone.2+0xca/0x140 [btrfs] [<ffffffffa0710a62>] __btrfs_cow_block+0x142/0x570 [btrfs] [<ffffffffa07130dd>] ? read_block_for_search+0x14d/0x3e0 [btrfs] [<ffffffffa0711492>] btrfs_cow_block+0x102/0x210 [btrfs] [<ffffffffa0716b5c>] btrfs_search_slot+0x42c/0x960 [btrfs] [<ffffffffa07853b9>] btrfs_delete_delayed_items+0x99/0x340 [btrfs] [<ffffffff81152662>] ? kmem_cache_alloc+0x152/0x190 [<ffffffffa0786212>] btrfs_run_delayed_items+0x112/0x160 [btrfs] [<ffffffffa07346df>] btrfs_commit_transaction+0x36f/0xa80 [btrfs] [<ffffffffa07351c2>] ? start_transaction+0x92/0x320 [btrfs] [<ffffffff8106fb60>] ? wake_up_bit+0x40/0x40 [<ffffffffa072fb5b>] transaction_kthread+0x26b/0x2e0 [btrfs] [<ffffffffa072f8f0>] ? 
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [<ffffffff8106f4be>] kthread+0x9e/0xb0 [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff814fc860>] ? gs_change+0x13/0x13 ---[ end trace 504e7bc5e13ed457 ]--- btrfs: block rsv returned -28 ------------[ cut here ]------------ WARNING: at fs/btrfs/extent-tree.c:6220 btrfs_alloc_free_block+0x353/0x370 [btrfs]() Hardware name: QiTianM7150 Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_ helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs] Pid: 4605, comm: btrfs-transacti Tainted: G W O 3.4.0-rc4+ #11 Call Trace: [<ffffffff8104d6df>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8104d73a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0726b43>] btrfs_alloc_free_block+0x353/0x370 [btrfs] [<ffffffffa0755c71>] ? read_extent_buffer+0xd1/0x130 [btrfs] [<ffffffffa072e04a>] ? btree_read_extent_buffer_pages.clone.2+0xca/0x140 [btrfs] [<ffffffffa0710a62>] __btrfs_cow_block+0x142/0x570 [btrfs] [<ffffffffa07130dd>] ? read_block_for_search+0x14d/0x3e0 [btrfs] [<ffffffffa0711492>] btrfs_cow_block+0x102/0x210 [btrfs] [<ffffffffa0716b5c>] btrfs_search_slot+0x42c/0x960 [btrfs] [<ffffffffa07853b9>] btrfs_delete_delayed_items+0x99/0x340 [btrfs] [<ffffffff81152662>] ? kmem_cache_alloc+0x152/0x190 [<ffffffffa0786212>] btrfs_run_delayed_items+0x112/0x160 [btrfs] [<ffffffffa07346df>] btrfs_commit_transaction+0x36f/0xa80 [btrfs] [<ffffffffa07351c2>] ? start_transaction+0x92/0x320 [btrfs] [<ffffffff8106fb60>] ? wake_up_bit+0x40/0x40 [<ffffffffa072fb5b>] transaction_kthread+0x26b/0x2e0 [btrfs] [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] [<ffffffff8106f4be>] kthread+0x9e/0xb0 [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff814fc860>] ? gs_change+0x13/0x13 ---[ end trace 504e7bc5e13ed458 ]--- ------------[ cut here ]------------ kernel BUG at fs/btrfs/transaction.c:1550! 
invalid opcode: 0000 [#1] SMP CPU 1 Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_ helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs] Pid: 4604, comm: btrfs-cleaner Tainted: G W O 3.4.0-rc4+ #11 LENOVO QiTianM7150/To be filled by O.E.M. RIP: 0010:[<ffffffffa0732de0>] [<ffffffffa0732de0>] btrfs_clean_old_snapshots+0x1d0/0x1e0 [btrfs] RSP: 0018:ffff880057113df0 EFLAGS: 00010286 RAX: 00000000ffff8800 RBX: ffff880057113e10 RCX: ffff880052166780 RDX: ffff880057113e10 RSI: ffff880057113e10 RDI: ffff880052166780 RBP: ffff880057113e60 R08: ffff880057113e10 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff880000000000 R13: 0000160000000000 R14: ffff880079ecb800 R15: ffff880057113e20 FS: 0000000000000000(0000) GS:ffff88007da80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003842cd3ac0 CR3: 00000000542eb000 CR4: 00000000000407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process btrfs-cleaner (pid: 4604, threadinfo ffff880057112000, task ffff880075d60b30) Stack: ffff880052166780 ffff880076918000 ffff880057113fd8 ffff880052166400 ffff880052166780 ffff880052166780 ffff880052166f80 ffff880079f16f80 ffff880057113e80 ffff880079ecb800 ffff880057113e80 ffff880057113e98 Call Trace: [<ffffffffa072d360>] cleaner_kthread+0x160/0x1c0 [btrfs] [<ffffffffa072d200>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] [<ffffffffa072d200>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] [<ffffffff8106f4be>] kthread+0x9e/0xb0 [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff814fc860>] ? gs_change+0x13/0x13 Code: 22 b2 e0 31 c9 31 f6 ba 01 00 00 00 4c 89 e7 e8 47 24 ff ff 48 39 5d b0 75 d9 48 83 c4 48 31 c0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 RIP [<ffffffffa0732de0>] btrfs_clean_old_snapshots+0x1d0/0x1e0 [btrfs] RSP <ffff880057113df0> ---[ end trace 504e7bc5e13ed459 ]--- kernel BUG at fs/btrfs/transaction.c:1550! refers to int btrfs_clean_old_snapshots(struct btrfs_root *root) { [...] while (!list_empty(&list)) { [...] BUG_ON(ret < 0); ---> the bug } thanks, liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Liu Bo, thanks for testing :) so one block reserve ran out of space. Did you use the same test as before? On SSD? Thanks, Arne On 27.04.2012 05:16, Liu Bo wrote:> > Sorry for the long delay. > > I''ve tested the droptree patch (1/5->5/5) on the latest upstream 3.4-rc4 > along with the missed patch "btrfs: fix race in reada". > > And I''ve got several different bugs and hangs, one of them is the following: > > Btrfs loaded > device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 4 /dev/sdb6 > btrfs: disk space caching is enabled > Btrfs detected SSD devices, enabling SSD mode > device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 8 /dev/sdb6 > btrfs: disk space caching is enabled > Btrfs detected SSD devices, enabling SSD mode > device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 2011 /dev/sdb6 > btrfs: disk space caching is enabled > Btrfs detected SSD devices, enabling SSD mode > btrfs: block rsv returned -28 > ------------[ cut here ]------------ > WARNING: at fs/btrfs/extent-tree.c:6220 btrfs_alloc_free_block+0x353/0x370 [btrfs]() > Hardware name: QiTianM7150 > Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]> Pid: 4605, comm: btrfs-transacti Tainted: G O 3.4.0-rc4+ #11 > Call Trace: > [<ffffffff8104d6df>] warn_slowpath_common+0x7f/0xc0 > [<ffffffff8104d73a>] warn_slowpath_null+0x1a/0x20 > [<ffffffffa0726b43>] btrfs_alloc_free_block+0x353/0x370 [btrfs] > [<ffffffffa0755c71>] ? read_extent_buffer+0xd1/0x130 [btrfs] > [<ffffffffa072e04a>] ? btree_read_extent_buffer_pages.clone.2+0xca/0x140 [btrfs] > [<ffffffffa0710a62>] __btrfs_cow_block+0x142/0x570 [btrfs] > [<ffffffffa07130dd>] ? read_block_for_search+0x14d/0x3e0 [btrfs] > [<ffffffffa0711492>] btrfs_cow_block+0x102/0x210 [btrfs] > [<ffffffffa0716b5c>] btrfs_search_slot+0x42c/0x960 [btrfs] > [<ffffffffa07853b9>] btrfs_delete_delayed_items+0x99/0x340 [btrfs] > [<ffffffff81152662>] ? kmem_cache_alloc+0x152/0x190 > [<ffffffffa0786212>] btrfs_run_delayed_items+0x112/0x160 [btrfs] > [<ffffffffa07346df>] btrfs_commit_transaction+0x36f/0xa80 [btrfs] > [<ffffffffa07351c2>] ? start_transaction+0x92/0x320 [btrfs] > [<ffffffff8106fb60>] ? wake_up_bit+0x40/0x40 > [<ffffffffa072fb5b>] transaction_kthread+0x26b/0x2e0 [btrfs] > [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] > [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] > [<ffffffff8106f4be>] kthread+0x9e/0xb0 > [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 > [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 > [<ffffffff814fc860>] ? 
gs_change+0x13/0x13 > ---[ end trace 504e7bc5e13ed457 ]--- > btrfs: block rsv returned -28 > ------------[ cut here ]------------ > WARNING: at fs/btrfs/extent-tree.c:6220 btrfs_alloc_free_block+0x353/0x370 [btrfs]() > Hardware name: QiTianM7150 > Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]> Pid: 4605, comm: btrfs-transacti Tainted: G W O 3.4.0-rc4+ #11 > Call Trace: > [<ffffffff8104d6df>] warn_slowpath_common+0x7f/0xc0 > [<ffffffff8104d73a>] warn_slowpath_null+0x1a/0x20 > [<ffffffffa0726b43>] btrfs_alloc_free_block+0x353/0x370 [btrfs] > [<ffffffffa0755c71>] ? read_extent_buffer+0xd1/0x130 [btrfs] > [<ffffffffa072e04a>] ? btree_read_extent_buffer_pages.clone.2+0xca/0x140 [btrfs] > [<ffffffffa0710a62>] __btrfs_cow_block+0x142/0x570 [btrfs] > [<ffffffffa07130dd>] ? read_block_for_search+0x14d/0x3e0 [btrfs] > [<ffffffffa0711492>] btrfs_cow_block+0x102/0x210 [btrfs] > [<ffffffffa0716b5c>] btrfs_search_slot+0x42c/0x960 [btrfs] > [<ffffffffa07853b9>] btrfs_delete_delayed_items+0x99/0x340 [btrfs] > [<ffffffff81152662>] ? kmem_cache_alloc+0x152/0x190 > [<ffffffffa0786212>] btrfs_run_delayed_items+0x112/0x160 [btrfs] > [<ffffffffa07346df>] btrfs_commit_transaction+0x36f/0xa80 [btrfs] > [<ffffffffa07351c2>] ? start_transaction+0x92/0x320 [btrfs] > [<ffffffff8106fb60>] ? wake_up_bit+0x40/0x40 > [<ffffffffa072fb5b>] transaction_kthread+0x26b/0x2e0 [btrfs] > [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] > [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] > [<ffffffff8106f4be>] kthread+0x9e/0xb0 > [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 > [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 > [<ffffffff814fc860>] ? gs_change+0x13/0x13 > ---[ end trace 504e7bc5e13ed458 ]--- > ------------[ cut here ]------------ > kernel BUG at fs/btrfs/transaction.c:1550! 
> invalid opcode: 0000 [#1] SMP > CPU 1 > Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]> > Pid: 4604, comm: btrfs-cleaner Tainted: G W O 3.4.0-rc4+ #11 LENOVO QiTianM7150/To be filled by O.E.M. > RIP: 0010:[<ffffffffa0732de0>] [<ffffffffa0732de0>] btrfs_clean_old_snapshots+0x1d0/0x1e0 [btrfs] > RSP: 0018:ffff880057113df0 EFLAGS: 00010286 > RAX: 00000000ffff8800 RBX: ffff880057113e10 RCX: ffff880052166780 > RDX: ffff880057113e10 RSI: ffff880057113e10 RDI: ffff880052166780 > RBP: ffff880057113e60 R08: ffff880057113e10 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000001 R12: ffff880000000000 > R13: 0000160000000000 R14: ffff880079ecb800 R15: ffff880057113e20 > FS: 0000000000000000(0000) GS:ffff88007da80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 0000003842cd3ac0 CR3: 00000000542eb000 CR4: 00000000000407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process btrfs-cleaner (pid: 4604, threadinfo ffff880057112000, task ffff880075d60b30) > Stack: > ffff880052166780 ffff880076918000 ffff880057113fd8 ffff880052166400 > ffff880052166780 ffff880052166780 ffff880052166f80 ffff880079f16f80 > ffff880057113e80 ffff880079ecb800 ffff880057113e80 ffff880057113e98 > Call Trace: > [<ffffffffa072d360>] cleaner_kthread+0x160/0x1c0 [btrfs] > [<ffffffffa072d200>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] > [<ffffffffa072d200>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] > [<ffffffff8106f4be>] kthread+0x9e/0xb0 > [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 > [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 > [<ffffffff814fc860>] ? gs_change+0x13/0x13 > Code: 22 b2 e0 31 c9 31 f6 ba 01 00 00 00 4c 89 e7 e8 47 24 ff ff 48 39 5d b0 75 d9 48 83 c4 48 31 c0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 > RIP [<ffffffffa0732de0>] btrfs_clean_old_snapshots+0x1d0/0x1e0 [btrfs] > RSP <ffff880057113df0> > ---[ end trace 504e7bc5e13ed459 ]--- > > > kernel BUG at fs/btrfs/transaction.c:1550! > refers to > int btrfs_clean_old_snapshots(struct btrfs_root *root) > { > [...] > while (!list_empty(&list)) { > [...] > BUG_ON(ret < 0); ---> the bug > } > > thanks, > liubo-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/27/2012 02:13 PM, Arne Jansen wrote:> Hi Liu Bo, > > thanks for testing :) > > so one block reserve ran out of space. Did you use the same test as > before? On SSD? >Yes, the same test on SSD, but with different number of snapshot: 1000, 2000 respectively :) For the block reserve ENOSPC, I guess it is the global rsv, since we''ve just reverted a patch about that in rc4. Have you tried your patch on the latest upstream 3.4-rc4? Still no crash? ps: I''ve made two small changes while applying the patchset, which locates eb->first_page and lock_extent_bit, and I don''t think these can lead to bugs. thanks, liubo> Thanks, > Arne > > On 27.04.2012 05:16, Liu Bo wrote: >> Sorry for the long delay. >> >> I''ve tested the droptree patch (1/5->5/5) on the latest upstream 3.4-rc4 >> along with the missed patch "btrfs: fix race in reada". >> >> And I''ve got several different bugs and hangs, one of them is the following: >> >> Btrfs loaded >> device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 4 /dev/sdb6 >> btrfs: disk space caching is enabled >> Btrfs detected SSD devices, enabling SSD mode >> device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 8 /dev/sdb6 >> btrfs: disk space caching is enabled >> Btrfs detected SSD devices, enabling SSD mode >> device fsid e7b013a1-0d11-4162-83de-8404360f520a devid 1 transid 2011 /dev/sdb6 >> btrfs: disk space caching is enabled >> Btrfs detected SSD devices, enabling SSD mode >> btrfs: block rsv returned -28 >> ------------[ cut here ]------------ >> WARNING: at fs/btrfs/extent-tree.c:6220 btrfs_alloc_free_block+0x353/0x370 [btrfs]() >> Hardware name: QiTianM7150 >> Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]>> Pid: 4605, comm: btrfs-transacti Tainted: G O 3.4.0-rc4+ #11 >> Call Trace: >> [<ffffffff8104d6df>] warn_slowpath_common+0x7f/0xc0 >> [<ffffffff8104d73a>] warn_slowpath_null+0x1a/0x20 >> [<ffffffffa0726b43>] btrfs_alloc_free_block+0x353/0x370 [btrfs] >> [<ffffffffa0755c71>] ? read_extent_buffer+0xd1/0x130 [btrfs] >> [<ffffffffa072e04a>] ? btree_read_extent_buffer_pages.clone.2+0xca/0x140 [btrfs] >> [<ffffffffa0710a62>] __btrfs_cow_block+0x142/0x570 [btrfs] >> [<ffffffffa07130dd>] ? read_block_for_search+0x14d/0x3e0 [btrfs] >> [<ffffffffa0711492>] btrfs_cow_block+0x102/0x210 [btrfs] >> [<ffffffffa0716b5c>] btrfs_search_slot+0x42c/0x960 [btrfs] >> [<ffffffffa07853b9>] btrfs_delete_delayed_items+0x99/0x340 [btrfs] >> [<ffffffff81152662>] ? kmem_cache_alloc+0x152/0x190 >> [<ffffffffa0786212>] btrfs_run_delayed_items+0x112/0x160 [btrfs] >> [<ffffffffa07346df>] btrfs_commit_transaction+0x36f/0xa80 [btrfs] >> [<ffffffffa07351c2>] ? start_transaction+0x92/0x320 [btrfs] >> [<ffffffff8106fb60>] ? 
wake_up_bit+0x40/0x40 >> [<ffffffffa072fb5b>] transaction_kthread+0x26b/0x2e0 [btrfs] >> [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] >> [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] >> [<ffffffff8106f4be>] kthread+0x9e/0xb0 >> [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 >> [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 >> [<ffffffff814fc860>] ? gs_change+0x13/0x13 >> ---[ end trace 504e7bc5e13ed457 ]--- >> btrfs: block rsv returned -28 >> ------------[ cut here ]------------ >> WARNING: at fs/btrfs/extent-tree.c:6220 btrfs_alloc_free_block+0x353/0x370 [btrfs]() >> Hardware name: QiTianM7150 >> Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]>> Pid: 4605, comm: btrfs-transacti Tainted: G W O 3.4.0-rc4+ #11 >> Call Trace: >> [<ffffffff8104d6df>] warn_slowpath_common+0x7f/0xc0 >> [<ffffffff8104d73a>] warn_slowpath_null+0x1a/0x20 >> [<ffffffffa0726b43>] btrfs_alloc_free_block+0x353/0x370 [btrfs] >> [<ffffffffa0755c71>] ? read_extent_buffer+0xd1/0x130 [btrfs] >> [<ffffffffa072e04a>] ? btree_read_extent_buffer_pages.clone.2+0xca/0x140 [btrfs] >> [<ffffffffa0710a62>] __btrfs_cow_block+0x142/0x570 [btrfs] >> [<ffffffffa07130dd>] ? read_block_for_search+0x14d/0x3e0 [btrfs] >> [<ffffffffa0711492>] btrfs_cow_block+0x102/0x210 [btrfs] >> [<ffffffffa0716b5c>] btrfs_search_slot+0x42c/0x960 [btrfs] >> [<ffffffffa07853b9>] btrfs_delete_delayed_items+0x99/0x340 [btrfs] >> [<ffffffff81152662>] ? kmem_cache_alloc+0x152/0x190 >> [<ffffffffa0786212>] btrfs_run_delayed_items+0x112/0x160 [btrfs] >> [<ffffffffa07346df>] btrfs_commit_transaction+0x36f/0xa80 [btrfs] >> [<ffffffffa07351c2>] ? start_transaction+0x92/0x320 [btrfs] >> [<ffffffff8106fb60>] ? wake_up_bit+0x40/0x40 >> [<ffffffffa072fb5b>] transaction_kthread+0x26b/0x2e0 [btrfs] >> [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] >> [<ffffffffa072f8f0>] ? btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs] >> [<ffffffff8106f4be>] kthread+0x9e/0xb0 >> [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 >> [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 >> [<ffffffff814fc860>] ? gs_change+0x13/0x13 >> ---[ end trace 504e7bc5e13ed458 ]--- >> ------------[ cut here ]------------ >> kernel BUG at fs/btrfs/transaction.c:1550! 
>> invalid opcode: 0000 [#1] SMP >> CPU 1 >> Modules linked in: btrfs(O) zlib_deflate libcrc32c ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables ipt_REJECT bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm ppdev sg parport_pc parport coretemp hwmon i2c_i801 pcspkr iTCO_wdt iTCO_vendor_support sky2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: btrfs]>> >> Pid: 4604, comm: btrfs-cleaner Tainted: G W O 3.4.0-rc4+ #11 LENOVO QiTianM7150/To be filled by O.E.M. >> RIP: 0010:[<ffffffffa0732de0>] [<ffffffffa0732de0>] btrfs_clean_old_snapshots+0x1d0/0x1e0 [btrfs] >> RSP: 0018:ffff880057113df0 EFLAGS: 00010286 >> RAX: 00000000ffff8800 RBX: ffff880057113e10 RCX: ffff880052166780 >> RDX: ffff880057113e10 RSI: ffff880057113e10 RDI: ffff880052166780 >> RBP: ffff880057113e60 R08: ffff880057113e10 R09: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000001 R12: ffff880000000000 >> R13: 0000160000000000 R14: ffff880079ecb800 R15: ffff880057113e20 >> FS: 0000000000000000(0000) GS:ffff88007da80000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> CR2: 0000003842cd3ac0 CR3: 00000000542eb000 CR4: 00000000000407e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> Process btrfs-cleaner (pid: 4604, threadinfo ffff880057112000, task ffff880075d60b30) >> Stack: >> ffff880052166780 ffff880076918000 ffff880057113fd8 ffff880052166400 >> ffff880052166780 ffff880052166780 ffff880052166f80 ffff880079f16f80 >> ffff880057113e80 ffff880079ecb800 ffff880057113e80 ffff880057113e98 >> Call Trace: >> [<ffffffffa072d360>] cleaner_kthread+0x160/0x1c0 [btrfs] >> [<ffffffffa072d200>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] >> [<ffffffffa072d200>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] >> [<ffffffff8106f4be>] kthread+0x9e/0xb0 >> [<ffffffff814fc864>] kernel_thread_helper+0x4/0x10 >> [<ffffffff8106f420>] ? kthread_freezable_should_stop+0x70/0x70 >> [<ffffffff814fc860>] ? gs_change+0x13/0x13 >> Code: 22 b2 e0 31 c9 31 f6 ba 01 00 00 00 4c 89 e7 e8 47 24 ff ff 48 39 5d b0 75 d9 48 83 c4 48 31 c0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 >> RIP [<ffffffffa0732de0>] btrfs_clean_old_snapshots+0x1d0/0x1e0 [btrfs] >> RSP <ffff880057113df0> >> ---[ end trace 504e7bc5e13ed459 ]--- >> >> >> kernel BUG at fs/btrfs/transaction.c:1550! >> refers to >> int btrfs_clean_old_snapshots(struct btrfs_root *root) >> { >> [...] >> while (!list_empty(&list)) { >> [...] >> BUG_ON(ret < 0); ---> the bug >> } >> >> thanks, >> liubo > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2012 at 05:54:38PM +0200, Arne Jansen wrote:> @@ -97,30 +119,87 @@ struct reada_machine_work { > +/* > + * this is the default callback for readahead. It just descends into the > + * tree within the range given at creation. if an error occurs, just cut > + * this part of the tree > + */ > +static void readahead_descend(struct btrfs_root *root, struct reada_control *rc, > + u64 wanted_generation, struct extent_buffer *eb, > + u64 start, int err, struct btrfs_key *top, > + void *ctx) > +{ > + int nritems; > + u64 generation; > + int level; > + int i; > + > + BUG_ON(err == -EAGAIN); /* FIXME: not yet implemented, don''t cancel > + * readahead with default callback */ > + > + if (err || eb == NULL) { > + /* > + * this is the error case, the extent buffer has not been > + * read correctly. We won''t access anything from it and > + * just cleanup our data structures. Effectively this will > + * cut the branch below this node from read ahead. > + */ > + return; > + } > + > + level = btrfs_header_level(eb); > + if (level == 0) { > + /* > + * if this is a leaf, ignore the content. > + */ > + return; > + } > + > + nritems = btrfs_header_nritems(eb); > + generation = btrfs_header_generation(eb); > + > + /* > + * if the generation doesn''t match, just ignore this node. > + * This will cut off a branch from prefetch. Alternatively one could > + * start a new (sub-) prefetch for this branch, starting again from > + * root. > + */ > + if (wanted_generation != generation) > + return;I think I saw passing wanted_generation = 0 somewheree, but cannot find it now again. Is it an expected value for the default RA callback, meaning eg. ''any generation I find'' ?> + > + for (i = 0; i < nritems; i++) { > + u64 n_gen; > + struct btrfs_key key; > + struct btrfs_key next_key; > + u64 bytenr; > + > + btrfs_node_key_to_cpu(eb, &key, i); > + if (i + 1 < nritems) > + btrfs_node_key_to_cpu(eb, &next_key, i + 1); > + else > + next_key = *top; > + bytenr = btrfs_node_blockptr(eb, i); > + n_gen = btrfs_node_ptr_generation(eb, i); > + > + if (btrfs_comp_cpu_keys(&key, &rc->key_end) < 0 && > + btrfs_comp_cpu_keys(&next_key, &rc->key_start) > 0) > + reada_add_block(rc, bytenr, &next_key, > + level - 1, n_gen, ctx); > + } > +} > > @@ -142,65 +221,21 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb, > re->scheduled_for = NULL; > spin_unlock(&re->lock); > > - if (err == 0) { > - nritems = level ? btrfs_header_nritems(eb) : 0; > - generation = btrfs_header_generation(eb); > - /* > - * FIXME: currently we just set nritems to 0 if this is a leaf, > - * effectively ignoring the content. In a next step we could > - * trigger more readahead depending from the content, e.g. > - * fetch the checksums for the extents in the leaf. > - */ > - } else { > + /* > + * call hooks for all registered readaheads > + */ > + list_for_each_entry(rec, &list, list) { > + btrfs_tree_read_lock(eb); > /* > - * this is the error case, the extent buffer has not been > - * read correctly. We won''t access anything from it and > - * just cleanup our data structures. Effectively this will > - * cut the branch below this node from read ahead. > + * we set the lock to blocking, as the callback might want to > + * sleep on allocations.What about a more finer control given to the callbacks? The blocking lock may be unnecessary if the callback does not sleep. 
My idea is to add a field to ''struct reada_uptodate_ctx'', preset with BTRFS_READ_LOCK by default, but let the RA user to set it to its needs.> */ > - nritems = 0; > - generation = 0; > + btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); > + rec->rc->callback(root, rec->rc, rec->generation, eb, start, > + err, &re->top, rec->ctx); > + btrfs_tree_read_unlock_blocking(eb); > } > > @@ -521,12 +593,87 @@ static void reada_control_release(struct kref *kref) > +/* > + * context to pass from reada_add_block to worker in case the extent is > + * already uptodate in memory > + */ > +struct reada_uptodate_ctx { > + struct btrfs_key top; > + struct extent_buffer *eb; > + struct reada_control *rc; > + u64 logical; > + u64 generation; > + void *ctx; > + struct btrfs_work work;eg. int lock_type; int want_lock_blocking;> +}; > + > +/* worker for immediate processing of uptodate blocks */ > +static void reada_add_block_uptodate(struct btrfs_work *work) > +{ > + struct reada_uptodate_ctx *ruc; > + > + ruc = container_of(work, struct reada_uptodate_ctx, work); > + > + btrfs_tree_read_lock(ruc->eb); > + /* > + * we set the lock to blocking, as the callback might want to sleep > + * on allocations. > + */ > + btrfs_set_lock_blocking_rw(ruc->eb, BTRFS_READ_LOCK);same here> + ruc->rc->callback(ruc->rc->root, ruc->rc, ruc->generation, ruc->eb, > + ruc->logical, 0, &ruc->top, ruc->ctx); > + btrfs_tree_read_unlock_blocking(ruc->eb); > + > + reada_control_elem_put(ruc->rc); > + free_extent_buffer(ruc->eb); > + kfree(ruc); > +} > + > @@ -886,17 +1074,18 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root, > .offset = (u64)-1 > }; > > - rc = kzalloc(sizeof(*rc), GFP_NOFS); > + rc = btrfs_reada_alloc(parent, root, key_start, key_end, callback); > if (!rc) > - return ERR_PTR(-ENOMEM); > + return -ENOMEM; > > - rc->root = root; > - rc->key_start = *key_start; > - rc->key_end = *key_end; > - atomic_set(&rc->elems, 0); > - init_waitqueue_head(&rc->wait); > - kref_init(&rc->refcnt); > - kref_get(&rc->refcnt); /* one ref for having elements */ > + if (rcp) { > + *rcp = rc; > + /* > + * as we return the rc, get an addition ref on it for(additional)> + * the caller > + */ > + kref_get(&rc->refcnt); > + } > > node = btrfs_root_node(root); > start = node->start; > @@ -904,35 +1093,36 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root, > +int btrfs_reada_wait(struct reada_control *rc) > { > - struct reada_control *rc = handle; > + struct btrfs_fs_info *fs_info = rc->root->fs_info; > + int i; > > while (atomic_read(&rc->elems)) { > wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0, > - 5 * HZ); > - dump_devs(rc->root->fs_info, rc->elems < 10 ? 1 : 0); > + 1 * HZ);I think it''s recommended to use msecs_to_jiffies instead of HZ.> + dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0); > + printk(KERN_DEBUG "reada_wait on %p: %d elems\n", rc, > + atomic_read(&rc->elems)); > } > > - dump_devs(rc->root->fs_info, rc->elems < 10 ? 1 : 0); > + dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0); > > kref_put(&rc->refcnt, reada_control_release); > > return 0; > }-- The reference counting changed in a non-trivial way, I''d like to have another look just at that to be sure, but from current review round it looks ok. The RA changes look independet, do you intend to submit it earlier that with the whole droptree? It''d get wider testing. 
david
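A rough sketch of the per-request lock flag suggested above, in the context of fs/btrfs/reada.c: the worker only upgrades to a blocking lock when the readahead user asked for it. The want_lock_blocking field and the split unlock path are assumptions for illustration, not part of the posted patch.

	struct reada_uptodate_ctx {
		struct btrfs_key top;
		struct extent_buffer *eb;
		struct reada_control *rc;
		u64 logical;
		u64 generation;
		void *ctx;
		struct btrfs_work work;
		int want_lock_blocking;	/* hypothetical: set by the RA user */
	};

	static void reada_add_block_uptodate(struct btrfs_work *work)
	{
		struct reada_uptodate_ctx *ruc;

		ruc = container_of(work, struct reada_uptodate_ctx, work);

		btrfs_tree_read_lock(ruc->eb);
		if (ruc->want_lock_blocking) {
			/* callback may sleep on allocations */
			btrfs_set_lock_blocking_rw(ruc->eb, BTRFS_READ_LOCK);
			ruc->rc->callback(ruc->rc->root, ruc->rc, ruc->generation,
					  ruc->eb, ruc->logical, 0, &ruc->top,
					  ruc->ctx);
			btrfs_tree_read_unlock_blocking(ruc->eb);
		} else {
			/* callback promised not to sleep, keep the spinning lock */
			ruc->rc->callback(ruc->rc->root, ruc->rc, ruc->generation,
					  ruc->eb, ruc->logical, 0, &ruc->top,
					  ruc->ctx);
			btrfs_tree_read_unlock(ruc->eb);
		}

		reada_control_elem_put(ruc->rc);
		free_extent_buffer(ruc->eb);
		kfree(ruc);
	}

Whether such a flag would live per request or per reada_control is exactly the trade-off discussed in the reply below.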
On 05/09/12 16:48, David Sterba wrote:
> On Thu, Apr 12, 2012 at 05:54:38PM +0200, Arne Jansen wrote:
>> @@ -97,30 +119,87 @@ struct reada_machine_work {
>> +/*
>> + * this is the default callback for readahead. It just descends into the
>> + * tree within the range given at creation. if an error occurs, just cut
>> + * this part of the tree
>> + */
>> +static void readahead_descend(struct btrfs_root *root, struct reada_control *rc,
>> +			       u64 wanted_generation, struct extent_buffer *eb,
>> +			       u64 start, int err, struct btrfs_key *top,
>> +			       void *ctx)
>> +{
>> +	int nritems;
>> +	u64 generation;
>> +	int level;
>> +	int i;
>> +
>> +	BUG_ON(err == -EAGAIN); /* FIXME: not yet implemented, don't cancel
>> +				 * readahead with default callback */
>> +
>> +	if (err || eb == NULL) {
>> +		/*
>> +		 * this is the error case, the extent buffer has not been
>> +		 * read correctly. We won't access anything from it and
>> +		 * just cleanup our data structures. Effectively this will
>> +		 * cut the branch below this node from read ahead.
>> +		 */
>> +		return;
>> +	}
>> +
>> +	level = btrfs_header_level(eb);
>> +	if (level == 0) {
>> +		/*
>> +		 * if this is a leaf, ignore the content.
>> +		 */
>> +		return;
>> +	}
>> +
>> +	nritems = btrfs_header_nritems(eb);
>> +	generation = btrfs_header_generation(eb);
>> +
>> +	/*
>> +	 * if the generation doesn't match, just ignore this node.
>> +	 * This will cut off a branch from prefetch. Alternatively one could
>> +	 * start a new (sub-) prefetch for this branch, starting again from
>> +	 * root.
>> +	 */
>> +	if (wanted_generation != generation)
>> +		return;
>
> I think I saw passing wanted_generation = 0 somewhere, but cannot find
> it now again. Is it an expected value for the default RA callback,
> meaning e.g. 'any generation I find'?

No. This here is just the default callback. You've seen
wanted_generation = 0 in the droptree code, where a custom callback is
set that doesn't check the generation.

>
>> +
>> +	for (i = 0; i < nritems; i++) {
>> +		u64 n_gen;
>> +		struct btrfs_key key;
>> +		struct btrfs_key next_key;
>> +		u64 bytenr;
>> +
>> +		btrfs_node_key_to_cpu(eb, &key, i);
>> +		if (i + 1 < nritems)
>> +			btrfs_node_key_to_cpu(eb, &next_key, i + 1);
>> +		else
>> +			next_key = *top;
>> +		bytenr = btrfs_node_blockptr(eb, i);
>> +		n_gen = btrfs_node_ptr_generation(eb, i);
>> +
>> +		if (btrfs_comp_cpu_keys(&key, &rc->key_end) < 0 &&
>> +		    btrfs_comp_cpu_keys(&next_key, &rc->key_start) > 0)
>> +			reada_add_block(rc, bytenr, &next_key,
>> +					level - 1, n_gen, ctx);
>> +	}
>> +}
>>
>> @@ -142,65 +221,21 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
>>  	re->scheduled_for = NULL;
>>  	spin_unlock(&re->lock);
>>
>> -	if (err == 0) {
>> -		nritems = level ? btrfs_header_nritems(eb) : 0;
>> -		generation = btrfs_header_generation(eb);
>> -		/*
>> -		 * FIXME: currently we just set nritems to 0 if this is a leaf,
>> -		 * effectively ignoring the content. In a next step we could
>> -		 * trigger more readahead depending from the content, e.g.
>> -		 * fetch the checksums for the extents in the leaf.
>> -		 */
>> -	} else {
>> +	/*
>> +	 * call hooks for all registered readaheads
>> +	 */
>> +	list_for_each_entry(rec, &list, list) {
>> +		btrfs_tree_read_lock(eb);
>>  		/*
>> -		 * this is the error case, the extent buffer has not been
>> -		 * read correctly. We won't access anything from it and
>> -		 * just cleanup our data structures. Effectively this will
>> -		 * cut the branch below this node from read ahead.
>> +		 * we set the lock to blocking, as the callback might want to
>> +		 * sleep on allocations.
>
> What about finer control given to the callbacks? The blocking lock may
> be unnecessary if the callback does not sleep.

I thought about that, but it would add a bit more complexity, so I
decided on the simpler version for the first run. There is definitely
room for optimization here.

>
> My idea is to add a field to 'struct reada_uptodate_ctx', preset with
> BTRFS_READ_LOCK by default, but let the RA user set it to its needs.

The struct is only used in the special case that the extent is already
uptodate, to pass the parameters to the worker. The user has no
influence on that. It could either be stored per request in struct
reada_extctl or per readahead in struct reada_control, but this would
also not be optimal. The better way would be to always pass a spinning
lock and let the user change it to blocking if needed. The problem is
that the user would need to pass the info back, as there's no single
function that can unlock both a spinning and a blocking read lock. I
only had a short look at it and haven't found an easy way to build it.
Or, just leave the locking completely to the user (the callback).

>
>>  		 */
>> -		nritems = 0;
>> -		generation = 0;
>> +		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>> +		rec->rc->callback(root, rec->rc, rec->generation, eb, start,
>> +				  err, &re->top, rec->ctx);
>> +		btrfs_tree_read_unlock_blocking(eb);
>>  	}
>>
>> @@ -521,12 +593,87 @@ static void reada_control_release(struct kref *kref)
>> +/*
>> + * context to pass from reada_add_block to worker in case the extent is
>> + * already uptodate in memory
>> + */
>> +struct reada_uptodate_ctx {
>> +	struct btrfs_key top;
>> +	struct extent_buffer *eb;
>> +	struct reada_control *rc;
>> +	u64 logical;
>> +	u64 generation;
>> +	void *ctx;
>> +	struct btrfs_work work;
>
> e.g.
> 	int lock_type;
> 	int want_lock_blocking;
>
>> +};
>> +
>> +/* worker for immediate processing of uptodate blocks */
>> +static void reada_add_block_uptodate(struct btrfs_work *work)
>> +{
>> +	struct reada_uptodate_ctx *ruc;
>> +
>> +	ruc = container_of(work, struct reada_uptodate_ctx, work);
>> +
>> +	btrfs_tree_read_lock(ruc->eb);
>> +	/*
>> +	 * we set the lock to blocking, as the callback might want to sleep
>> +	 * on allocations.
>> +	 */
>> +	btrfs_set_lock_blocking_rw(ruc->eb, BTRFS_READ_LOCK);
>
> same here
>
>> +	ruc->rc->callback(ruc->rc->root, ruc->rc, ruc->generation, ruc->eb,
>> +			  ruc->logical, 0, &ruc->top, ruc->ctx);
>> +	btrfs_tree_read_unlock_blocking(ruc->eb);
>> +
>> +	reada_control_elem_put(ruc->rc);
>> +	free_extent_buffer(ruc->eb);
>> +	kfree(ruc);
>> +}
>> +
>> @@ -886,17 +1074,18 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root,
>>  		.offset = (u64)-1
>>  	};
>>
>> -	rc = kzalloc(sizeof(*rc), GFP_NOFS);
>> +	rc = btrfs_reada_alloc(parent, root, key_start, key_end, callback);
>>  	if (!rc)
>> -		return ERR_PTR(-ENOMEM);
>> +		return -ENOMEM;
>>
>> -	rc->root = root;
>> -	rc->key_start = *key_start;
>> -	rc->key_end = *key_end;
>> -	atomic_set(&rc->elems, 0);
>> -	init_waitqueue_head(&rc->wait);
>> -	kref_init(&rc->refcnt);
>> -	kref_get(&rc->refcnt); /* one ref for having elements */
>> +	if (rcp) {
>> +		*rcp = rc;
>> +		/*
>> +		 * as we return the rc, get an addition ref on it for
>
> (additional)
>
>> +		 * the caller
>> +		 */
>> +		kref_get(&rc->refcnt);
>> +	}
>>
>>  	node = btrfs_root_node(root);
>>  	start = node->start;
>> @@ -904,35 +1093,36 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root,
>> +int btrfs_reada_wait(struct reada_control *rc)
>>  {
>> -	struct reada_control *rc = handle;
>> +	struct btrfs_fs_info *fs_info = rc->root->fs_info;
>> +	int i;
>>
>>  	while (atomic_read(&rc->elems)) {
>>  		wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
>> -				   5 * HZ);
>> -		dump_devs(rc->root->fs_info, rc->elems < 10 ? 1 : 0);
>> +				   1 * HZ);
>
> I think it's recommended to use msecs_to_jiffies instead of HZ.
>
>> +		dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0);
>> +		printk(KERN_DEBUG "reada_wait on %p: %d elems\n", rc,
>> +		       atomic_read(&rc->elems));
>>  	}
>>
>> -	dump_devs(rc->root->fs_info, rc->elems < 10 ? 1 : 0);
>> +	dump_devs(fs_info, atomic_read(&rc->elems) < 10 ? 1 : 0);
>>
>>  	kref_put(&rc->refcnt, reada_control_release);
>>
>>  	return 0;
>>  }
>
> --
>
> The reference counting changed in a non-trivial way, I'd like to have
> another look just at that to be sure, but from the current review round
> it looks ok.
>
> The RA changes look independent, do you intend to submit it earlier
> than with the whole droptree? It'd get wider testing.

Currently the only user of reada is scrub. It would gain some precision
from this patch. The testing is a good argument, too.

-Arne

>
> david
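For illustration, the "pass a spinning lock and let the callback upgrade it" variant described above could report the lock state back through the callback's return value, so the hook knows which unlock variant to use. Everything here (the int return type, the reada_call_hook and example_callback names) is an assumption sketched against the interfaces visible in this series, not code from the posted patches.

	/* hypothetical: callbacks return nonzero if they upgraded the lock */
	typedef int (*reada_cb_t)(struct btrfs_root *root, struct reada_control *rc,
				  u64 wanted_generation, struct extent_buffer *eb,
				  u64 start, int err, struct btrfs_key *top,
				  void *ctx);

	static void reada_call_hook(struct reada_extctl *rec, struct btrfs_root *root,
				    struct extent_buffer *eb, u64 start, int err,
				    struct btrfs_key *top)
	{
		int upgraded;

		/* hand the callback a spinning read lock */
		btrfs_tree_read_lock(eb);
		upgraded = rec->rc->callback(root, rec->rc, rec->generation, eb,
					     start, err, top, rec->ctx);
		/* unlock with the variant matching what the callback left behind */
		if (upgraded)
			btrfs_tree_read_unlock_blocking(eb);
		else
			btrfs_tree_read_unlock(eb);
	}

	/* a callback that needs to sleep upgrades the lock itself */
	static int example_callback(struct btrfs_root *root, struct reada_control *rc,
				    u64 wanted_generation, struct extent_buffer *eb,
				    u64 start, int err, struct btrfs_key *top,
				    void *ctx)
	{
		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
		/* ... work that may sleep ... */
		return 1;
	}

This keeps the fast path spinning for callbacks that do no allocation, at the cost of making the lock state part of the callback contract.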