This patchset is against one of project ideas, RBtree lock contention: "Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads." The goal is to use RCU, but we take it as a long term one, and instead we use rwlock until we find a mature rcu structure for lockless read. So what we need to do is to make the code RCU friendly, and the idea mainly comes from Chris Mason: Quoted: "I think the extent_state code can be much more RCU friendly if we separate the operations on the tree from operations on the individual state. In general, we can gain a lot of performance if we are able to reduce the write locks taken at endio time. Especially for reads, these are critical." The patchset is also available in: git://github.com/liubogithub/btrfs-work.git rwlock-for-extent-state I''ve run through xfstests, and no bugs jump out by then. I made a simple test to show the difference on my box: $ cat 6_FIO/fio-4thread-4M-sync-read [global] group_reporting thread numjobs=4 bs=4M rw=read sync=0 ioengine=sync directory=/mnt/btrfs/ [READ] filename=foobar size=4000M *results:* w/o patch w patch READ bandwidth(aggrb) 849MB/s 971MB/s MORE TESTS ARE WELCOME! v1->v2: drop changes on invalidatepage() and rebase to the latest btrfs upstream. Liu Bo (4): Btrfs: use radix tree for checksum Btrfs: merge adjacent states as much as possible Btrfs: use large extent range for read and its endio Btrfs: apply rwlock for extent state fs/btrfs/extent_io.c | 712 +++++++++++++++++++++++++++++++++++++++----------- fs/btrfs/extent_io.h | 5 +- fs/btrfs/inode.c | 7 +- 3 files changed, 568 insertions(+), 156 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
We used to issue a checksum to an extent state of 4K range for read endio, but now we want to use larger range for performance optimization, so instead we create a radix tree for checksum, where an item stands for checksum of 4K data. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/extent_io.c | 84 ++++++++++++-------------------------------------- fs/btrfs/extent_io.h | 2 + fs/btrfs/inode.c | 7 +--- 3 files changed, 23 insertions(+), 70 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 2c8f7b2..2923ede 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -117,10 +117,12 @@ void extent_io_tree_init(struct extent_io_tree *tree, { tree->state = RB_ROOT; INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC); + INIT_RADIX_TREE(&tree->csum, GFP_ATOMIC); tree->ops = NULL; tree->dirty_bytes = 0; spin_lock_init(&tree->lock); spin_lock_init(&tree->buffer_lock); + spin_lock_init(&tree->csum_lock); tree->mapping = mapping; } @@ -703,15 +705,6 @@ static void cache_state(struct extent_state *state, } } -static void uncache_state(struct extent_state **cached_ptr) -{ - if (cached_ptr && (*cached_ptr)) { - struct extent_state *state = *cached_ptr; - *cached_ptr = NULL; - free_extent_state(state); - } -} - /* * set some bits on a range in the tree. This may require allocations or * sleeping, so the gfp mask is used to indicate what is allowed. @@ -1666,56 +1659,32 @@ out: */ int set_state_private(struct extent_io_tree *tree, u64 start, u64 private) { - struct rb_node *node; - struct extent_state *state; int ret = 0; - spin_lock(&tree->lock); - /* - * this search will find all the extents that end after - * our range starts. - */ - node = tree_search(tree, start); - if (!node) { - ret = -ENOENT; - goto out; - } - state = rb_entry(node, struct extent_state, rb_node); - if (state->start != start) { - ret = -ENOENT; - goto out; - } - state->private = private; -out: - spin_unlock(&tree->lock); + spin_lock(&tree->csum_lock); + ret = radix_tree_insert(&tree->csum, (unsigned long)start, + (void *)((unsigned long)private << 1)); + BUG_ON(ret); + spin_unlock(&tree->csum_lock); return ret; } int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private) { - struct rb_node *node; - struct extent_state *state; - int ret = 0; + void **slot = NULL; - spin_lock(&tree->lock); - /* - * this search will find all the extents that end after - * our range starts. - */ - node = tree_search(tree, start); - if (!node) { - ret = -ENOENT; - goto out; - } - state = rb_entry(node, struct extent_state, rb_node); - if (state->start != start) { - ret = -ENOENT; - goto out; + spin_lock(&tree->csum_lock); + slot = radix_tree_lookup_slot(&tree->csum, (unsigned long)start); + if (!slot) { + spin_unlock(&tree->csum_lock); + return -ENOENT; } - *private = state->private; -out: - spin_unlock(&tree->lock); - return ret; + *private = (u64)(*slot) >> 1; + + radix_tree_delete(&tree->csum, (unsigned long)start); + spin_unlock(&tree->csum_lock); + + return 0; } /* @@ -2294,7 +2263,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err) do { struct page *page = bvec->bv_page; struct extent_state *cached = NULL; - struct extent_state *state; pr_debug("end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, " "mirror=%ld\n", bio->bi_vcnt, bio->bi_idx, err, @@ -2313,21 +2281,10 @@ static void end_bio_extent_readpage(struct bio *bio, int err) if (++bvec <= bvec_end) prefetchw(&bvec->bv_page->flags); - spin_lock(&tree->lock); - state = find_first_extent_bit_state(tree, start, EXTENT_LOCKED); - if (state && state->start == start) { - /* - * take a reference on the state, unlock will drop - * the ref - */ - cache_state(state, &cached); - } - spin_unlock(&tree->lock); - mirror = (int)(unsigned long)bio->bi_bdev; if (uptodate && tree->ops && tree->ops->readpage_end_io_hook) { ret = tree->ops->readpage_end_io_hook(page, start, end, - state, mirror); + NULL, mirror); if (ret) { /* no IO indicated but software detected errors * in the block, either checksum errors or @@ -2369,7 +2326,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err) test_bit(BIO_UPTODATE, &bio->bi_flags); if (err) uptodate = 0; - uncache_state(&cached); continue; } } diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 25900af..c896962 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -96,11 +96,13 @@ struct extent_io_ops { struct extent_io_tree { struct rb_root state; struct radix_tree_root buffer; + struct radix_tree_root csum; struct address_space *mapping; u64 dirty_bytes; int track_uptodate; spinlock_t lock; spinlock_t buffer_lock; + spinlock_t csum_lock; struct extent_io_ops *ops; }; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f6ab6f5..da0da44 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2008,12 +2008,7 @@ static int btrfs_readpage_end_io_hook(struct page *page, u64 start, u64 end, return 0; } - if (state && state->start == start) { - private = state->private; - ret = 0; - } else { - ret = get_state_private(io_tree, start, &private); - } + ret = get_state_private(io_tree, start, &private); kaddr = kmap_atomic(page); if (ret) goto zeroit; -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
In order to reduce write locks, we do merge_state as much as much as possible. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/extent_io.c | 47 +++++++++++++++++++++++++++-------------------- 1 files changed, 27 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 2923ede..081fe13 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -276,29 +276,36 @@ static void merge_state(struct extent_io_tree *tree, if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) return; - other_node = rb_prev(&state->rb_node); - if (other_node) { + while (1) { + other_node = rb_prev(&state->rb_node); + if (!other_node) + break; other = rb_entry(other_node, struct extent_state, rb_node); - if (other->end == state->start - 1 && - other->state == state->state) { - merge_cb(tree, state, other); - state->start = other->start; - other->tree = NULL; - rb_erase(&other->rb_node, &tree->state); - free_extent_state(other); - } + if (other->end != state->start - 1 || + other->state != state->state) + break; + + merge_cb(tree, state, other); + state->start = other->start; + other->tree = NULL; + rb_erase(&other->rb_node, &tree->state); + free_extent_state(other); } - other_node = rb_next(&state->rb_node); - if (other_node) { + + while (1) { + other_node = rb_next(&state->rb_node); + if (!other_node) + break; other = rb_entry(other_node, struct extent_state, rb_node); - if (other->start == state->end + 1 && - other->state == state->state) { - merge_cb(tree, state, other); - state->end = other->end; - other->tree = NULL; - rb_erase(&other->rb_node, &tree->state); - free_extent_state(other); - } + if (other->start != state->end + 1 || + other->state != state->state) + break; + + merge_cb(tree, state, other); + state->end = other->end; + other->tree = NULL; + rb_erase(&other->rb_node, &tree->state); + free_extent_state(other); } } -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2012-Jun-13 10:19 UTC
[PATCH 3/4] Btrfs: use large extent range for read and its endio
we use larger extent state range for both readpages and read endio, so that we can lock or unlock less and avoid most of split ops, then we''ll reduce write locks taken at endio time. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/extent_io.c | 201 +++++++++++++++++++++++++++++++++++++++++++++----- 1 files changed, 182 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 081fe13..bb66e3c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2258,18 +2258,26 @@ static void end_bio_extent_readpage(struct bio *bio, int err) struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1; struct bio_vec *bvec = bio->bi_io_vec; struct extent_io_tree *tree; + struct extent_state *cached = NULL; u64 start; u64 end; int whole_page; int mirror; int ret; + u64 up_start, up_end, un_start, un_end; + int up_first, un_first; + int for_uptodate[bio->bi_vcnt]; + int i = 0; + + up_start = un_start = (u64)-1; + up_end = un_end = 0; + up_first = un_first = 1; if (err) uptodate = 0; do { struct page *page = bvec->bv_page; - struct extent_state *cached = NULL; pr_debug("end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, " "mirror=%ld\n", bio->bi_vcnt, bio->bi_idx, err, @@ -2280,11 +2288,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err) bvec->bv_offset; end = start + bvec->bv_len - 1; - if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE) - whole_page = 1; - else - whole_page = 0; - if (++bvec <= bvec_end) prefetchw(&bvec->bv_page->flags); @@ -2337,14 +2340,71 @@ static void end_bio_extent_readpage(struct bio *bio, int err) } } + if (uptodate) + for_uptodate[i++] = 1; + else + for_uptodate[i++] = 0; + if (uptodate && tree->track_uptodate) { - set_extent_uptodate(tree, start, end, &cached, - GFP_ATOMIC); + if (up_first) { + up_start = start; + up_end = end; + up_first = 0; + } else { + if (up_start == end + 1) { + up_start = start; + } else if (up_end == start - 1) { + up_end = end; + } else { + set_extent_uptodate( + tree, up_start, up_end, + &cached, GFP_ATOMIC); + up_start = start; + up_end = end; + } + } } - unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC); + + if (un_first) { + un_start = start; + un_end = end; + un_first = 0; + } else { + if (un_start == end + 1) { + un_start = start; + } else if (un_end == start - 1) { + un_end = end; + } else { + unlock_extent_cached(tree, un_start, un_end, + &cached, GFP_ATOMIC); + un_start = start; + un_end = end; + } + } + } while (bvec <= bvec_end); + + cached = NULL; + if (up_start < up_end) + set_extent_uptodate(tree, up_start, up_end, &cached, + GFP_ATOMIC); + if (un_start < un_end) + unlock_extent_cached(tree, un_start, un_end, &cached, + GFP_ATOMIC); + + i = 0; + bvec = bio->bi_io_vec; + do { + struct page *page = bvec->bv_page; + + tree = &BTRFS_I(page->mapping->host)->io_tree; + + if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE) + whole_page = 1; + else + whole_page = 0; if (whole_page) { - if (uptodate) { + if (for_uptodate[i++]) { SetPageUptodate(page); } else { ClearPageUptodate(page); @@ -2352,7 +2412,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) } unlock_page(page); } else { - if (uptodate) { + if (for_uptodate[i++]) { check_page_uptodate(tree, page); } else { ClearPageUptodate(page); @@ -2360,6 +2420,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) } check_page_locked(tree, page); } + ++bvec; } while (bvec <= bvec_end); bio_put(bio); @@ -2520,7 +2581,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree, struct page *page, get_extent_t *get_extent, struct bio **bio, int mirror_num, - unsigned long *bio_flags) + unsigned long *bio_flags, int range_lock) { struct inode *inode = page->mapping->host; u64 start = (u64)page->index << PAGE_CACHE_SHIFT; @@ -2554,6 +2615,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree, end = page_end; while (1) { + if (range_lock) + break; lock_extent(tree, start, end); ordered = btrfs_lookup_ordered_extent(inode, start); if (!ordered) @@ -2703,7 +2766,7 @@ int extent_read_full_page(struct extent_io_tree *tree, struct page *page, int ret; ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num, - &bio_flags); + &bio_flags, 0); if (bio) ret = submit_one_bio(READ, bio, mirror_num, bio_flags); return ret; @@ -3497,6 +3560,59 @@ int extent_writepages(struct extent_io_tree *tree, return ret; } +struct page_list { + struct page *page; + struct list_head list; +}; + +static int process_batch_pages(struct extent_io_tree *tree, + struct address_space *mapping, + struct list_head *lock_pages, int *page_cnt, + u64 lock_start, u64 lock_end, + get_extent_t get_extent, struct bio **bio, + unsigned long *bio_flags) +{ + u64 page_start; + struct page_list *plist; + + while (1) { + struct btrfs_ordered_extent *ordered = NULL; + + lock_extent(tree, lock_start, lock_end); + page_start = lock_start; + while (page_start < lock_end) { + ordered = btrfs_lookup_ordered_extent(mapping->host, + page_start); + if (ordered) { + page_start = ordered->file_offset; + break; + } + page_start += PAGE_CACHE_SIZE; + } + if (!ordered) + break; + unlock_extent(tree, lock_start, lock_end); + btrfs_start_ordered_extent(mapping->host, ordered, 1); + btrfs_put_ordered_extent(ordered); + } + + plist = NULL; + while (!list_empty(lock_pages)) { + plist = list_entry(lock_pages->prev, struct page_list, list); + + __extent_read_full_page(tree, plist->page, get_extent, + bio, 0, bio_flags, 1); + page_cache_release(plist->page); + list_del(&plist->list); + plist->page = NULL; + kfree(plist); + (*page_cnt)--; + } + + WARN_ON((*page_cnt)); + return 0; +} + int extent_readpages(struct extent_io_tree *tree, struct address_space *mapping, struct list_head *pages, unsigned nr_pages, @@ -3505,7 +3621,17 @@ int extent_readpages(struct extent_io_tree *tree, struct bio *bio = NULL; unsigned page_idx; unsigned long bio_flags = 0; - + u64 page_start; + u64 page_end; + u64 lock_start = (u64)-1; + u64 lock_end = 0; + struct page_list *plist; + int page_cnt = 0; + LIST_HEAD(lock_pages); + int first = 1; + + lock_start = (u64)-1; + lock_end = 0; for (page_idx = 0; page_idx < nr_pages; page_idx++) { struct page *page = list_entry(pages->prev, struct page, lru); @@ -3513,12 +3639,49 @@ int extent_readpages(struct extent_io_tree *tree, list_del(&page->lru); if (!add_to_page_cache_lru(page, mapping, page->index, GFP_NOFS)) { - __extent_read_full_page(tree, page, get_extent, - &bio, 0, &bio_flags); + page_start = (u64)page_offset(page); + page_end = page_start + PAGE_CACHE_SIZE - 1; + + if (first) { + lock_start = page_start; + lock_end = page_end; + first = 0; + } else { + /* + * |--lock range--||--page range--| + * or + * |--page range--||--lock range--| + */ + if (lock_start != page_end - 1 && + lock_end != page_start - 1) { + process_batch_pages(tree, mapping, + &lock_pages, &page_cnt, + lock_start, lock_end, + get_extent, &bio, &bio_flags); + + lock_start = page_start; + lock_end = page_end; + } else { + lock_start + min(lock_start, page_start); + lock_end = max(lock_end, page_end); + } + } + plist = kmalloc(sizeof(*plist), GFP_NOFS); + BUG_ON(!plist); + plist->page = page; + list_add(&plist->list, &lock_pages); + page_cache_get(page); + page_cnt++; } page_cache_release(page); } + + if (!list_empty(&lock_pages)) + process_batch_pages(tree, mapping, &lock_pages, &page_cnt, + lock_start, lock_end, get_extent, &bio, &bio_flags); BUG_ON(!list_empty(pages)); + if (bio) return submit_one_bio(READ, bio, 0, bio_flags); return 0; @@ -4493,9 +4656,9 @@ int read_extent_buffer_pages(struct extent_io_tree *tree, page = extent_buffer_page(eb, i); if (!PageUptodate(page)) { ClearPageError(page); - err = __extent_read_full_page(tree, page, - get_extent, &bio, - mirror_num, &bio_flags); + err = __extent_read_full_page( + tree, page, get_extent, &bio, + mirror_num, &bio_flags, 0); if (err) ret = err; } else { -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
We used to protect both extent state tree and an individual state''s state by tree->lock, but this can be an obstacle of lockless read. So we seperate them here: o tree->lock protects the tree o state->lock protects the state. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/extent_io.c | 380 ++++++++++++++++++++++++++++++++++++++++++++------ fs/btrfs/extent_io.h | 3 +- 2 files changed, 336 insertions(+), 47 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index bb66e3c..4c6b743 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -27,7 +27,7 @@ static struct kmem_cache *extent_buffer_cache; static LIST_HEAD(buffers); static LIST_HEAD(states); -#define LEAK_DEBUG 0 +#define LEAK_DEBUG 1 #if LEAK_DEBUG static DEFINE_SPINLOCK(leak_lock); #endif @@ -120,7 +120,7 @@ void extent_io_tree_init(struct extent_io_tree *tree, INIT_RADIX_TREE(&tree->csum, GFP_ATOMIC); tree->ops = NULL; tree->dirty_bytes = 0; - spin_lock_init(&tree->lock); + rwlock_init(&tree->lock); spin_lock_init(&tree->buffer_lock); spin_lock_init(&tree->csum_lock); tree->mapping = mapping; @@ -146,6 +146,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask) #endif atomic_set(&state->refs, 1); init_waitqueue_head(&state->wq); + spin_lock_init(&state->lock); trace_alloc_extent_state(state, mask, _RET_IP_); return state; } @@ -281,6 +282,7 @@ static void merge_state(struct extent_io_tree *tree, if (!other_node) break; other = rb_entry(other_node, struct extent_state, rb_node); + /* FIXME: need other->lock? */ if (other->end != state->start - 1 || other->state != state->state) break; @@ -297,6 +299,7 @@ static void merge_state(struct extent_io_tree *tree, if (!other_node) break; other = rb_entry(other_node, struct extent_state, rb_node); + /* FIXME: need other->lock? */ if (other->start != state->end + 1 || other->state != state->state) break; @@ -364,7 +367,10 @@ static int insert_state(struct extent_io_tree *tree, return -EEXIST; } state->tree = tree; + + spin_lock(&state->lock); merge_state(tree, state); + spin_unlock(&state->lock); return 0; } @@ -410,6 +416,23 @@ static int split_state(struct extent_io_tree *tree, struct extent_state *orig, return 0; } +static struct extent_state * +alloc_extent_state_atomic(struct extent_state *prealloc) +{ + if (!prealloc) + prealloc = alloc_extent_state(GFP_ATOMIC); + + return prealloc; +} + +enum extent_lock_type { + EXTENT_READ = 0, + EXTENT_WRITE = 1, + EXTENT_RLOCKED = 2, + EXTENT_WLOCKED = 3, + EXTENT_LAST = 4, +}; + static struct extent_state *next_state(struct extent_state *state) { struct rb_node *next = rb_next(&state->rb_node); @@ -426,13 +449,17 @@ static struct extent_state *next_state(struct extent_state *state) * If no bits are set on the state struct after clearing things, the * struct is freed and removed from the tree */ -static struct extent_state *clear_state_bit(struct extent_io_tree *tree, - struct extent_state *state, - int *bits, int wake) +static int __clear_state_bit(struct extent_io_tree *tree, + struct extent_state *state, + int *bits, int wake, int check) { - struct extent_state *next; int bits_to_clear = *bits & ~EXTENT_CTLBITS; + if (check) { + if ((state->state & ~bits_to_clear) == 0) + return 1; + } + if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) { u64 range = state->end - state->start + 1; WARN_ON(range > tree->dirty_bytes); @@ -442,7 +469,17 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree, state->state &= ~bits_to_clear; if (wake) wake_up(&state->wq); + return 0; +} + +static struct extent_state * +try_free_or_merge_state(struct extent_io_tree *tree, struct extent_state *state) +{ + struct extent_state *next = NULL; + + BUG_ON(!spin_is_locked(&state->lock)); if (state->state == 0) { + spin_unlock(&state->lock); next = next_state(state); if (state->tree) { rb_erase(&state->rb_node, &tree->state); @@ -453,18 +490,17 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree, } } else { merge_state(tree, state); + spin_unlock(&state->lock); next = next_state(state); } return next; } -static struct extent_state * -alloc_extent_state_atomic(struct extent_state *prealloc) +static struct extent_state *clear_state_bit(struct extent_io_tree *tree, + struct extent_state *state, int *bits, int wake) { - if (!prealloc) - prealloc = alloc_extent_state(GFP_ATOMIC); - - return prealloc; + __clear_state_bit(tree, state, bits, wake, 0); + return try_free_or_merge_state(tree, state); } void extent_io_tree_panic(struct extent_io_tree *tree, int err) @@ -474,6 +510,97 @@ void extent_io_tree_panic(struct extent_io_tree *tree, int err) "thread while locked."); } +static int test_merge_state(struct extent_io_tree *tree, + struct extent_state *state) +{ + struct extent_state *other; + struct rb_node *other_node; + + if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) + return 0; + + other_node = rb_prev(&state->rb_node); + if (other_node) { + other = rb_entry(other_node, struct extent_state, rb_node); + if (other->end == state->start - 1 && + other->state == state->state) + return 1; + } + other_node = rb_next(&state->rb_node); + if (other_node) { + other = rb_entry(other_node, struct extent_state, rb_node); + if (other->start == state->end + 1 && + other->state == state->state) + return 1; + } + + return 0; +} + +static void process_merge_state(struct extent_io_tree *tree, u64 start) +{ + struct extent_state *state = NULL; + struct rb_node *node = NULL; + + if (!tree || start == (u64)-1) { + WARN_ON(1); + return; + } + + write_lock(&tree->lock); + node = tree_search(tree, start); + if (!node) { + printk(KERN_INFO "write side: not find states" + " to merge %llu\n", start); + goto out; + } + state = rb_entry(node, struct extent_state, rb_node); + /* should merge all states around this one */ + spin_lock(&state->lock); + merge_state(tree, state); + spin_unlock(&state->lock); +out: + write_unlock(&tree->lock); +} + +static void extent_rw_lock(struct extent_io_tree *tree, int *rw) +{ + int lock = *rw; + + if (lock == EXTENT_READ) { + read_lock(&tree->lock); + *rw = EXTENT_RLOCKED; + } else if (lock == EXTENT_WRITE) { + write_lock(&tree->lock); + *rw = EXTENT_WLOCKED; + } else { + WARN_ON(1); + } +} + +static void extent_rw_unlock(struct extent_io_tree *tree, int *rw) +{ + int lock = *rw; + + if (lock == EXTENT_RLOCKED) + read_unlock(&tree->lock); + if (lock == EXTENT_WLOCKED) + write_unlock(&tree->lock); + *rw = EXTENT_READ; +} + +static int extent_rw_flip(struct extent_io_tree *tree, int *rw) +{ + int lock = *rw; + + if (lock == EXTENT_RLOCKED) { + read_unlock(&tree->lock); + *rw = EXTENT_WRITE; + return 1; + } + return 0; +} + /* * clear some bits on a range in the tree. This may require splitting * or inserting elements in the tree, so the gfp mask is used to @@ -496,8 +623,13 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state *prealloc = NULL; struct rb_node *node; u64 last_end; + u64 orig_start = start; int err; int clear = 0; + int rw = EXTENT_READ; + int free = 0; + int merge = 0; + int check = 0; if (delete) bits |= ~EXTENT_CTLBITS; @@ -512,7 +644,8 @@ again: return -ENOMEM; } - spin_lock(&tree->lock); + /* XXX: after this we''re EXTENT_RLOCKED/EXTENT_WLOCKED */ + extent_rw_lock(tree, &rw); if (cached_state) { cached = *cached_state; @@ -545,8 +678,10 @@ hit_next: WARN_ON(state->end < start); last_end = state->end; + spin_lock(&state->lock); /* the state doesn''t have the wanted bits, go ahead */ if (!(state->state & bits)) { + spin_unlock(&state->lock); state = next_state(state); goto next; } @@ -568,6 +703,11 @@ hit_next: */ if (state->start < start) { + /* split needs a write lock */ + if (extent_rw_flip(tree, &rw)) { + spin_unlock(&state->lock); + goto again; + } prealloc = alloc_extent_state_atomic(prealloc); BUG_ON(!prealloc); err = split_state(tree, state, prealloc, start); @@ -575,11 +715,15 @@ hit_next: extent_io_tree_panic(tree, err); prealloc = NULL; - if (err) + if (err) { + spin_unlock(&state->lock); goto out; + } if (state->end <= end) { state = clear_state_bit(tree, state, &bits, wake); goto next; + } else { + spin_unlock(&state->lock); } goto search_again; } @@ -590,22 +734,44 @@ hit_next: * on the first half */ if (state->start <= end && state->end > end) { + /* split needs a write lock */ + if (extent_rw_flip(tree, &rw)) { + spin_unlock(&state->lock); + goto again; + } prealloc = alloc_extent_state_atomic(prealloc); BUG_ON(!prealloc); err = split_state(tree, state, prealloc, end + 1); if (err) extent_io_tree_panic(tree, err); + spin_unlock(&state->lock); if (wake) wake_up(&state->wq); + spin_lock(&prealloc->lock); clear_state_bit(tree, prealloc, &bits, wake); prealloc = NULL; goto out; } - state = clear_state_bit(tree, state, &bits, wake); + check = (rw == EXTENT_RLOCKED) ? 1 : 0; + free = __clear_state_bit(tree, state, &bits, wake, check); + if (free && rw == EXTENT_RLOCKED) { + /* this one will be freed, so it needs a write lock */ + spin_unlock(&state->lock); + extent_rw_flip(tree, &rw); + goto again; + } + if (rw == EXTENT_RLOCKED) { + merge = test_merge_state(tree, state); + spin_unlock(&state->lock); + state = next_state(state); + } else { + /* this one will unlock state->lock for us */ + state = try_free_or_merge_state(tree, state); + } next: if (last_end == (u64)-1) goto out; @@ -615,16 +781,18 @@ next: goto search_again; out: - spin_unlock(&tree->lock); + extent_rw_unlock(tree, &rw); if (prealloc) free_extent_state(prealloc); + if (merge) + process_merge_state(tree, orig_start); return 0; search_again: if (start > end) goto out; - spin_unlock(&tree->lock); + extent_rw_unlock(tree, &rw); if (mask & __GFP_WAIT) cond_resched(); goto again; @@ -637,9 +805,9 @@ static void wait_on_state(struct extent_io_tree *tree, { DEFINE_WAIT(wait); prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE); - spin_unlock(&tree->lock); + read_unlock(&tree->lock); schedule(); - spin_lock(&tree->lock); + read_lock(&tree->lock); finish_wait(&state->wq, &wait); } @@ -653,7 +821,7 @@ void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int bits) struct extent_state *state; struct rb_node *node; - spin_lock(&tree->lock); + read_lock(&tree->lock); again: while (1) { /* @@ -669,22 +837,27 @@ again: if (state->start > end) goto out; + spin_lock(&state->lock); if (state->state & bits) { + spin_unlock(&state->lock); start = state->start; atomic_inc(&state->refs); wait_on_state(tree, state); free_extent_state(state); goto again; } + spin_unlock(&state->lock); start = state->end + 1; if (start > end) break; - cond_resched_lock(&tree->lock); + read_unlock(&tree->lock); + cond_resched(); + read_lock(&tree->lock); } out: - spin_unlock(&tree->lock); + read_unlock(&tree->lock); } static void set_state_bits(struct extent_io_tree *tree, @@ -734,6 +907,9 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int err = 0; u64 last_start; u64 last_end; + u64 orig_start = start; + int rw = EXTENT_READ; + int merge = 0; bits |= EXTENT_FIRST_DELALLOC; again: @@ -742,7 +918,8 @@ again: BUG_ON(!prealloc); } - spin_lock(&tree->lock); + /* XXX: after this we''re EXTENT_RLOCKED/EXTENT_WLOCKED */ + extent_rw_lock(tree, &rw); if (cached_state && *cached_state) { state = *cached_state; if (state->start <= start && state->end > start && @@ -757,6 +934,9 @@ again: */ node = tree_search(tree, start); if (!node) { + /* XXX: insert need a write lock */ + if (extent_rw_flip(tree, &rw)) + goto again; prealloc = alloc_extent_state_atomic(prealloc); BUG_ON(!prealloc); err = insert_state(tree, prealloc, start, end, &bits); @@ -771,6 +951,7 @@ hit_next: last_start = state->start; last_end = state->end; + spin_lock(&state->lock); /* * | ---- desired range ---- | * | state | @@ -779,6 +960,7 @@ hit_next: */ if (state->start == start && state->end <= end) { if (state->state & exclusive_bits) { + spin_unlock(&state->lock); *failed_start = state->start; err = -EEXIST; goto out; @@ -786,7 +968,13 @@ hit_next: set_state_bits(tree, state, &bits); cache_state(state, cached_state); - merge_state(tree, state); + /* XXX */ + if (rw == EXTENT_RLOCKED) + merge = test_merge_state(tree, state); + else + merge_state(tree, state); + spin_unlock(&state->lock); + if (last_end == (u64)-1) goto out; start = last_end + 1; @@ -815,11 +1003,18 @@ hit_next: */ if (state->start < start) { if (state->state & exclusive_bits) { + spin_unlock(&state->lock); *failed_start = start; err = -EEXIST; goto out; } + /* XXX: split needs a write lock */ + if (extent_rw_flip(tree, &rw)) { + spin_unlock(&state->lock); + goto again; + } + prealloc = alloc_extent_state_atomic(prealloc); BUG_ON(!prealloc); err = split_state(tree, state, prealloc, start); @@ -827,12 +1022,15 @@ hit_next: extent_io_tree_panic(tree, err); prealloc = NULL; - if (err) + if (err) { + spin_unlock(&state->lock); goto out; + } if (state->end <= end) { set_state_bits(tree, state, &bits); cache_state(state, cached_state); merge_state(tree, state); + spin_unlock(&state->lock); if (last_end == (u64)-1) goto out; start = last_end + 1; @@ -840,6 +1038,8 @@ hit_next: if (start < end && state && state->start == start && !need_resched()) goto hit_next; + } else { + spin_unlock(&state->lock); } goto search_again; } @@ -852,6 +1052,12 @@ hit_next: */ if (state->start > start) { u64 this_end; + + spin_unlock(&state->lock); + /* XXX: split need a write lock */ + if (extent_rw_flip(tree, &rw)) + goto again; + if (end < last_start) this_end = end; else @@ -869,7 +1075,9 @@ hit_next: if (err) extent_io_tree_panic(tree, err); + spin_lock(&prealloc->lock); cache_state(prealloc, cached_state); + spin_unlock(&prealloc->lock); prealloc = NULL; start = this_end + 1; goto search_again; @@ -882,20 +1090,30 @@ hit_next: */ if (state->start <= end && state->end > end) { if (state->state & exclusive_bits) { + spin_unlock(&state->lock); *failed_start = start; err = -EEXIST; goto out; } + /* XXX: split need a write lock */ + if (extent_rw_flip(tree, &rw)) { + spin_unlock(&state->lock); + goto again; + } + prealloc = alloc_extent_state_atomic(prealloc); BUG_ON(!prealloc); err = split_state(tree, state, prealloc, end + 1); if (err) extent_io_tree_panic(tree, err); + spin_unlock(&state->lock); + spin_lock(&prealloc->lock); set_state_bits(tree, prealloc, &bits); cache_state(prealloc, cached_state); merge_state(tree, prealloc); + spin_unlock(&prealloc->lock); prealloc = NULL; goto out; } @@ -903,16 +1121,18 @@ hit_next: goto search_again; out: - spin_unlock(&tree->lock); + extent_rw_unlock(tree, &rw); if (prealloc) free_extent_state(prealloc); + if (merge) + process_merge_state(tree, orig_start); return err; search_again: if (start > end) goto out; - spin_unlock(&tree->lock); + extent_rw_unlock(tree, &rw); if (mask & __GFP_WAIT) cond_resched(); goto again; @@ -951,6 +1171,9 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int err = 0; u64 last_start; u64 last_end; + u64 orig_start = start; + int rw = EXTENT_READ; + int merge = 0; again: if (!prealloc && (mask & __GFP_WAIT)) { @@ -959,13 +1182,18 @@ again: return -ENOMEM; } - spin_lock(&tree->lock); + /* XXX: after this we''re EXTENT_RLOCKED/EXTENT_WLOCKED */ + extent_rw_lock(tree, &rw); /* * this search will find all the extents that end after * our range starts. */ node = tree_search(tree, start); if (!node) { + /* XXX: insert need a write lock */ + if (extent_rw_flip(tree, &rw)) + goto again; + prealloc = alloc_extent_state_atomic(prealloc); if (!prealloc) { err = -ENOMEM; @@ -982,6 +1210,7 @@ hit_next: last_start = state->start; last_end = state->end; + spin_lock(&state->lock); /* * | ---- desired range ---- | * | state | @@ -990,16 +1219,26 @@ hit_next: */ if (state->start == start && state->end <= end) { set_state_bits(tree, state, &bits); - state = clear_state_bit(tree, state, &clear_bits, 0); + __clear_state_bit(tree, state, &clear_bits, 0, 0); + if (rw == EXTENT_LOCKED) + merge = test_merge_state(tree, state); + else + merge_state(tree, state); + spin_unlock(&state->lock); + if (last_end == (u64)-1) goto out; + start = last_end + 1; + state = next_state(state); if (start < end && state && state->start == start && !need_resched()) goto hit_next; goto search_again; } + WARN_ON(1); + /* * | ---- desired range ---- | * | state | @@ -1017,8 +1256,15 @@ hit_next: * desired bit on it. */ if (state->start < start) { + /* XXX: split need a write lock */ + if (extent_rw_flip(tree, &rw)) { + spin_unlock(&state->lock); + goto again; + } + prealloc = alloc_extent_state_atomic(prealloc); if (!prealloc) { + spin_unlock(&state->lock); err = -ENOMEM; goto out; } @@ -1026,8 +1272,10 @@ hit_next: if (err) extent_io_tree_panic(tree, err); prealloc = NULL; - if (err) + if (err) { + spin_unlock(&state->lock); goto out; + } if (state->end <= end) { set_state_bits(tree, state, &bits); state = clear_state_bit(tree, state, &clear_bits, 0); @@ -1037,6 +1285,8 @@ hit_next: if (start < end && state && state->start == start && !need_resched()) goto hit_next; + } else { + spin_unlock(&state->lock); } goto search_again; } @@ -1049,11 +1299,17 @@ hit_next: */ if (state->start > start) { u64 this_end; + + spin_unlock(&state->lock); if (end < last_start) this_end = end; else this_end = last_start - 1; + /* XXX: insert need a write lock */ + if (extent_rw_flip(tree, &rw)) + goto again; + prealloc = alloc_extent_state_atomic(prealloc); if (!prealloc) { err = -ENOMEM; @@ -1079,6 +1335,10 @@ hit_next: * on the first half */ if (state->start <= end && state->end > end) { + /* XXX: split need a write lock */ + if (extent_rw_flip(tree, &rw)) + goto again; + prealloc = alloc_extent_state_atomic(prealloc); if (!prealloc) { err = -ENOMEM; @@ -1089,7 +1349,11 @@ hit_next: if (err) extent_io_tree_panic(tree, err); + spin_unlock(&state->lock); + spin_lock(&prealloc->lock); + set_state_bits(tree, prealloc, &bits); + /* will unlock prealloc lock for us */ clear_state_bit(tree, prealloc, &clear_bits, 0); prealloc = NULL; goto out; @@ -1098,16 +1362,20 @@ hit_next: goto search_again; out: - spin_unlock(&tree->lock); + /* XXX */ + extent_rw_unlock(tree, &rw); if (prealloc) free_extent_state(prealloc); + if (merge) + process_merge_state(tree, orig_start); return err; search_again: if (start > end) goto out; - spin_unlock(&tree->lock); + /* XXX */ + extent_rw_unlock(tree, &rw); if (mask & __GFP_WAIT) cond_resched(); goto again; @@ -1267,8 +1535,12 @@ struct extent_state *find_first_extent_bit_state(struct extent_io_tree *tree, while (1) { state = rb_entry(node, struct extent_state, rb_node); - if (state->end >= start && (state->state & bits)) + spin_lock(&state->lock); + if (state->end >= start && (state->state & bits)) { + spin_unlock(&state->lock); return state; + } + spin_unlock(&state->lock); node = rb_next(node); if (!node) @@ -1291,14 +1563,14 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start, struct extent_state *state; int ret = 1; - spin_lock(&tree->lock); + read_lock(&tree->lock); state = find_first_extent_bit_state(tree, start, bits); if (state) { *start_ret = state->start; *end_ret = state->end; ret = 0; } - spin_unlock(&tree->lock); + read_unlock(&tree->lock); return ret; } @@ -1318,7 +1590,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree, u64 found = 0; u64 total_bytes = 0; - spin_lock(&tree->lock); + read_lock(&tree->lock); /* * this search will find all the extents that end after @@ -1333,15 +1605,20 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree, while (1) { state = rb_entry(node, struct extent_state, rb_node); + spin_lock(&state->lock); if (found && (state->start != cur_start || (state->state & EXTENT_BOUNDARY))) { + spin_unlock(&state->lock); goto out; } if (!(state->state & EXTENT_DELALLOC)) { + spin_unlock(&state->lock); if (!found) *end = state->end; goto out; } + spin_unlock(&state->lock); + if (!found) { *start = state->start; *cached_state = state; @@ -1358,7 +1635,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree, break; } out: - spin_unlock(&tree->lock); + read_unlock(&tree->lock); return found; } @@ -1619,7 +1896,7 @@ u64 count_range_bits(struct extent_io_tree *tree, return 0; } - spin_lock(&tree->lock); + read_lock(&tree->lock); if (cur_start == 0 && bits == EXTENT_DIRTY) { total_bytes = tree->dirty_bytes; goto out; @@ -1638,7 +1915,9 @@ u64 count_range_bits(struct extent_io_tree *tree, break; if (contig && found && state->start > last + 1) break; + spin_lock(&state->lock); if (state->end >= cur_start && (state->state & bits) == bits) { + spin_unlock(&state->lock); total_bytes += min(search_end, state->end) + 1 - max(cur_start, state->start); if (total_bytes >= max_bytes) @@ -1649,14 +1928,18 @@ u64 count_range_bits(struct extent_io_tree *tree, } last = state->end; } else if (contig && found) { + spin_unlock(&state->lock); break; + } else { + spin_unlock(&state->lock); } + node = rb_next(node); if (!node) break; } out: - spin_unlock(&tree->lock); + read_unlock(&tree->lock); return total_bytes; } @@ -1707,7 +1990,7 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, struct rb_node *node; int bitset = 0; - spin_lock(&tree->lock); + read_lock(&tree->lock); if (cached && cached->tree && cached->start <= start && cached->end > start) node = &cached->rb_node; @@ -1724,13 +2007,18 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, if (state->start > end) break; + spin_lock(&state->lock); if (state->state & bits) { + spin_unlock(&state->lock); bitset = 1; if (!filled) break; } else if (filled) { + spin_unlock(&state->lock); bitset = 0; break; + } else { + spin_unlock(&state->lock); } if (state->end == (u64)-1) @@ -1746,7 +2034,7 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, break; } } - spin_unlock(&tree->lock); + read_unlock(&tree->lock); return bitset; } @@ -1959,11 +2247,11 @@ static int clean_io_failure(u64 start, struct page *page) goto out; } - spin_lock(&BTRFS_I(inode)->io_tree.lock); + read_lock(&BTRFS_I(inode)->io_tree.lock); state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree, failrec->start, EXTENT_LOCKED); - spin_unlock(&BTRFS_I(inode)->io_tree.lock); + read_unlock(&BTRFS_I(inode)->io_tree.lock); if (state && state->start == failrec->start) { map_tree = &BTRFS_I(inode)->root->fs_info->mapping_tree; @@ -2097,12 +2385,12 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page, } if (!state) { - spin_lock(&tree->lock); + read_lock(&tree->lock); state = find_first_extent_bit_state(tree, failrec->start, EXTENT_LOCKED); if (state && state->start != failrec->start) state = NULL; - spin_unlock(&tree->lock); + read_unlock(&tree->lock); } /* @@ -2701,7 +2989,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree, set_extent_uptodate(tree, cur, cur + iosize - 1, &cached, GFP_NOFS); unlock_extent_cached(tree, cur, cur + iosize - 1, - &cached, GFP_NOFS); + &cached, GFP_NOFS); cur = cur + iosize; pg_offset += iosize; continue; diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index c896962..7a54d9f 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -100,7 +100,7 @@ struct extent_io_tree { struct address_space *mapping; u64 dirty_bytes; int track_uptodate; - spinlock_t lock; + rwlock_t lock; spinlock_t buffer_lock; spinlock_t csum_lock; struct extent_io_ops *ops; @@ -116,6 +116,7 @@ struct extent_state { wait_queue_head_t wq; atomic_t refs; unsigned long state; + spinlock_t lock; /* for use by the FS */ u64 private; -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> int set_state_private(struct extent_io_tree *tree, u64 start, u64 private) > {[...]> + ret = radix_tree_insert(&tree->csum, (unsigned long)start, > + (void *)((unsigned long)private<< 1));Will this fail for 64bit files on 32bit hosts?> + BUG_ON(ret);I wonder if we can patch BUG_ON() to break the build if its only argument is "ret". - z -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/14/2012 12:07 AM, Zach Brown wrote:> >> int set_state_private(struct extent_io_tree *tree, u64 start, u64 >> private) >> { > [...] >> + ret = radix_tree_insert(&tree->csum, (unsigned long)start, >> + (void *)((unsigned long)private<< 1)); > > Will this fail for 64bit files on 32bit hosts?In theory it will fail, but crc32c return u32, so private will be originally u32, and it''d be ok on 32bit hosts.> >> + BUG_ON(ret); > > I wonder if we can patch BUG_ON() to break the build if its only > argument is "ret". >why? thanks, liubo> - z > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 13, 2012 at 06:19:08PM +0800, Liu Bo wrote:> We used to issue a checksum to an extent state of 4K range for read endio, > but now we want to use larger range for performance optimization, so instead we > create a radix tree for checksum, where an item stands for checksum of 4K data. > > Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> > --- > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h > index 25900af..c896962 100644 > --- a/fs/btrfs/extent_io.h > +++ b/fs/btrfs/extent_io.h > @@ -96,11 +96,13 @@ struct extent_io_ops { > struct extent_io_tree { > struct rb_root state; > struct radix_tree_root buffer; > + struct radix_tree_root csum; > struct address_space *mapping; > u64 dirty_bytes; > int track_uptodate; > spinlock_t lock; > spinlock_t buffer_lock; > + spinlock_t csum_lock;I''m afraid 3 spinlocks sharing the same cacheline will bite and hurt performance. struct extent_io_tree { struct rb_root state; /* 0 8 */ struct radix_tree_root buffer; /* 8 16 */ struct radix_tree_root csum; /* 24 16 */ struct address_space * mapping; /* 40 8 */ u64 dirty_bytes; /* 48 8 */ int track_uptodate; /* 56 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ rwlock_t lock; /* 64 8 */ spinlock_t buffer_lock; /* 72 4 */ spinlock_t csum_lock; /* 76 4 */ struct extent_io_ops * ops; /* 80 8 */ /* size: 88, cachelines: 2, members: 10 */ /* sum members: 84, holes: 1, sum holes: 4 */ /* last cacheline: 24 bytes */ }; so we have to live with 2 cachenlines, maybe it''ll be enough to split the structure to 2 sections depending on the access paterns. Definetely for later, but I think it should be noted as the patchset tries to improve performance.> struct extent_io_ops *ops; > };david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Sterba
2012-Jun-14 16:12 UTC
Re: [PATCH 3/4] Btrfs: use large extent range for read and its endio
On Wed, Jun 13, 2012 at 06:19:10PM +0800, Liu Bo wrote:> we use larger extent state range for both readpages and read endio, so that > we can lock or unlock less and avoid most of split ops, then we''ll reduce write > locks taken at endio time. > > Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> > --- > fs/btrfs/extent_io.c | 201 +++++++++++++++++++++++++++++++++++++++++++++----- > 1 files changed, 182 insertions(+), 19 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 081fe13..bb66e3c 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -2258,18 +2258,26 @@ static void end_bio_extent_readpage(struct bio *bio, int err) > struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1; > struct bio_vec *bvec = bio->bi_io_vec; > struct extent_io_tree *tree; > + struct extent_state *cached = NULL; > u64 start; > u64 end; > int whole_page; > int mirror; > int ret; > + u64 up_start, up_end, un_start, un_end; > + int up_first, un_first; > + int for_uptodate[bio->bi_vcnt];The array size depends on an argument, compiler will handle it by dynamically expanding the stackframe. What is the expected maximum value for bi_vcnt ? There''s no hard limit AFAICS, the value gets incremented on each page in the bio. If the function is called from the worker thread, it''s not much of a problem even for values like 128. Alternate way to avoid over-limit stack consumption is to declare an array to hold a fair number of items (eg. 16), and fall back to kmalloc otherwise.> + int i = 0; > + > + up_start = un_start = (u64)-1; > + up_end = un_end = 0; > + up_first = un_first = 1; > > if (err) > uptodate = 0; > > do { > struct page *page = bvec->bv_page; > - struct extent_state *cached = NULL; > > pr_debug("end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, " > "mirror=%ld\n", bio->bi_vcnt, bio->bi_idx, err, > @@ -2352,7 +2412,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) > } > unlock_page(page); > } else { > - if (uptodate) { > + if (for_uptodate[i++]) { > check_page_uptodate(tree, page); > } else { > clearpageuptodate(page); > @@ -2360,6 +2420,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) > } > check_page_locked(tree, page); > } > + ++bvec;can you please move the i++ increments here? From reading the code it''s not clear on first sight if the sideeffects are ok.> } while (bvec <= bvec_end); > > bio_put(bio); > @@ -2554,6 +2615,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree, > > end = page_end; > while (1) { > + if (range_lock) > + break;with range_lock set, this is equivalent to calling 2896 set_page_extent_mapped(page); (plus the cleancache code), a few lines above. Is this inteded? It''s actually called with ''1'' from a single place, process_batch_pages().> lock_extent(tree, start, end); > ordered = btrfs_lookup_ordered_extent(inode, start); > if (!ordered) > @@ -3497,6 +3560,59 @@ int extent_writepages(struct extent_io_tree *tree, > return ret; > } > > +struct page_list { > + struct page *page; > + struct list_head list; > +}; > + > +static int process_batch_pages(struct extent_io_tree *tree, > + struct address_space *mapping, > + struct list_head *lock_pages, int *page_cnt, > + u64 lock_start, u64 lock_end, > + get_extent_t get_extent, struct bio **bio, > + unsigned long *bio_flags) > +{ > + u64 page_start; > + struct page_list *plist; > + > + while (1) { > + struct btrfs_ordered_extent *ordered = NULL; > + > + lock_extent(tree, lock_start, lock_end); > + page_start = lock_start; > + while (page_start < lock_end) { > + ordered = btrfs_lookup_ordered_extent(mapping->host, > + page_start); > + if (ordered) { > + page_start = ordered->file_offset; > + break; > + } > + page_start += PAGE_CACHE_SIZE; > + } > + if (!ordered) > + break;You''re leaving the range [lock_start, lock_end) locked after taking this branch. Intended?> + unlock_extent(tree, lock_start, lock_end); > + btrfs_start_ordered_extent(mapping->host, ordered, 1); > + btrfs_put_ordered_extent(ordered); > + } > + > + plist = NULL; > + while (!list_empty(lock_pages)) { > + plist = list_entry(lock_pages->prev, struct page_list, list); > + > + __extent_read_full_page(tree, plist->page, get_extent, > + bio, 0, bio_flags, 1); > + page_cache_release(plist->page); > + list_del(&plist->list); > + plist->page = NULL;you can drop this assignment> + kfree(plist); > + (*page_cnt)--; > + } > + > + WARN_ON((*page_cnt)); > + return 0; > +} > + > int extent_readpages(struct extent_io_tree *tree, > struct address_space *mapping, > struct list_head *pages, unsigned nr_pages,-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> + BUG_ON(ret); >> >> I wonder if we can patch BUG_ON() to break the build if its only >> argument is "ret". >> > > why?Well, I''m mostly joking :). That would be a very silly change to make. But only mostly joking. btrfs does have a real fragility problem from all these incomplete error handling paths: $ grep ''BUG_ON(ret.*)'' fs/btrfs/*.c | wc -l 197 We should be fixing these, not adding more. I don''t think any patches should be merged which add more of these. - z -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2012-Jun-15 01:50 UTC
Re: [PATCH 3/4] Btrfs: use large extent range for read and its endio
On 06/15/2012 12:12 AM, David Sterba wrote:> On Wed, Jun 13, 2012 at 06:19:10PM +0800, Liu Bo wrote: >> we use larger extent state range for both readpages and read endio, so that >> we can lock or unlock less and avoid most of split ops, then we''ll reduce write >> locks taken at endio time. >> >> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> >> --- >> fs/btrfs/extent_io.c | 201 +++++++++++++++++++++++++++++++++++++++++++++----- >> 1 files changed, 182 insertions(+), 19 deletions(-) >> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> index 081fe13..bb66e3c 100644 >> --- a/fs/btrfs/extent_io.c >> +++ b/fs/btrfs/extent_io.c >> @@ -2258,18 +2258,26 @@ static void end_bio_extent_readpage(struct bio *bio, int err) >> struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1; >> struct bio_vec *bvec = bio->bi_io_vec; >> struct extent_io_tree *tree; >> + struct extent_state *cached = NULL; >> u64 start; >> u64 end; >> int whole_page; >> int mirror; >> int ret; >> + u64 up_start, up_end, un_start, un_end; >> + int up_first, un_first; >> + int for_uptodate[bio->bi_vcnt]; > > The array size depends on an argument, compiler will handle it by > dynamically expanding the stackframe. What is the expected maximum value > for bi_vcnt ? There''s no hard limit AFAICS, the value gets incremented > on each page in the bio. If the function is called from the worker > thread, it''s not much of a problem even for values like 128. > > Alternate way to avoid over-limit stack consumption is to declare an > array to hold a fair number of items (eg. 16), and fall back to kmalloc > otherwise. >hmm, you''re right. I''m going to use kmalloc with KERNEL_ATOMIC.>> + int i = 0; >> + >> + up_start = un_start = (u64)-1; >> + up_end = un_end = 0; >> + up_first = un_first = 1; >> >> if (err) >> uptodate = 0; >> >> do { >> struct page *page = bvec->bv_page; >> - struct extent_state *cached = NULL; >> >> pr_debug("end_bio_extent_readpage: bi_vcnt=%d, idx=%d, err=%d, " >> "mirror=%ld\n", bio->bi_vcnt, bio->bi_idx, err, >> @@ -2352,7 +2412,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) >> } >> unlock_page(page); >> } else { >> - if (uptodate) { >> + if (for_uptodate[i++]) { >> check_page_uptodate(tree, page); >> } else { >> clearpageuptodate(page); >> @@ -2360,6 +2420,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) >> } >> check_page_locked(tree, page); >> } >> + ++bvec; > > can you please move the i++ increments here? From reading the code it''s > not clear on first sight if the sideeffects are ok. >Sure, will update it.>> } while (bvec <= bvec_end); >> >> bio_put(bio); >> @@ -2554,6 +2615,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree, >> >> end = page_end; >> while (1) { >> + if (range_lock) >> + break; > > with range_lock set, this is equivalent to calling > > 2896 set_page_extent_mapped(page); > > (plus the cleancache code), a few lines above. Is this inteded? It''s > actually called with ''1'' from a single place, process_batch_pages(). >Yeah, I want to skip the lock part, since I''ve already locked this range of extent. After that, we''ll have a continuous range of extent for read.>> lock_extent(tree, start, end); >> ordered = btrfs_lookup_ordered_extent(inode, start); >> if (!ordered) >> @@ -3497,6 +3560,59 @@ int extent_writepages(struct extent_io_tree *tree, >> return ret; >> } >> >> +struct page_list { >> + struct page *page; >> + struct list_head list; >> +}; >> + >> +static int process_batch_pages(struct extent_io_tree *tree, >> + struct address_space *mapping, >> + struct list_head *lock_pages, int *page_cnt, >> + u64 lock_start, u64 lock_end, >> + get_extent_t get_extent, struct bio **bio, >> + unsigned long *bio_flags) >> +{ >> + u64 page_start; >> + struct page_list *plist; >> + >> + while (1) { >> + struct btrfs_ordered_extent *ordered = NULL; >> + >> + lock_extent(tree, lock_start, lock_end); >> + page_start = lock_start; >> + while (page_start < lock_end) { >> + ordered = btrfs_lookup_ordered_extent(mapping->host, >> + page_start); >> + if (ordered) { >> + page_start = ordered->file_offset; >> + break; >> + } >> + page_start += PAGE_CACHE_SIZE; >> + } >> + if (!ordered) >> + break; > > You''re leaving the range [lock_start, lock_end) locked after taking this > branch. Intended? >Yeah, this is what it should be. Usually we do: o lock the range that is going to be sent to read (at read_page) o set the range uptodate and unlock it (at end_io)>> + unlock_extent(tree, lock_start, lock_end); >> + btrfs_start_ordered_extent(mapping->host, ordered, 1); >> + btrfs_put_ordered_extent(ordered); >> + } >> + >> + plist = NULL; >> + while (!list_empty(lock_pages)) { >> + plist = list_entry(lock_pages->prev, struct page_list, list); >> + >> + __extent_read_full_page(tree, plist->page, get_extent, >> + bio, 0, bio_flags, 1); >> + page_cache_release(plist->page); >> + list_del(&plist->list); >> + plist->page = NULL; > > you can drop this assignment >Sure. Thanks a lot for reviewing! :) thanks, liubo>> + kfree(plist); >> + (*page_cnt)--; >> + } >> + >> + WARN_ON((*page_cnt)); >> + return 0; >> +} >> + >> int extent_readpages(struct extent_io_tree *tree, >> struct address_space *mapping, >> struct list_head *pages, unsigned nr_pages, >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 13, 2012 at 07:50:52PM -0600, Liu Bo wrote:> On 06/14/2012 12:07 AM, Zach Brown wrote: > > > > >> int set_state_private(struct extent_io_tree *tree, u64 start, u64 > >> private) > >> { > > [...] > >> + ret = radix_tree_insert(&tree->csum, (unsigned long)start, > >> + (void *)((unsigned long)private<< 1)); > > > > Will this fail for 64bit files on 32bit hosts? > > > In theory it will fail, but crc32c return u32, so private will be originally u32, > and it''d be ok on 32bit hosts.The (unsigned long)start part looks wrong though. This is the byte offset from 0, so on a 32 bit machine you won''t be able to have large files. The page cache also has this limitation, but it gains extra bits counting page indexes instead of byte indexes. I''ve made that change here and I''m benchmarking it on my big flash ;) -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/06/2012 11:37 PM, Chris Mason wrote:> On Wed, Jun 13, 2012 at 07:50:52PM -0600, Liu Bo wrote: >> On 06/14/2012 12:07 AM, Zach Brown wrote: >> >>>> int set_state_private(struct extent_io_tree *tree, u64 start, u64 >>>> private) >>>> { >>> [...] >>>> + ret = radix_tree_insert(&tree->csum, (unsigned long)start, >>>> + (void *)((unsigned long)private<< 1)); >>> Will this fail for 64bit files on 32bit hosts? >> >> In theory it will fail, but crc32c return u32, so private will be originally u32, >> and it''d be ok on 32bit hosts. > > The (unsigned long)start part looks wrong though. This is the byte offset > from 0, so on a 32 bit machine you won''t be able to have large files. > > The page cache also has this limitation, but it gains extra bits > counting page indexes instead of byte indexes. >I see.> I''ve made that change here and I''m benchmarking it on my big flash ;) >Thanks a lot. :) I must note that this patchset is still very initial, and this week I''ve fixed a deadlock bug hidden in the 4th patch (it can be triggered by xfstests 208). I''m planning to set up a worker thread or just use ''endio_meta'' thread for merge_state and do more tuning work to lessen writer lock. thanks, liubo> -chris > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html