Liu Bo
2011-May-19 08:11 UTC
[PATCH 0/9] Btrfs: improve write ahead log with sub transaction
I''ve been working to try to improve the write-ahead log''s performance, and I found that the bottleneck addresses in the checksum items, especially when we want to make a random write on a large file, e.g a 4G file. Then a idea for this suggested by Chris is to use sub transaction ids and just to log the part of inode that had changed since either the last log commit or the last transaction commit. And as we also push the sub transid into the btree blocks, we''ll get much faster tree walks. As a result, we abandon the original brute force approach, which is "to delete all items of the inode in log", to making sure we get the most uptodate copies of everything, and instead we manage to "find and merge", i.e. finding extents in the log tree and merging in the new extents from the file. This patchset puts the above idea into code, and although the code is now more complex, it brings us a great deal of performance improvement. Beside the improvement of log, patch 8 fixes a small but critical bug of log code with sub transaction. Here I have some test results to show, I use sysbench to do "random write + fsync". ==sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run] == Sysbench args: - Number of threads: 1 - Extra file open flags: 0 - 2 files, 4Gb each - Block size 4Kb - Number of random requests for random IO: 10000 - Read/Write ratio for combined random IO test: 1.50 - Periodic FSYNC enabled, calling fsync() each 100 requests. - Calling fsync() at the end of test, Enabled. - Using synchronous I/O mode - Doing random write test Sysbench results: == Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total Read 0b Written 39.062Mb Total transferred 39.062Mb ==a) without patch: (*SPEED* : 451.01Kb/sec) 112.75 Requests/sec executed b) with patch: (*SPEED* : 4.3621Mb/sec) 1116.71 Requests/sec executed Liu Bo (10): Btrfs: introduce sub transaction stuff Btrfs: modify should_cow_block to update block''s generation Btrfs: modify btrfs_drop_extents API Btrfs: introduce first sub trans Btrfs: still update inode transid when size remains unchanged Btrfs: main log stuff Btrfs: add checksum check for log Btrfs: fix a bug of log check Btrfs: kick off useless code Btrfs: ship trans->transid to trans->transaction->transid fs/btrfs/btrfs_inode.h | 12 ++- fs/btrfs/ctree.c | 71 ++++++++++----- fs/btrfs/ctree.h | 5 +- fs/btrfs/disk-io.c | 9 +- fs/btrfs/extent-tree.c | 10 ++- fs/btrfs/file.c | 22 ++--- fs/btrfs/inode.c | 28 ++++-- fs/btrfs/ioctl.c | 6 +- fs/btrfs/relocation.c | 6 +- fs/btrfs/transaction.c | 13 ++- fs/btrfs/transaction.h | 19 ++++- fs/btrfs/tree-defrag.c | 2 +- fs/btrfs/tree-log.c | 222 ++++++++++++++++++++++++++++++++--------------- 13 files changed, 279 insertions(+), 146 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Introduce a new concept "sub transaction", the relation between transaction and sub transaction is transaction A ---> transid = x sub trans a(1) ---> sub_transid = x+1 sub trans a(2) ---> sub_transid = x+2 ... ... sub trans a(n-1) ---> sub_transid = x+n-1 sub trans a(n) ---> sub_transid = x+n transaction B ---> transid = x+n+1 ... ... And the most important is a) a trans handler''s transid now gets value from sub transid instead of transid. b) when a transaction commits, transid may not added by 1, but depend on the biggest sub_transaction of the last neighbour transaction, i.e. B->transid = a(n)->transid + 1, (B->transid - A->transid) >= 1 c) we start a new sub transaction after a fsync. We also ship some ''trans->transid'' to ''trans->transaction->transid'' to ensure btrfs works well and to get rid of WARNings. These are used for the new log code. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/ctree.c | 35 ++++++++++++++++++----------------- fs/btrfs/ctree.h | 1 + fs/btrfs/disk-io.c | 7 ++++--- fs/btrfs/extent-tree.c | 10 ++++++---- fs/btrfs/inode.c | 4 ++-- fs/btrfs/ioctl.c | 2 +- fs/btrfs/relocation.c | 6 +++--- fs/btrfs/transaction.c | 13 +++++++++---- fs/btrfs/transaction.h | 1 + fs/btrfs/tree-defrag.c | 2 +- fs/btrfs/tree-log.c | 16 ++++++++++++++-- 11 files changed, 60 insertions(+), 37 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 84d7ca1..0c3b515 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -201,9 +201,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans, int level; struct btrfs_disk_key disk_key; - WARN_ON(root->ref_cows && trans->transid !+ WARN_ON(root->ref_cows && trans->transaction->transid ! root->fs_info->running_transaction->transid); - WARN_ON(root->ref_cows && trans->transid != root->last_trans); + WARN_ON(root->ref_cows && trans->transid < root->last_trans); level = btrfs_header_level(buf); if (level == 0) @@ -398,9 +398,9 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, btrfs_assert_tree_locked(buf); - WARN_ON(root->ref_cows && trans->transid !+ WARN_ON(root->ref_cows && trans->transaction->transid ! root->fs_info->running_transaction->transid); - WARN_ON(root->ref_cows && trans->transid != root->last_trans); + WARN_ON(root->ref_cows && trans->transid < root->last_trans); level = btrfs_header_level(buf); @@ -466,7 +466,8 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, else parent_start = 0; - WARN_ON(trans->transid != btrfs_header_generation(parent)); + WARN_ON(btrfs_header_generation(parent) < + trans->transaction->transid); btrfs_set_node_blockptr(parent, parent_slot, cow->start); btrfs_set_node_ptr_generation(parent, parent_slot, @@ -487,7 +488,7 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf) { - if (btrfs_header_generation(buf) == trans->transid && + if (btrfs_header_generation(buf) >= trans->transaction->transid && !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) && !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID && btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) @@ -515,7 +516,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, root->fs_info->running_transaction->transid); WARN_ON(1); } - if (trans->transid != root->fs_info->generation) { + if (trans->transaction->transid != root->fs_info->generation) { printk(KERN_CRIT "trans %llu running %llu\n", (unsigned long long)trans->transid, (unsigned long long)root->fs_info->generation); @@ -618,7 +619,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, if (trans->transaction != root->fs_info->running_transaction) WARN_ON(1); - if (trans->transid != root->fs_info->generation) + if (trans->transaction->transid != root->fs_info->generation) WARN_ON(1); parent_nritems = btrfs_header_nritems(parent); @@ -898,7 +899,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans, mid = path->nodes[level]; WARN_ON(!path->locks[level]); - WARN_ON(btrfs_header_generation(mid) != trans->transid); + WARN_ON(btrfs_header_generation(mid) < trans->transaction->transid); orig_ptr = btrfs_node_blockptr(mid, orig_slot); @@ -1105,7 +1106,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans, return 1; mid = path->nodes[level]; - WARN_ON(btrfs_header_generation(mid) != trans->transid); + WARN_ON(btrfs_header_generation(mid) < trans->transaction->transid); if (level < BTRFS_MAX_LEVEL - 1) parent = path->nodes[level + 1]; @@ -1842,8 +1843,8 @@ static int push_node_left(struct btrfs_trans_handle *trans, src_nritems = btrfs_header_nritems(src); dst_nritems = btrfs_header_nritems(dst); push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems; - WARN_ON(btrfs_header_generation(src) != trans->transid); - WARN_ON(btrfs_header_generation(dst) != trans->transid); + WARN_ON(btrfs_header_generation(src) < trans->transaction->transid); + WARN_ON(btrfs_header_generation(dst) < trans->transaction->transid); if (!empty && src_nritems <= 8) return 1; @@ -1905,8 +1906,8 @@ static int balance_node_right(struct btrfs_trans_handle *trans, int dst_nritems; int ret = 0; - WARN_ON(btrfs_header_generation(src) != trans->transid); - WARN_ON(btrfs_header_generation(dst) != trans->transid); + WARN_ON(btrfs_header_generation(src) < trans->transaction->transid); + WARN_ON(btrfs_header_generation(dst) < trans->transaction->transid); src_nritems = btrfs_header_nritems(src); dst_nritems = btrfs_header_nritems(dst); @@ -1997,7 +1998,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans, btrfs_set_node_key(c, &lower_key, 0); btrfs_set_node_blockptr(c, 0, lower->start); lower_gen = btrfs_header_generation(lower); - WARN_ON(lower_gen != trans->transid); + WARN_ON(lower_gen < trans->transaction->transid); btrfs_set_node_ptr_generation(c, 0, lower_gen); @@ -2077,7 +2078,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans, u32 c_nritems; c = path->nodes[level]; - WARN_ON(btrfs_header_generation(c) != trans->transid); + WARN_ON(btrfs_header_generation(c) < trans->transaction->transid); if (c == root->node) { /* trying to split the root, lets make a new one */ ret = insert_new_root(trans, root, path, level + 1); @@ -3781,7 +3782,7 @@ static noinline int btrfs_del_leaf(struct btrfs_trans_handle *trans, { int ret; - WARN_ON(btrfs_header_generation(leaf) != trans->transid); + WARN_ON(btrfs_header_generation(leaf) < trans->transaction->transid); ret = del_ptr(trans, root, path, 1, path->slots[1]); if (ret) return ret; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2e61fe1..ef68108 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -912,6 +912,7 @@ struct btrfs_fs_info { struct mutex durable_block_rsv_mutex; u64 generation; + u64 sub_generation; u64 last_trans_committed; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ef6865c..54842fe 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1014,7 +1014,7 @@ int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf) { struct inode *btree_inode = root->fs_info->btree_inode; - if (btrfs_header_generation(buf) =+ if (btrfs_header_generation(buf) > root->fs_info->running_transaction->transid) { btrfs_assert_tree_locked(buf); @@ -1649,7 +1649,7 @@ static int transaction_kthread(void *arg) trans = btrfs_join_transaction(root, 1); BUG_ON(IS_ERR(trans)); - if (transid == trans->transid) { + if (transid == trans->transaction->transid) { ret = btrfs_commit_transaction(trans, root); BUG_ON(ret); } else { @@ -2064,6 +2064,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, csum_root->track_dirty = 1; fs_info->generation = generation; + fs_info->sub_generation = generation; fs_info->last_trans_committed = generation; fs_info->data_alloc_profile = (u64)-1; fs_info->metadata_alloc_profile = (u64)-1; @@ -2715,7 +2716,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf) int was_dirty; btrfs_assert_tree_locked(buf); - if (transid != root->fs_info->generation) { + if (transid < root->fs_info->generation) { printk(KERN_CRIT "btrfs transid mismatch buffer %llu, " "found %llu running %llu\n", (unsigned long long)buf->start, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 31f33ba..60f6109 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4455,7 +4455,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, list_for_each_entry_safe(block_rsv, next_rsv, &fs_info->durable_block_rsv_list, list) { - idx = trans->transid & 0x1; + idx = trans->transaction->transid & 0x1; if (block_rsv->freed[idx] > 0) { block_rsv_add_bytes(block_rsv, block_rsv->freed[idx], 0); @@ -4770,7 +4770,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, if (block_rsv->space_info != cache->space_info) goto out; - if (btrfs_header_generation(buf) == trans->transid) { + if (btrfs_header_generation(buf) >= trans->transaction->transid) { if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { ret = check_ref_cleanup(trans, root, buf->start); if (!ret) @@ -4820,7 +4820,8 @@ pin: if (ret) { spin_lock(&block_rsv->lock); - block_rsv->freed[trans->transid & 0x1] += buf->len; + block_rsv->freed[trans->transaction->transid & 0x1] ++ buf->len; spin_unlock(&block_rsv->lock); } } @@ -6252,7 +6253,8 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans, } /* make block locked assertion in clean_tree_block happy */ if (!path->locks[level] && - btrfs_header_generation(eb) == trans->transid) { + btrfs_header_generation(eb) >+ trans->transaction->transid) { btrfs_tree_lock(eb); btrfs_set_lock_blocking(eb); path->locks[level] = 1; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index a4157cf..4cec4c9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2091,7 +2091,7 @@ void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans, * space than it frees. So we should make sure there is enough * reserved space. */ - index = trans->transid & 0x1; + index = trans->transaction->transid & 0x1; if (block_rsv->reserved + block_rsv->freed[index] < block_rsv->size) { num_bytes += block_rsv->size - (block_rsv->reserved + block_rsv->freed[index]); @@ -2115,7 +2115,7 @@ void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans, /* refill source subvolume''s orphan block reservation */ block_rsv = root->orphan_block_rsv; - index = trans->transid & 0x1; + index = trans->transaction->transid & 0x1; if (block_rsv->reserved + block_rsv->freed[index] < block_rsv->size) { num_bytes = block_rsv->size - (block_rsv->reserved + block_rsv->freed[index]); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index f580a3a..142a82d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2438,7 +2438,7 @@ static noinline long btrfs_ioctl_start_sync(struct file *file, void __user *argp trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) return PTR_ERR(trans); - transid = trans->transid; + transid = trans->transaction->transid; ret = btrfs_commit_transaction_async(trans, root, 0); if (ret) { btrfs_end_transaction(trans, root); diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 58250e0..9350f91 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -468,7 +468,7 @@ static int update_backref_cache(struct btrfs_trans_handle *trans, return 0; } - if (cache->last_trans == trans->transid) + if (cache->last_trans >= trans->transaction->transid) return 0; /* @@ -1278,7 +1278,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans, BUG_ON(ret); btrfs_set_root_last_snapshot(&root->root_item, - trans->transid - 1); + trans->transaction->transid - 1); } else { /* * called by btrfs_reloc_post_snapshot_hook. @@ -2255,7 +2255,7 @@ static int record_reloc_root_in_trans(struct btrfs_trans_handle *trans, { struct btrfs_root *root; - if (reloc_root->last_trans == trans->transid) + if (reloc_root->last_trans >= trans->transaction->transid) return 0; root = read_fs_root(reloc_root->fs_info, reloc_root->root_key.offset); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index c571734..cb0103c 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -58,9 +58,11 @@ static noinline int join_transaction(struct btrfs_root *root) if (!cur_trans) return -ENOMEM; root->fs_info->generation++; + root->fs_info->sub_generation = root->fs_info->generation; atomic_set(&cur_trans->num_writers, 1); cur_trans->num_joined = 0; cur_trans->transid = root->fs_info->generation; + cur_trans->sub_transid = cur_trans->transid; init_waitqueue_head(&cur_trans->writer_wait); init_waitqueue_head(&cur_trans->commit_wait); cur_trans->in_commit = 0; @@ -102,7 +104,7 @@ static noinline int join_transaction(struct btrfs_root *root) static noinline int record_root_in_trans(struct btrfs_trans_handle *trans, struct btrfs_root *root) { - if (root->ref_cows && root->last_trans < trans->transid) { + if (root->ref_cows && root->last_trans < trans->transaction->transid) { WARN_ON(root == root->fs_info->extent_root); WARN_ON(root->commit_root != root->node); @@ -122,7 +124,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans, return 0; mutex_lock(&root->fs_info->trans_mutex); - if (root->last_trans == trans->transid) { + if (root->last_trans >= trans->transaction->transid) { mutex_unlock(&root->fs_info->trans_mutex); return 0; } @@ -207,7 +209,7 @@ again: if (type != TRANS_JOIN_NOLOCK) mutex_unlock(&root->fs_info->trans_mutex); - h->transid = cur_trans->transid; + h->transid = cur_trans->sub_transid; h->transaction = cur_trans; h->blocks_used = 0; h->block_group = 0; @@ -1350,6 +1352,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, btrfs_prepare_extent_commit(trans, root); cur_trans = root->fs_info->running_transaction; + + root->fs_info->generation = cur_trans->sub_transid; + spin_lock(&root->fs_info->new_trans_lock); root->fs_info->running_transaction = NULL; spin_unlock(&root->fs_info->new_trans_lock); @@ -1393,7 +1398,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, cur_trans->commit_done = 1; - root->fs_info->last_trans_committed = cur_trans->transid; + root->fs_info->last_trans_committed = cur_trans->sub_transid; wake_up(&cur_trans->commit_wait); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index e441acc..6dcdd28 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -23,6 +23,7 @@ struct btrfs_transaction { u64 transid; + u64 sub_transid; /* * total writers in this transaction, it must be zero before the * transaction can end diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c index 992ab42..15f1912 100644 --- a/fs/btrfs/tree-defrag.c +++ b/fs/btrfs/tree-defrag.c @@ -139,7 +139,7 @@ done: if (ret != -EAGAIN) { memset(&root->defrag_progress, 0, sizeof(root->defrag_progress)); - root->defrag_trans_start = trans->transid; + root->defrag_trans_start = trans->transaction->transid; } return ret; } diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c50271a..4ecd2c5 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -134,9 +134,19 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, static int start_log_trans(struct btrfs_trans_handle *trans, struct btrfs_root *root) { + struct btrfs_transaction *cur_trans; int ret; int err = 0; + /* start a new sub transaction */ + mutex_lock(&root->fs_info->trans_mutex); + + cur_trans = root->fs_info->running_transaction; + cur_trans->sub_transid++; + root->fs_info->sub_generation = cur_trans->sub_transid; + + mutex_unlock(&root->fs_info->trans_mutex); + mutex_lock(&root->log_mutex); if (root->log_root) { if (!root->log_start_pid) { @@ -1985,7 +1995,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, } /* bail out if we need to do a full commit */ - if (root->fs_info->last_trans_log_full_commit == trans->transid) { + if (root->fs_info->last_trans_log_full_commit >+ trans->transaction->transid) { ret = -EAGAIN; mutex_unlock(&root->log_mutex); goto out; @@ -2062,7 +2073,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, * now that we''ve moved on to the tree of log tree roots, * check the full commit flag again */ - if (root->fs_info->last_trans_log_full_commit == trans->transid) { + if (root->fs_info->last_trans_log_full_commit >+ trans->transaction->transid) { btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark); mutex_unlock(&log_root_tree->log_mutex); ret = -EAGAIN; -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2011-May-19 08:11 UTC
[PATCH 2/9] Btrfs: update block generation if should_cow_block fails
Cause we''ve added sub transaction, if it do not want to cow a block, we also need to get new sub transid recorded. This is used for log code to find the most uptodate file extents. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/ctree.c | 34 +++++++++++++++++++++++++++++++++- 1 files changed, 33 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 0c3b515..7e21fa9 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -484,6 +484,33 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, return 0; } +static inline void update_block_generation(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct extent_buffer *buf, + struct extent_buffer *parent, + int slot) +{ + /* + * If it does not need to cow this block, we still need to + * update the block''s generation, for transid may have been + * changed during fsync. + */ + if (btrfs_header_generation(buf) == trans->transid) + return; + + if (buf == root->node) { + btrfs_set_header_generation(buf, trans->transid); + btrfs_mark_buffer_dirty(buf); + add_root_to_dirty_list(root); + } else { + btrfs_set_node_ptr_generation(parent, slot, + trans->transid); + btrfs_set_header_generation(buf, trans->transid); + btrfs_mark_buffer_dirty(parent); + btrfs_mark_buffer_dirty(buf); + } +} + static inline int should_cow_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf) @@ -524,6 +551,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, } if (!should_cow_block(trans, root, buf)) { + update_block_generation(trans, root, buf, parent, parent_slot); *cow_ret = buf; return 0; } @@ -1639,8 +1667,12 @@ again: * then we don''t want to set the path blocking, * so we test it here */ - if (!should_cow_block(trans, root, b)) + if (!should_cow_block(trans, root, b)) { + update_block_generation(trans, root, b, + p->nodes[level + 1], + p->slots[level + 1]); goto cow_done; + } btrfs_set_path_blocking(p); -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
We want to use btrfs_drop_extent() in log code. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/file.c | 9 +++++++-- fs/btrfs/inode.c | 6 +++--- fs/btrfs/ioctl.c | 4 ++-- fs/btrfs/tree-log.c | 2 +- 5 files changed, 15 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index ef68108..1ba3f91 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2575,7 +2575,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, int btrfs_check_file(struct btrfs_root *root, struct inode *inode); extern const struct file_operations btrfs_file_operations; int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode, - u64 start, u64 end, u64 *hint_byte, int drop_cache); + u64 start, u64 end, u64 *hint_byte, int drop_cache, + int log); int btrfs_mark_extent_written(struct btrfs_trans_handle *trans, struct inode *inode, u64 start, u64 end); int btrfs_release_file(struct inode *inode, struct file *file); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 75899a0..d19cf3a 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -290,7 +290,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, * is deleted from the tree. */ int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode, - u64 start, u64 end, u64 *hint_byte, int drop_cache) + u64 start, u64 end, u64 *hint_byte, int drop_cache, + int log) { struct btrfs_root *root = BTRFS_I(inode)->root; struct extent_buffer *leaf; @@ -309,6 +310,10 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode, int recow; int ret; + /* drop the existed extents in log tree */ + if (log) + root = root->log_root; + if (drop_cache) btrfs_drop_extent_cache(inode, start, end - 1, 0); @@ -489,7 +494,7 @@ next_slot: extent_end - key.offset); extent_end = ALIGN(extent_end, root->sectorsize); - } else if (disk_bytenr > 0) { + } else if (disk_bytenr > 0 && !log) { ret = btrfs_free_extent(trans, root, disk_bytenr, num_bytes, 0, root->root_key.objectid, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4cec4c9..d823467 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -244,7 +244,7 @@ static noinline int cow_file_range_inline(struct btrfs_trans_handle *trans, } ret = btrfs_drop_extents(trans, inode, start, aligned_end, - &hint_byte, 1); + &hint_byte, 1, 0); BUG_ON(ret); if (isize > actual_end) @@ -1639,7 +1639,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, * with the others. */ ret = btrfs_drop_extents(trans, inode, file_pos, file_pos + num_bytes, - &hint, 0); + &hint, 0, 0); BUG_ON(ret); ins.objectid = inode->i_ino; @@ -3649,7 +3649,7 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size) err = btrfs_drop_extents(trans, inode, cur_offset, cur_offset + hole_size, - &hint_byte, 1); + &hint_byte, 1, 0); if (err) break; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 142a82d..d5a6a19 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2014,7 +2014,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, ret = btrfs_drop_extents(trans, inode, new_key.offset, new_key.offset + datal, - &hint_byte, 1); + &hint_byte, 1, 0); BUG_ON(ret); ret = btrfs_insert_empty_item(trans, root, path, @@ -2069,7 +2069,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, ret = btrfs_drop_extents(trans, inode, new_key.offset, new_key.offset + datal, - &hint_byte, 1); + &hint_byte, 1, 0); BUG_ON(ret); ret = btrfs_insert_empty_item(trans, root, path, diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 4ecd2c5..51d5024 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -563,7 +563,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans, saved_nbytes = inode_get_bytes(inode); /* drop any overlapping extents */ ret = btrfs_drop_extents(trans, inode, start, extent_end, - &alloc_hint, 1); + &alloc_hint, 1, 0); BUG_ON(ret); if (found_type == BTRFS_FILE_EXTENT_REG || -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
In multi-thread situations, writeback of a file may span across several sub transactions, and we need to introduce first_sub_trans to get sub_transid of the first sub transaction recorded, so that log code can skip file extents which have been logged or committed into disk. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/btrfs_inode.h | 9 +++++++++ fs/btrfs/inode.c | 13 ++++++++++++- fs/btrfs/transaction.h | 17 ++++++++++++++++- 3 files changed, 37 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 57c3bb2..fb5617a 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -79,6 +79,15 @@ struct btrfs_inode { /* sequence number for NFS changes */ u64 sequence; + /* used to avoid race of first_sub_trans */ + spinlock_t sub_trans_lock; + + /* + * sub transid of the trans that first modified this inode before + * a trans commit or a log sync + */ + u64 first_sub_trans; + /* * transid of the trans_handle that last modified this inode */ diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d823467..acd5a38 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6569,7 +6569,16 @@ again: set_page_dirty(page); SetPageUptodate(page); - BTRFS_I(inode)->last_trans = root->fs_info->generation; + spin_lock(&BTRFS_I(inode)->sub_trans_lock); + + if (BTRFS_I(inode)->first_sub_trans > root->fs_info->sub_generation || + BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans || + BTRFS_I(inode)->last_trans <= root->fs_info->last_trans_committed) + BTRFS_I(inode)->first_sub_trans = root->fs_info->sub_generation; + + spin_unlock(&BTRFS_I(inode)->sub_trans_lock); + + BTRFS_I(inode)->last_trans = root->fs_info->sub_generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS); @@ -6763,6 +6772,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) ei->space_info = NULL; ei->generation = 0; ei->sequence = 0; + ei->first_sub_trans = 0; ei->last_trans = 0; ei->last_sub_trans = 0; ei->logged_trans = 0; @@ -6786,6 +6796,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS); extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS); mutex_init(&ei->log_mutex); + spin_lock_init(&ei->sub_trans_lock); btrfs_ordered_inode_tree_init(&ei->ordered_tree); INIT_LIST_HEAD(&ei->i_orphan); INIT_LIST_HEAD(&ei->delalloc_inodes); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 6dcdd28..d531aea 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -83,7 +83,22 @@ static inline void btrfs_update_inode_block_group( static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans, struct inode *inode) { - BTRFS_I(inode)->last_trans = trans->transaction->transid; + spin_lock(&BTRFS_I(inode)->sub_trans_lock); + + /* + * We have joined in a transaction, so btrfs_commit_transaction will + * definitely wait for us and it does not need to add a extra + * trans_mutex lock here. + */ + if (BTRFS_I(inode)->first_sub_trans > trans->transid || + BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans || + BTRFS_I(inode)->last_trans <+ BTRFS_I(inode)->root->fs_info->last_trans_committed) + BTRFS_I(inode)->first_sub_trans = trans->transid; + + spin_unlock(&BTRFS_I(inode)->sub_trans_lock); + + BTRFS_I(inode)->last_trans = trans->transid; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; } -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2011-May-19 08:11 UTC
[PATCH 5/9] Btrfs: still update inode trans stuff when size remains unchanged
Due to DIO stuff, commit 1ef30be142d2cc60e2687ef267de864cf31be995 makes btrfs not call btrfs_update_inode when it does not update i_disk_size, but in buffer write case, we need to update btrfs internal inode''s trans stuff, so that the log code can find the inode''s changes. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/inode.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index acd5a38..32eac29 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1773,7 +1773,8 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end) if (!ret) { ret = btrfs_update_inode(trans, root, inode); BUG_ON(ret); - } + } else + btrfs_set_inode_last_trans(trans, inode); ret = 0; out: if (nolock) { -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
When logging an inode _A_, current btrfs will a) clear all items belonged to _A_ in log, b) copy all items belonged to _A_ from fs/file tree to log tree, and this just wastes a lot of time, especially when logging big files. So we want to use a smarter approach, i.e. "find and merge". The amount of file extent items is the largest, so we focus on it. Thanks to sub transaction, now we can find those file extent items which are changed after last _transaction commit_ or last _log commit_, and then merge them with the existed ones in log tree. It will be great helpful on fsync performance, cause the common case is "make changes on a _part_ of inode". Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/tree-log.c | 177 ++++++++++++++++++++++++++++++++++++--------------- 1 files changed, 126 insertions(+), 51 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 51d5024..745933c 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2561,60 +2561,106 @@ again: } /* - * a helper function to drop items from the log before we relog an - * inode. max_key_type indicates the highest item type to remove. - * This cannot be run for file data extents because it does not - * free the extents they point to. + * a helper function to drop items from the log before we merge + * the uptodate items into the log tree. */ -static int drop_objectid_items(struct btrfs_trans_handle *trans, - struct btrfs_root *log, - struct btrfs_path *path, - u64 objectid, int max_key_type) +static int prepare_for_merge_items(struct btrfs_trans_handle *trans, + struct inode *inode, + struct extent_buffer *eb, + int slot, int nr) { - int ret; - struct btrfs_key key; + struct btrfs_root *log = BTRFS_I(inode)->root->log_root; + struct btrfs_path *path; struct btrfs_key found_key; + struct btrfs_key key; + int i; + int ret; - key.objectid = objectid; - key.type = max_key_type; - key.offset = (u64)-1; + /* There are no relative items of the inode in log. */ + if (BTRFS_I(inode)->logged_trans < trans->transaction->transid) + return 0; - while (1) { + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + for (i = 0; i < nr; i++) { + btrfs_item_key_to_cpu(eb, &key, i + slot); + + if (btrfs_key_type(&key) == BTRFS_EXTENT_DATA_KEY) { + struct btrfs_file_extent_item *fi; + int found_type; + u64 mask = BTRFS_I(inode)->root->sectorsize - 1; + u64 start = key.offset; + u64 extent_end; + u64 hint; + unsigned long size; + + fi = btrfs_item_ptr(eb, slot + i, + struct btrfs_file_extent_item); + found_type = btrfs_file_extent_type(eb, fi); + + if (found_type == BTRFS_FILE_EXTENT_REG || + found_type == BTRFS_FILE_EXTENT_PREALLOC) + extent_end = start + + btrfs_file_extent_num_bytes(eb, fi); + else if (found_type == BTRFS_FILE_EXTENT_INLINE) { + size = btrfs_file_extent_inline_len(eb, fi); + extent_end = (start + size + mask) & ~mask; + } else + BUG_ON(1); + + /* drop any overlapping extents */ + ret = btrfs_drop_extents(trans, inode, start, + extent_end, &hint, 0, 1); + BUG_ON(ret); + + continue; + } + + /* non file extent */ ret = btrfs_search_slot(trans, log, &key, path, -1, 1); - BUG_ON(ret == 0); if (ret < 0) break; - if (path->slots[0] == 0) + /* empty log! */ + if (ret > 0 && path->slots[0] == 0) break; - path->slots[0]--; + if (ret > 0) { + btrfs_release_path(log, path); + continue; + } + btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); - if (found_key.objectid != objectid) - break; + if (btrfs_comp_cpu_keys(&found_key, &key)) + BUG_ON(1); ret = btrfs_del_item(trans, log, path); BUG_ON(ret); btrfs_release_path(log, path); } btrfs_release_path(log, path); - return ret; + btrfs_free_path(path); + + return 0; } static noinline int copy_items(struct btrfs_trans_handle *trans, - struct btrfs_root *log, + struct inode *inode, struct btrfs_path *dst_path, struct extent_buffer *src, int start_slot, int nr, int inode_only) { unsigned long src_offset; unsigned long dst_offset; + struct btrfs_root *log = BTRFS_I(inode)->root->log_root; struct btrfs_file_extent_item *extent; struct btrfs_inode_item *inode_item; - int ret; struct btrfs_key *ins_keys; + int ret; u32 *ins_sizes; char *ins_data; int i; @@ -2622,6 +2668,10 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, INIT_LIST_HEAD(&ordered_sums); + ret = prepare_for_merge_items(trans, inode, src, start_slot, nr); + if (ret) + return ret; + ins_data = kmalloc(nr * sizeof(struct btrfs_key) + nr * sizeof(u32), GFP_NOFS); if (!ins_data) @@ -2725,6 +2775,34 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, return ret; } +/* + * a helper function to filter the old file extent items by checking their + * generation. + */ +static inline int is_extent_uptodate(struct btrfs_path *path, u64 min_trans) +{ + struct btrfs_file_extent_item *fi; + struct btrfs_key key; + struct extent_buffer *eb; + int slot; + u64 gen; + + eb = path->nodes[0]; + slot = path->slots[0]; + + btrfs_item_key_to_cpu(eb, &key, slot); + + if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY) + return 1; + + fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item); + gen = btrfs_file_extent_generation(eb, fi); + if (gen < min_trans) + return 0; + + return 1; +} + /* log a single inode in the tree log. * At least one parent directory for this inode must exist in the tree * or be logged already. @@ -2754,6 +2832,16 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, int nritems; int ins_start_slot = 0; int ins_nr; + u64 transid; + + /* + * We use transid in btrfs_search_forward() as a filter, in order to + * find the uptodate block (node or leaf). + */ + if (BTRFS_I(inode)->first_sub_trans > trans->transaction->transid) + transid = BTRFS_I(inode)->first_sub_trans; + else + transid = trans->transaction->transid; log = root->log_root; @@ -2784,30 +2872,12 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, mutex_lock(&BTRFS_I(inode)->log_mutex); - /* - * a brute force approach to making sure we get the most uptodate - * copies of everything. - */ - if (S_ISDIR(inode->i_mode)) { - int max_key_type = BTRFS_DIR_LOG_INDEX_KEY; - - if (inode_only == LOG_INODE_EXISTS) - max_key_type = BTRFS_XATTR_ITEM_KEY; - ret = drop_objectid_items(trans, log, path, - inode->i_ino, max_key_type); - } else { - ret = btrfs_truncate_inode_items(trans, log, inode, 0, 0); - } - if (ret) { - err = ret; - goto out_unlock; - } path->keep_locks = 1; while (1) { ins_nr = 0; ret = btrfs_search_forward(root, &min_key, &max_key, - path, 0, trans->transid); + path, 0, transid); if (ret != 0) break; again: @@ -2818,6 +2888,9 @@ again: break; src = path->nodes[0]; + if (!is_extent_uptodate(path, transid)) + goto filter; + if (ins_nr && ins_start_slot + ins_nr == path->slots[0]) { ins_nr++; goto next_slot; @@ -2826,15 +2899,17 @@ again: ins_nr = 1; goto next_slot; } - - ret = copy_items(trans, log, dst_path, src, ins_start_slot, - ins_nr, inode_only); - if (ret) { - err = ret; - goto out_unlock; +filter: + if (ins_nr) { + ret = copy_items(trans, inode, dst_path, src, + ins_start_slot, + ins_nr, inode_only); + if (ret) { + err = ret; + goto out_unlock; + } + ins_nr = 0; } - ins_nr = 1; - ins_start_slot = path->slots[0]; next_slot: nritems = btrfs_header_nritems(path->nodes[0]); @@ -2845,7 +2920,7 @@ next_slot: goto again; } if (ins_nr) { - ret = copy_items(trans, log, dst_path, src, + ret = copy_items(trans, inode, dst_path, src, ins_start_slot, ins_nr, inode_only); if (ret) { @@ -2866,7 +2941,7 @@ next_slot: break; } if (ins_nr) { - ret = copy_items(trans, log, dst_path, src, + ret = copy_items(trans, inode, dst_path, src, ins_start_slot, ins_nr, inode_only); if (ret) { -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
If a inode is a BTRFS_INODE_NODATASUM one, it need not to look for csum items any more. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/tree-log.c | 13 ++++++++----- 1 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 745933c..28b088b 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2652,7 +2652,8 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, struct inode *inode, struct btrfs_path *dst_path, struct extent_buffer *src, - int start_slot, int nr, int inode_only) + int start_slot, int nr, int inode_only, + int csum) { unsigned long src_offset; unsigned long dst_offset; @@ -2719,7 +2720,8 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, * or deletes of this inode don''t have to relog the inode * again */ - if (btrfs_key_type(ins_keys + i) == BTRFS_EXTENT_DATA_KEY) { + if (btrfs_key_type(ins_keys + i) =+ BTRFS_EXTENT_DATA_KEY && csum) { int found_type; extent = btrfs_item_ptr(src, start_slot + i, struct btrfs_file_extent_item); @@ -2833,6 +2835,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, int ins_start_slot = 0; int ins_nr; u64 transid; + int csum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ? 0 : 1; /* * We use transid in btrfs_search_forward() as a filter, in order to @@ -2903,7 +2906,7 @@ filter: if (ins_nr) { ret = copy_items(trans, inode, dst_path, src, ins_start_slot, - ins_nr, inode_only); + ins_nr, inode_only, csum); if (ret) { err = ret; goto out_unlock; @@ -2922,7 +2925,7 @@ next_slot: if (ins_nr) { ret = copy_items(trans, inode, dst_path, src, ins_start_slot, - ins_nr, inode_only); + ins_nr, inode_only, csum); if (ret) { err = ret; goto out_unlock; @@ -2943,7 +2946,7 @@ next_slot: if (ins_nr) { ret = copy_items(trans, inode, dst_path, src, ins_start_slot, - ins_nr, inode_only); + ins_nr, inode_only, csum); if (ret) { err = ret; goto out_unlock; -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
The current code uses struct root''s last_log_commit to check if an inode has been logged, but the problem is that this root->last_log_commit is shared among files. Say we have N inodes to be logged, after the first inode, root-last_log_commit is updated and the N-1 remains will not be logged. As we''ve introduce sub transaction and filled inode''s last_trans and logged_trans with sub_transid instead of transaction id, we can just compare last_trans with logged_trans to determine if the processing inode is logged. And the more important thing is these two values are inode-individual, so it will not interfere with others. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/btrfs_inode.h | 5 ----- fs/btrfs/ctree.h | 1 - fs/btrfs/disk-io.c | 2 -- fs/btrfs/inode.c | 2 -- fs/btrfs/transaction.h | 1 - fs/btrfs/tree-log.c | 16 +++------------- 6 files changed, 3 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index fb5617a..d3a570c 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -94,11 +94,6 @@ struct btrfs_inode { u64 last_trans; /* - * log transid when this inode was last modified - */ - u64 last_sub_trans; - - /* * transid that last logged this inode */ u64 logged_trans; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1ba3f91..73aa36b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1114,7 +1114,6 @@ struct btrfs_root { atomic_t log_writers; atomic_t log_commit[2]; unsigned long log_transid; - unsigned long last_log_commit; unsigned long log_batch; pid_t log_start_pid; bool log_multiple_pids; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 54842fe..ac8d2ac 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1079,7 +1079,6 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize, atomic_set(&root->log_writers, 0); root->log_batch = 0; root->log_transid = 0; - root->last_log_commit = 0; extent_io_tree_init(&root->dirty_log_pages, fs_info->btree_inode->i_mapping, GFP_NOFS); @@ -1216,7 +1215,6 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans, WARN_ON(root->log_root); root->log_root = log_root; root->log_transid = 0; - root->last_log_commit = 0; return 0; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 32eac29..40f6f8f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6580,7 +6580,6 @@ again: spin_unlock(&BTRFS_I(inode)->sub_trans_lock); BTRFS_I(inode)->last_trans = root->fs_info->sub_generation; - BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS); @@ -6775,7 +6774,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) ei->sequence = 0; ei->first_sub_trans = 0; ei->last_trans = 0; - ei->last_sub_trans = 0; ei->logged_trans = 0; ei->delalloc_bytes = 0; ei->reserved_bytes = 0; diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index d531aea..e169553 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -99,7 +99,6 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans, spin_unlock(&BTRFS_I(inode)->sub_trans_lock); BTRFS_I(inode)->last_trans = trans->transid; - BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; } int btrfs_end_transaction(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 28b088b..912397c 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -1967,7 +1967,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, int ret; struct btrfs_root *log = root->log_root; struct btrfs_root *log_root_tree = root->fs_info->log_root_tree; - unsigned long log_transid = 0; mutex_lock(&root->log_mutex); index1 = root->log_transid % 2; @@ -2002,8 +2001,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, goto out; } - log_transid = root->log_transid; - if (log_transid % 2 == 0) + if (root->log_transid % 2 == 0) mark = EXTENT_DIRTY; else mark = EXTENT_NEW; @@ -2108,11 +2106,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, write_ctree_super(trans, root->fs_info->tree_root, 1); ret = 0; - mutex_lock(&root->log_mutex); - if (root->last_log_commit < log_transid) - root->last_log_commit = log_transid; - mutex_unlock(&root->log_mutex); - out_wake_log_root: atomic_set(&log_root_tree->log_commit[index2], 0); smp_mb(); @@ -3042,14 +3035,11 @@ out: static int inode_in_log(struct btrfs_trans_handle *trans, struct inode *inode) { - struct btrfs_root *root = BTRFS_I(inode)->root; int ret = 0; - mutex_lock(&root->log_mutex); - if (BTRFS_I(inode)->logged_trans == trans->transid && - BTRFS_I(inode)->last_sub_trans <= root->last_log_commit) + if (BTRFS_I(inode)->logged_trans >= trans->transaction->transid && + BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans) ret = 1; - mutex_unlock(&root->log_mutex); return ret; } -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
fsync will wait for writeback till it finishes, and last_trans will get the real transid recorded in writeback, so it does not need an extra +1 to ensure fsync''s process on the file. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> --- fs/btrfs/file.c | 13 ------------- 1 files changed, 0 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index d19cf3a..73c46e2 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1146,19 +1146,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, mutex_unlock(&inode->i_mutex); - /* - * we want to make sure fsync finds this change - * but we haven''t joined a transaction running right now. - * - * Later on, someone is sure to update the inode and get the - * real transid recorded. - * - * We set last_trans now to the fs_info generation + 1, - * this will either be one more than the running transaction - * or the generation used for the next transaction if there isn''t - * one running right now. - */ - BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; if (num_written > 0 || num_written == -EIOCBQUEUED) { err = generic_write_sync(file, pos, num_written); if (err < 0 && num_written > 0) -- 1.6.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
liubo
2011-May-19 08:14 UTC
Re: [PATCH 0/9] Btrfs: improve write ahead log with sub transaction
On 05/19/2011 04:11 PM, Liu Bo wrote:> I''ve been working to try to improve the write-ahead log''s performance, > and I found that the bottleneck addresses in the checksum items, > especially when we want to make a random write on a large file, e.g a 4G file. > > Then a idea for this suggested by Chris is to use sub transaction ids and just > to log the part of inode that had changed since either the last log commit or > the last transaction commit. And as we also push the sub transid into the btree > blocks, we''ll get much faster tree walks. As a result, we abandon the original > brute force approach, which is "to delete all items of the inode in log", > to making sure we get the most uptodate copies of everything, and instead > we manage to "find and merge", i.e. finding extents in the log tree and merging > in the new extents from the file. > > This patchset puts the above idea into code, and although the code is now more > complex, it brings us a great deal of performance improvement. > > Beside the improvement of log, patch 8 fixes a small but critical bug of log code > with sub transaction. > > Here I have some test results to show, I use sysbench to do "random write + fsync". > > ==> sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run] > ==> > Sysbench args: > - Number of threads: 1 > - Extra file open flags: 0 > - 2 files, 4Gb each > - Block size 4Kb > - Number of random requests for random IO: 10000 > - Read/Write ratio for combined random IO test: 1.50 > - Periodic FSYNC enabled, calling fsync() each 100 requests. > - Calling fsync() at the end of test, Enabled. > - Using synchronous I/O mode > - Doing random write test > > Sysbench results: > ==> Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total > Read 0b Written 39.062Mb Total transferred 39.062Mb > ==> a) without patch: (*SPEED* : 451.01Kb/sec) > 112.75 Requests/sec executed > > b) with patch: (*SPEED* : 4.3621Mb/sec) > 1116.71 Requests/sec executed > > > Liu Bo (10): > Btrfs: introduce sub transaction stuff > Btrfs: modify should_cow_block to update block''s generation > Btrfs: modify btrfs_drop_extents API > Btrfs: introduce first sub trans > Btrfs: still update inode transid when size remains unchanged > Btrfs: main log stuff > Btrfs: add checksum check for log > Btrfs: fix a bug of log check > Btrfs: kick off useless code > Btrfs: ship trans->transid to trans->transaction->transid > > fs/btrfs/btrfs_inode.h | 12 ++- > fs/btrfs/ctree.c | 71 ++++++++++----- > fs/btrfs/ctree.h | 5 +- > fs/btrfs/disk-io.c | 9 +- > fs/btrfs/extent-tree.c | 10 ++- > fs/btrfs/file.c | 22 ++--- > fs/btrfs/inode.c | 28 ++++-- > fs/btrfs/ioctl.c | 6 +- > fs/btrfs/relocation.c | 6 +- > fs/btrfs/transaction.c | 13 ++- > fs/btrfs/transaction.h | 19 ++++- > fs/btrfs/tree-defrag.c | 2 +- > fs/btrfs/tree-log.c | 222 ++++++++++++++++++++++++++++++++--------------- > 13 files changed, 279 insertions(+), 146 deletions(-) > >Sorry for the wrong analysis info, here is the right one: Liu Bo (9): Btrfs: introduce sub transaction stuff Btrfs: update block generation if should_cow_block fails Btrfs: modify btrfs_drop_extents API Btrfs: introduce first sub trans Btrfs: still update inode trans stuff when size remains unchanged Btrfs: improve log with sub transaction Btrfs: add checksum check for log Btrfs: fix a bug of log check Btrfs: kick off useless code fs/btrfs/btrfs_inode.h | 12 ++- fs/btrfs/ctree.c | 69 +++++++++++---- fs/btrfs/ctree.h | 5 +- fs/btrfs/disk-io.c | 9 +- fs/btrfs/extent-tree.c | 10 ++- fs/btrfs/file.c | 22 ++--- fs/btrfs/inode.c | 28 ++++-- fs/btrfs/ioctl.c | 6 +- fs/btrfs/relocation.c | 6 +- fs/btrfs/transaction.c | 13 ++- fs/btrfs/transaction.h | 19 ++++- fs/btrfs/tree-defrag.c | 2 +- fs/btrfs/tree-log.c | 222 ++++++++++++++++++++++++++++++++--------------- 13 files changed, 282 insertions(+), 141 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Liu Bo''s message of 2011-05-19 04:11:24 -0400:> Introduce a new concept "sub transaction", > the relation between transaction and sub transaction is > > transaction A ---> transid = x > sub trans a(1) ---> sub_transid = x+1 > sub trans a(2) ---> sub_transid = x+2 > ... ... > sub trans a(n-1) ---> sub_transid = x+n-1 > sub trans a(n) ---> sub_transid = x+n > transaction B ---> transid = x+n+1 > ... ... > > And the most important is > a) a trans handler''s transid now gets value from sub transid instead of transid. > b) when a transaction commits, transid may not added by 1, but depend on the > biggest sub_transaction of the last neighbour transaction, > i.e. > B->transid = a(n)->transid + 1, > (B->transid - A->transid) >= 1 > c) we start a new sub transaction after a fsync. > > We also ship some ''trans->transid'' to ''trans->transaction->transid'' to > ensure btrfs works well and to get rid of WARNings. > > These are used for the new log code.This is exactly what I had in mind. I need to read it harder and make sure it interacts well with the directory logging code, but I love it. Thanks! -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/20/2011 08:23 AM, Chris Mason wrote:> Excerpts from Liu Bo''s message of 2011-05-19 04:11:24 -0400: >> Introduce a new concept "sub transaction", >> the relation between transaction and sub transaction is >> >> transaction A ---> transid = x >> sub trans a(1) ---> sub_transid = x+1 >> sub trans a(2) ---> sub_transid = x+2 >> ... ... >> sub trans a(n-1) ---> sub_transid = x+n-1 >> sub trans a(n) ---> sub_transid = x+n >> transaction B ---> transid = x+n+1 >> ... ... >> >> And the most important is >> a) a trans handler''s transid now gets value from sub transid instead of transid. >> b) when a transaction commits, transid may not added by 1, but depend on the >> biggest sub_transaction of the last neighbour transaction, >> i.e. >> B->transid = a(n)->transid + 1, >> (B->transid - A->transid) >= 1 >> c) we start a new sub transaction after a fsync. >> >> We also ship some ''trans->transid'' to ''trans->transaction->transid'' to >> ensure btrfs works well and to get rid of WARNings. >> >> These are used for the new log code. > > This is exactly what I had in mind. I need to read it harder and make > sure it interacts well with the directory logging code, but I love it. > > Thanks! >It''s so great that you like it. :) But I must NOTE again: Due to the bug which patch 8 fixed, the previous preformance statistics I posted sometime ago, like (*SPEED* : 4.7+ Mb/sec), are valueless and cannot be used as a basis any more... Hope that more people can get the patchset tested. thanks, liubo> -chris >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Chris Mason''s message of 2011-05-19 20:23:29 -0400:> Excerpts from Liu Bo''s message of 2011-05-19 04:11:24 -0400: > > Introduce a new concept "sub transaction", > > the relation between transaction and sub transaction is > > > > transaction A ---> transid = x > > sub trans a(1) ---> sub_transid = x+1 > > sub trans a(2) ---> sub_transid = x+2 > > ... ... > > sub trans a(n-1) ---> sub_transid = x+n-1 > > sub trans a(n) ---> sub_transid = x+n > > transaction B ---> transid = x+n+1 > > ... ... > > > > And the most important is > > a) a trans handler''s transid now gets value from sub transid instead of transid. > > b) when a transaction commits, transid may not added by 1, but depend on the > > biggest sub_transaction of the last neighbour transaction, > > i.e. > > B->transid = a(n)->transid + 1, > > (B->transid - A->transid) >= 1 > > c) we start a new sub transaction after a fsync. > > > > We also ship some ''trans->transid'' to ''trans->transaction->transid'' to > > ensure btrfs works well and to get rid of WARNings. > > > > These are used for the new log code. > > This is exactly what I had in mind. I need to read it harder and make > sure it interacts well with the directory logging code, but I love it.Ok, I hit a few problems with this, and since the transids are used everywhere for various reasons, I think we need to wait until 2.6.41. This code is really very close to right, but we have the delayed inode work, scrub, and the new inode number allocator all at once. I''d like to limit the size of the changes. The problems I hit: When an inode is dropped from cache (just via iput) and then read in again, the BTRFS_I(inode)->logged_trans goes back to zero. When this happens the logging code assumes the inode isn''t in the log and hits -EEXIST if it finds inode items. I patched it to just delete away all the logged items if the logged transid wasn''t set, which is probably safest given that we can now reuse inode numbers. Second, we use the generation number of the super to read in the log tree root after a crash. This doesn''t always match the sub trans id and so it doesn''t always match the transid stored in the btree blocks. There are a few solutions to this, we can use some of the reserved fields in the super for the generation numbers of the roots the super points to, and use whichever one is bigger when we read things in. Liubo, since we''ll leave this one for .41, I''ll take your smaller patch that just skips the csum items. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-May-23 16:43 UTC
Re: [PATCH 0/9] Btrfs: improve write ahead log with sub transaction
On 05/19/2011 04:11 AM, Liu Bo wrote:> I''ve been working to try to improve the write-ahead log''s performance, > and I found that the bottleneck addresses in the checksum items, > especially when we want to make a random write on a large file, e.g a 4G file. > > Then a idea for this suggested by Chris is to use sub transaction ids and just > to log the part of inode that had changed since either the last log commit or > the last transaction commit. And as we also push the sub transid into the btree > blocks, we''ll get much faster tree walks. As a result, we abandon the original > brute force approach, which is "to delete all items of the inode in log", > to making sure we get the most uptodate copies of everything, and instead > we manage to "find and merge", i.e. finding extents in the log tree and merging > in the new extents from the file. > > This patchset puts the above idea into code, and although the code is now more > complex, it brings us a great deal of performance improvement. > > Beside the improvement of log, patch 8 fixes a small but critical bug of log code > with sub transaction. > > Here I have some test results to show, I use sysbench to do "random write + fsync". > > ==> sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run] > ==> > Sysbench args: > - Number of threads: 1 > - Extra file open flags: 0 > - 2 files, 4Gb each > - Block size 4Kb > - Number of random requests for random IO: 10000 > - Read/Write ratio for combined random IO test: 1.50 > - Periodic FSYNC enabled, calling fsync() each 100 requests. > - Calling fsync() at the end of test, Enabled. > - Using synchronous I/O mode > - Doing random write test > > Sysbench results: > ==> Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total > Read 0b Written 39.062Mb Total transferred 39.062Mb > ==> a) without patch: (*SPEED* : 451.01Kb/sec) > 112.75 Requests/sec executed > > b) with patch: (*SPEED* : 4.3621Mb/sec) > 1116.71 Requests/sec executed >Have you run powerfail tests with this? I''d like to make sure you haven''t inadvertently messed something up. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
liubo
2011-May-24 01:29 UTC
Re: [PATCH 0/9] Btrfs: improve write ahead log with sub transaction
On 05/23/2011 12:43 PM, Josef Bacik wrote:> On 05/19/2011 04:11 AM, Liu Bo wrote: >> I''ve been working to try to improve the write-ahead log''s performance, >> and I found that the bottleneck addresses in the checksum items, >> especially when we want to make a random write on a large file, e.g a 4G file. >> >> Then a idea for this suggested by Chris is to use sub transaction ids and just >> to log the part of inode that had changed since either the last log commit or >> the last transaction commit. And as we also push the sub transid into the btree >> blocks, we''ll get much faster tree walks. As a result, we abandon the original >> brute force approach, which is "to delete all items of the inode in log", >> to making sure we get the most uptodate copies of everything, and instead >> we manage to "find and merge", i.e. finding extents in the log tree and merging >> in the new extents from the file. >> >> This patchset puts the above idea into code, and although the code is now more >> complex, it brings us a great deal of performance improvement. >> >> Beside the improvement of log, patch 8 fixes a small but critical bug of log code >> with sub transaction. >> >> Here I have some test results to show, I use sysbench to do "random write + fsync". >> >> ==>> sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run] >> ==>> >> Sysbench args: >> - Number of threads: 1 >> - Extra file open flags: 0 >> - 2 files, 4Gb each >> - Block size 4Kb >> - Number of random requests for random IO: 10000 >> - Read/Write ratio for combined random IO test: 1.50 >> - Periodic FSYNC enabled, calling fsync() each 100 requests. >> - Calling fsync() at the end of test, Enabled. >> - Using synchronous I/O mode >> - Doing random write test >> >> Sysbench results: >> ==>> Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total >> Read 0b Written 39.062Mb Total transferred 39.062Mb >> ==>> a) without patch: (*SPEED* : 451.01Kb/sec) >> 112.75 Requests/sec executed >> >> b) with patch: (*SPEED* : 4.3621Mb/sec) >> 1116.71 Requests/sec executed >> > > Have you run powerfail tests with this? I''d like to make sure you > haven''t inadvertently messed something up. Thanks, >Yes, I''ve done this before, and it has nothing serious but a few of "parent transid verify failed", just the same as Chris had mentioned in the thread. thanks, liubo> Josef >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from liubo''s message of 2011-05-25 06:21:04 -0400:> On 05/24/2011 11:56 PM, liubo wrote: > >> > The problems I hit: > >> > > >> > When an inode is dropped from cache (just via iput) and then read in > >> > again, the BTRFS_I(inode)->logged_trans goes back to zero. When this > >> > happens the logging code assumes the inode isn''t in the log and hits > >> > -EEXIST if it finds inode items. > >> > > > > > ok, I just find where the problem addresses. This is because I''ve put > > a check between logged_trans and transaction_id, which is inclined to > > filter those that are first logged, and I''m sorry for not taking the > > ''iput'' stuff into consideration. And it''s easy to fix this, as we > > can just kick this check off and put another check while searching > > ''BTRFS_INODE_ITEM_KEY'', since if we cannot find a inode item in a tree, > > it proves that this inode is definitely not in the tree. > > > > So I''d like to make some changes like this patch(_UNTEST_): > > I''ve thought of this problem more and came up with a _better and more efficient_ patch. > It will always get BTRFS_I(inode)->logged_trans correct value.Thanks, this makes sense.> > But I''m still trying to test it somehow... :Phttp://oss.oracle.com/~mason/synctest/synctest.c I used synctest -F -f -u -n 100 -t 32 . -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/23/2011 10:40 AM, Chris Mason wrote:> Excerpts from Chris Mason''s message of 2011-05-19 20:23:29 -0400: >> Excerpts from Liu Bo''s message of 2011-05-19 04:11:24 -0400: >>> Introduce a new concept "sub transaction", >>> the relation between transaction and sub transaction is >>> >>> transaction A ---> transid = x >>> sub trans a(1) ---> sub_transid = x+1 >>> sub trans a(2) ---> sub_transid = x+2 >>> ... ... >>> sub trans a(n-1) ---> sub_transid = x+n-1 >>> sub trans a(n) ---> sub_transid = x+n >>> transaction B ---> transid = x+n+1 >>> ... ... >>> >>> And the most important is >>> a) a trans handler''s transid now gets value from sub transid instead of transid. >>> b) when a transaction commits, transid may not added by 1, but depend on the >>> biggest sub_transaction of the last neighbour transaction, >>> i.e. >>> B->transid = a(n)->transid + 1, >>> (B->transid - A->transid) >= 1 >>> c) we start a new sub transaction after a fsync. >>> >>> We also ship some ''trans->transid'' to ''trans->transaction->transid'' to >>> ensure btrfs works well and to get rid of WARNings. >>> >>> These are used for the new log code. >> This is exactly what I had in mind. I need to read it harder and make >> sure it interacts well with the directory logging code, but I love it. > > Ok, I hit a few problems with this, and since the transids are used > everywhere for various reasons, I think we need to wait until 2.6.41. > This code is really very close to right, but we have the delayed inode > work, scrub, and the new inode number allocator all at once. I''d like > to limit the size of the changes. >I agree with this, in fact, I''m a litter worried cause it is such an important role that the transids are playing in btrfs, which means to change it is dangerous, so it deserves more test.> The problems I hit: > > When an inode is dropped from cache (just via iput) and then read in > again, the BTRFS_I(inode)->logged_trans goes back to zero. When this > happens the logging code assumes the inode isn''t in the log and hits > -EEXIST if it finds inode items. >ok, I just find where the problem addresses. This is because I''ve put a check between logged_trans and transaction_id, which is inclined to filter those that are first logged, and I''m sorry for not taking the ''iput'' stuff into consideration. And it''s easy to fix this, as we can just kick this check off and put another check while searching ''BTRFS_INODE_ITEM_KEY'', since if we cannot find a inode item in a tree, it proves that this inode is definitely not in the tree. So I''d like to make some changes like this patch(_UNTEST_): diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 912397c..69ddbbd 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2569,10 +2569,6 @@ static int prepare_for_merge_items(struct btrfs_trans_handle *trans, int i; int ret; - /* There are no relative items of the inode in log. */ - if (BTRFS_I(inode)->logged_trans < trans->transaction->transid) - return 0; - path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -2622,6 +2618,11 @@ static int prepare_for_merge_items(struct btrfs_trans_handle *trans, if (ret > 0) { btrfs_release_path(log, path); + + /* There are no relative items of the inode in log. */ + if (key.type == BTRFS_INODE_ITEM_KEY) + break; + continue; }> I patched it to just delete away all the logged items if the logged > transid wasn''t set, which is probably safest given that we can now reuse > inode numbers. > > Second, we use the generation number of the super to read in the log > tree root after a crash. This doesn''t always match the sub trans id and > so it doesn''t always match the transid stored in the btree blocks. > > There are a few solutions to this, we can use some of the reserved > fields in the super for the generation numbers of the roots the super > points to, and use whichever one is bigger when we read things in. >All right, I''m going to dig it more.> Liubo, since we''ll leave this one for .41, I''ll take your smaller patch > that just skips the csum items. >ok, I see. Thank a lot for the review. :) thanks, liubo> -chris >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/24/2011 11:56 PM, liubo wrote:>> > The problems I hit: >> > >> > When an inode is dropped from cache (just via iput) and then read in >> > again, the BTRFS_I(inode)->logged_trans goes back to zero. When this >> > happens the logging code assumes the inode isn''t in the log and hits >> > -EEXIST if it finds inode items. >> > > > ok, I just find where the problem addresses. This is because I''ve put > a check between logged_trans and transaction_id, which is inclined to > filter those that are first logged, and I''m sorry for not taking the > ''iput'' stuff into consideration. And it''s easy to fix this, as we > can just kick this check off and put another check while searching > ''BTRFS_INODE_ITEM_KEY'', since if we cannot find a inode item in a tree, > it proves that this inode is definitely not in the tree. > > So I''d like to make some changes like this patch(_UNTEST_):I''ve thought of this problem more and came up with a _better and more efficient_ patch. It will always get BTRFS_I(inode)->logged_trans correct value. But I''m still trying to test it somehow... :P Here it is: diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 40f6f8f..d22b3bf 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1769,12 +1769,9 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end) add_pending_csums(trans, inode, ordered_extent->file_offset, &ordered_extent->list); - ret = btrfs_ordered_update_i_size(inode, 0, ordered_extent); - if (!ret) { - ret = btrfs_update_inode(trans, root, inode); - BUG_ON(ret); - } else - btrfs_set_inode_last_trans(trans, inode); + btrfs_ordered_update_i_size(inode, 0, ordered_extent); + ret = btrfs_update_inode(trans, root, inode); + BUG_ON(ret); ret = 0; out: if (nolock) { diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 912397c..92fe5dd 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3032,6 +3032,37 @@ out: return ret; } +static int check_logged_trans(struct btrfs_trans_handle *trans, + struct btrfs_root *root, struct inode *inode) +{ + struct btrfs_inode_item *inode_item; + struct btrfs_path *path; + int ret; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + ret = btrfs_search_slot(trans, root, + &BTRFS_I(inode)->location, path, 0, 0); + if (ret) { + if (ret > 0) + ret = 0; + goto out; + } + + btrfs_unlock_up_safe(path, 1); + inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0], + struct btrfs_inode_item); + + BTRFS_I(inode)->logged_trans = btrfs_inode_transid(path->nodes[0], + inode_item); +out: + btrfs_free_path(path); + return ret; +} + + static int inode_in_log(struct btrfs_trans_handle *trans, struct inode *inode) { @@ -3084,6 +3115,18 @@ int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, if (ret) goto end_no_trans; + /* + * After we iput a inode and reread it from disk, logged_trans is 0. + * However, this inode _may_ still remain in log tree and not be + * committed yet. + * So we need to check the log tree to get logged_trans a right value. + */ + if (!BTRFS_I(inode)->logged_trans && root->log_root) { + ret = check_logged_trans(trans, root->log_root, inode); + if (ret) + goto end_no_trans; + } + if (inode_in_log(trans, inode)) { ret = BTRFS_NO_LOG_SYNC; goto end_no_trans; thanks, liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/24/2011 11:56 PM, liubo wrote:>> > >> > Second, we use the generation number of the super to read in the log >> > tree root after a crash. This doesn''t always match the sub trans id and >> > so it doesn''t always match the transid stored in the btree blocks. >> > >> > There are a few solutions to this, we can use some of the reserved >> > fields in the super for the generation numbers of the roots the super >> > points to, and use whichever one is bigger when we read things in. >> > > > All right, I''m going to dig it more. >I''ve got this resolved via ''log_root_transid'' of ''struct btrfs_super_block'', and it looks nice on both syntactic and functional side. :) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ac8d2ac..1006898 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2103,6 +2103,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, if (btrfs_super_log_root(disk_super) != 0 && !(fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR)) { u64 bytenr = btrfs_super_log_root(disk_super); + u64 log_root_transid = btrfs_super_log_root_transid(disk_super); if (fs_devices->rw_devices == 0) { printk(KERN_WARNING "Btrfs log replay required " @@ -2125,7 +2126,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, log_tree_root->node = read_tree_block(tree_root, bytenr, blocksize, - generation + 1); + log_root_transid); ret = btrfs_recover_log_trees(log_tree_root); BUG_ON(ret); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 912397c..b304ec1 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2089,6 +2089,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, log_root_tree->node->start); btrfs_set_super_log_root_level(&root->fs_info->super_for_commit, btrfs_header_level(log_root_tree->node)); + btrfs_set_super_log_root_transid(&root->fs_info->super_for_commit, + trans->transid); log_root_tree->log_batch = 0; log_root_tree->log_transid++; thanks, liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html